
The htmldf package contains a single function, html_df(), which accepts a vector of URLs and, for each one, attempts to download the page and then extract and parse its HTML. The result is returned as a tibble in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, such as the page title, language, links, RSS feed, images, social profiles, and inferred code language.
To install the package:
To use html_df():
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
"https://www.tensorflow.org/tutorials/images/cnn",
"https://www.robertmylesmcdonnell.com/content/posts/mtcars/")
z <- html_df(urlx, show_progress = FALSE)
z
## # A tibble: 3 x 15
## url title lang url2 links rss images social code_lang size server
## <chr> <chr> <chr> <chr> <lis> <chr> <list> <list> <dbl> <int> <chr>
## 1 http… Visu… en http… <tib… http… <tibb… <tibb… 1 38445 GitHu…
## 2 http… Conv… en http… <tib… <NA> <tibb… <tibb… -0.936 110231 Googl…
## 3 http… Robe… en http… <tib… <NA> <tibb… <tibb… 1 291099 Netli…
## # … with 4 more variables: accessed <dttm>, published <dttm>, generator <chr>,
## # source <chr>
Page titles
## # A tibble: 3 x 2
## title url2
## <chr> <chr>
## 1 Visualising Tour De France Data I… https://alastairrushworth.github.io/Visual…
## 2 Convolutional Neural Network (CNN… https://www.tensorflow.org/tutorials/image…
## 3 Robert Myles McDonnell https://www.robertmylesmcdonnell.com/conte…
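To give a rough idea of the kind of extraction behind columns such as title and lang, here is a minimal sketch that uses the xml2 package (assumed to be installed). It is not htmldf's actual implementation, and the inline HTML string is a hypothetical stand-in for a downloaded page.

```r
library(xml2)

# A small inline HTML document stands in for a downloaded page (hypothetical content)
page <- read_html(
  "<html lang='en'><head><title>Example page</title></head><body><p>Hello</p></body></html>"
)

# Extract the <title> text and the lang attribute of the root <html> element
page_title <- xml_text(xml_find_first(page, "//title"))
page_lang  <- xml_attr(xml_find_first(page, "/html"), "lang")

print(page_title)
print(page_lang)
```

Real pages are messier than this, so missing titles or language attributes would need to be handled, most simply by returning NA.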
RSS feeds
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA
## [3] NA
Social profiles
## [[1]]
## # A tibble: 3 x 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @rushworth_a https://twitter.com/rushworth_a
## 2 linkedin @alastair-rushworth-2531… https://linkedin.com/in/alastair-rushworth…
## 3 github @alastairrushworth https://github.com/alastairrushworth
##
## [[2]]
## # A tibble: 1 x 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @tensorflow https://twitter.com/tensorflow
##
## [[3]]
## # A tibble: 4 x 3
## site handle profile
## <chr> <chr> <chr>
## 1 twitter @robertmylesmc https://twitter.com/robertmylesmc
## 2 linkedin @robert-mcdonnell-7475b… https://linkedin.com/in/robert-mcdonnell-74…
## 3 github @coolbutuseless https://github.com/coolbutuseless
## 4 github @robertmyles https://github.com/robertmyles
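Because the social column is a list with one table per page, it can be convenient to stack the tables into a single long table keyed by URL. The sketch below uses base R only. The `social` and `page_url` objects are hand-built stand-ins for `z$social` and `z$url`, filled with a few untruncated rows from the output above, so no download is needed.

```r
# Stand-in for z$social: one data frame per page (rows copied from the output above)
social <- list(
  data.frame(site = c("twitter", "github"),
             handle = c("@rushworth_a", "@alastairrushworth"),
             stringsAsFactors = FALSE),
  data.frame(site = "twitter", handle = "@tensorflow", stringsAsFactors = FALSE)
)

# Stand-in for z$url: the page each table came from
page_url <- c(
  "https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
  "https://www.tensorflow.org/tutorials/images/cnn"
)

# Tag each per-page table with its URL, then stack everything into one table
tagged <- Map(
  function(df, u) data.frame(url = u, df, stringsAsFactors = FALSE),
  social, page_url
)
all_profiles <- do.call(rbind, tagged)

print(all_profiles)
```

With real output, the same pattern should work on `z$social` and `z$url` directly, assuming every element of the list has the same columns.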
Inferred code language (near 1 = R; near -1 = Python)
## # A tibble: 3 x 2
## code_lang url2
## <dbl> <chr>
## 1 1 https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2 -0.936 https://www.tensorflow.org/tutorials/images/cnn
## 3 1 https://www.robertmylesmcdonnell.com/content/posts/mtcars/
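If a categorical label is more useful than the raw score, one simple option is to split on the sign of code_lang. The cut-off at zero is an assumption made for illustration, not something the package defines. The values are copied from the output above.

```r
# code_lang values copied from the output above
code_lang <- c(1, -0.936, 1)

# Assumed rule: positive scores read as R, negative as Python,
# and zero or missing scores are left undetermined (NA)
label <- ifelse(
  is.na(code_lang) | code_lang == 0,
  NA_character_,
  ifelse(code_lang > 0, "R", "Python")
)

print(label)
```

A stricter rule, for example one requiring an absolute score above some threshold, may be preferable when pages mix both languages.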
Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.