Title: | Easily Harvest (Scrape) Web Pages |
---|---|
Description: | Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML. |
Authors: | Hadley Wickham [aut, cre], Posit Software, PBC [cph, fnd] |
Maintainer: | Hadley Wickham <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.4.9000 |
Built: | 2024-10-25 16:18:03 UTC |
Source: | https://github.com/tidyverse/rvest |
html_attr() gets a single attribute; html_attrs() gets all attributes.
html_attr(x, name, default = NA_character_)
html_attrs(x)
x |
A document (from |
name |
Name of attribute to retrieve. |
default |
A string used as a default value when the attribute does not exist in every element. |
A character vector (for html_attr()) or list (html_attrs()) the same length as x.
html <- minimal_html('<ul>
  <li><a href="https://a.com" class="important">a</a></li>
  <li class="active"><a href="https://c.com">b</a></li>
  <li><a href="https://c.com">b</a></li>
  </ul>')

html %>% html_elements("a") %>% html_attrs()

html %>% html_elements("a") %>% html_attr("href")
html %>% html_elements("li") %>% html_attr("class")
html %>% html_elements("li") %>% html_attr("class", default = "inactive")
Get element children
html_children(x)
x |
A document (from |
html <- minimal_html("<ul><li>1<li>2<li>3</ul>")
ul <- html_elements(html, "ul")
html_children(ul)

html <- minimal_html("<p>Hello <b>Hadley</b><i>!</i>")
p <- html_elements(html, "p")
html_children(p)
html_element() and html_elements() find HTML elements using CSS selectors or XPath expressions. CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need.
html_element(x, css, xpath)
html_elements(x, css, xpath)
x |
Either a document, a node set or a single node. |
css, xpath |
Elements to select. Supply one of |
html_element() returns a nodeset the same length as the input. html_elements() flattens the output so there's no direct way to map the output to the input.
CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.
It implements the majority of CSS3 selectors, as described in https://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:
Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type
It supports :contains(text)
You can use !=, [foo!=bar] is the same as :not([foo=bar])
:not() accepts a sequence of simple selectors, not just a single simple selector.
html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html %>% html_element("h1")
html %>% html_elements("p")
html %>% html_elements(".important")
html %>% html_elements("#first")

# html_element() vs html_elements() --------------------------------------
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")
li <- html %>% html_elements("li")

# When applied to a node set, html_elements() returns all matching elements
# beneath any of the inputs, flattening results into a new node set.
li %>% html_elements("i")

# When applied to a node set, html_element() always returns a vector the
# same length as the input, using a "missing" element where needed.
li %>% html_element("i")
# and html_text() and html_attr() will return NA
li %>% html_element("i") %>% html_text2()
li %>% html_element("span") %>% html_attr("class")
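The selector extensions listed among the exceptions above can be tried on the same heading-and-paragraph fixture. The following is a minimal sketch, not part of the package examples; the variable names are illustrative only.

```r
library(rvest)

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

# :contains() keeps elements whose text includes the given string
with_text <- html %>% html_elements("p:contains('important')")

# [class!=important] is documented as equivalent to :not([class=important]),
# so it also keeps elements that have no class attribute at all
not_important <- html %>% html_elements("p[class!=important]")

# :not() accepts a sequence of simple selectors (here a tag plus an id)
not_first <- html %>% html_elements("p:not(p#first)")

length(with_text)
html_attr(not_important, "id")
html_attr(not_first, "class")
```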
html_encoding_guess() helps you handle web pages that declare an incorrect encoding. Use html_encoding_guess() to generate a list of possible encodings, then try each out by using the encoding argument of read_html().
html_encoding_guess() replaces the deprecated guess_encoding().
html_encoding_guess(x)
x |
A character vector. |
# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)
x %>% html_elements("p") %>% html_text()

html_encoding_guess(x)
# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") %>% html_elements("p") %>% html_text()
read_html(path, encoding = "ISO-8859-2") %>% html_elements("p") %>% html_text()
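The "try each candidate" step can also be written as a loop. This is a rough sketch that assumes the result of html_encoding_guess() has an encoding column (as its printed output suggests) and that the stringi package is installed; the names guesses, candidates and texts are illustrative.

```r
library(rvest)

path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
guesses <- html_encoding_guess(read_html(path))

# Assumption: the guesses are returned with an `encoding` column
candidates <- head(guesses$encoding, 3)

# Re-read the file once per candidate; an encoding the parser rejects
# is recorded as NA instead of stopping the loop
texts <- vapply(candidates, function(enc) {
  tryCatch(
    read_html(path, encoding = enc) %>%
      html_elements("p") %>%
      html_text() %>%
      paste(collapse = " "),
    error = function(e) NA_character_
  )
}, character(1))

texts
```

Comparing the resulting strings by eye is still the deciding step, since several encodings can parse without error while only one renders the text correctly.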
Use html_form() to extract a form, set values with html_form_set(), and submit it with html_form_submit().
html_form(x, base_url = NULL)
html_form_set(form, ...)
html_form_submit(form, submit = NULL)
x |
A document (from |
base_url |
Base url of underlying HTML document. The default, |
form |
A form |
... |
< Provide a character vector to set multiple checkboxes in a set or select multiple values from a multi-select. |
submit |
Which button should be used to submit the form?
|
html_form() returns an S3 object with class rvest_form when applied to a single element. It returns a list of rvest_form objects when applied to multiple elements or a document.
html_form_set() returns an rvest_form object.
html_form_submit() submits the form, returning an httr response which can be parsed with read_html().
HTML 4.01 form specification: https://www.w3.org/TR/html401/interact/forms.html
html <- read_html("http://www.google.com")
search <- html_form(html)[[1]]

search <- search %>% html_form_set(q = "My little pony", hl = "fr")

# Or if you have a list of values, use !!!
vals <- list(q = "web scraping", hl = "en")
search <- search %>% html_form_set(!!!vals)

# To submit and get result:
## Not run:
resp <- html_form_submit(search)
read_html(resp)

## End(Not run)
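The extract-and-set steps can also be exercised without network access. The sketch below builds a hypothetical form inline; the field names q and hl, the example.com base URL, and the assumption that each field's current value is stored under form$fields are illustrative rather than taken from the documentation above.

```r
library(rvest)

# Hypothetical form, built inline so no network access is needed
page <- minimal_html('
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="text" name="hl" value="en">
    <input type="submit" value="Search">
  </form>
')

# base_url is supplied explicitly because an inline document has no URL
# against which the relative action could be resolved
form <- html_form(page, base_url = "https://example.com")[[1]]
form <- form %>% html_form_set(q = "web scraping", hl = "fr")

# Assumption: each field keeps its current value in form$fields
form$fields$q$value
form$fields$hl$value
```

Submitting the form with html_form_submit() would perform a real HTTP request, so it is left out here.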
Get element name
html_name(x)
x |
A document (from |
A character vector the same length as x
url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

html %>%
  html_element("div") %>%
  html_children() %>%
  html_name()
The algorithm mimics what a browser does, but repeats the values of merged cells in every cell that they cover.
html_table(
  x,
  header = NA,
  trim = TRUE,
  fill = deprecated(),
  dec = ".",
  na.strings = "NA",
  convert = TRUE
)
x |
A document (from |
header |
Use first row as header? If If |
trim |
Remove leading and trailing whitespace within each cell? |
fill |
Deprecated - missing cells in tables are now always
automatically filled with |
dec |
The character used as decimal place marker. |
na.strings |
Character vector of values that will be converted to |
convert |
If |
When applied to a single element, html_table() returns a single tibble. When applied to multiple elements or a document, html_table() returns a list of tibbles.
sample1 <- minimal_html("<table>
  <tr><th>Col A</th><th>Col B</th></tr>
  <tr><td>1</td><td>x</td></tr>
  <tr><td>4</td><td>y</td></tr>
  <tr><td>10</td><td>z</td></tr>
</table>")
sample1 %>%
  html_element("table") %>%
  html_table()

# Values in merged cells will be duplicated
sample2 <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td>1</td><td>2</td><td>3</td></tr>
  <tr><td colspan='2'>4</td><td>5</td></tr>
  <tr><td>6</td><td colspan='2'>7</td></tr>
</table>")
sample2 %>%
  html_element("table") %>%
  html_table()

# If a row is missing cells, they'll be filled with NAs
sample3 <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan='2'>1</td><td>2</td></tr>
  <tr><td colspan='2'>3</td></tr>
  <tr><td>4</td></tr>
</table>")
sample3 %>%
  html_element("table") %>%
  html_table()
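The dec, na.strings and convert arguments are not covered by the examples above. The following sketch uses a made-up price table (the column names and values are illustrative) to show how they interact.

```r
library(rvest)

# Hypothetical table using a comma as decimal mark and "n/a" for missing values
prices <- minimal_html("<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>tea</td><td>1,5</td></tr>
  <tr><td>milk</td><td>n/a</td></tr>
</table>")

# With conversion, the price column is parsed as a number
converted <- prices %>%
  html_element("table") %>%
  html_table(dec = ",", na.strings = "n/a")

# With convert = FALSE, every column is left as the text found in the page
raw <- prices %>%
  html_element("table") %>%
  html_table(convert = FALSE)

str(converted$price)
str(raw$price)
```

Turning conversion off is a reasonable choice when the automatic guess would lose information, for example identifiers with leading zeros.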
There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() which returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.
html_text2() is usually what you want, but it is much slower than html_text() so for simple applications where performance is important you may want to use html_text() instead.
html_text(x, trim = FALSE)
html_text2(x, preserve_nbsp = FALSE)
x |
A document, node, or node set. |
trim |
If |
preserve_nbsp |
Should non-breaking spaces be preserved? By default,
|
A character vector the same length as x
# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
x2
# But aren't actually the same:
x1 == x2
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
charToRaw(x2)
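The preserve_nbsp argument reverses the default conversion of non-breaking spaces. A minimal sketch, with illustrative variable names:

```r
library(rvest)

html <- minimal_html("<p>x&nbsp;y</p>")
p <- html %>% html_element("p")

# Default: the non-breaking space becomes a regular space
converted <- html_text2(p)

# preserve_nbsp = TRUE keeps the original U+00A0 character
kept <- html_text2(p, preserve_nbsp = TRUE)

converted == "x y"
kept == "x\u00a0y"
```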
You construct a LiveHTML object with read_html_live() and then interact, like you're a human, using the methods described below. When debugging a scraping script it is particularly useful to use $view(), which will open a live preview of the site, and you can actually see each of the operations performed on the real site.
rvest provides relatively simple methods for scrolling, typing, and clicking. For richer interaction, you probably want to use a package that exposes a more powerful user interface, like selenider.
session
Underlying chromote session object. For expert use only.
new()
initialize the object
LiveHTML$new(url)
url
URL to page.
print()
Called when print()ed
LiveHTML$print(...)
...
Ignored
view()
Display a live view of the site
LiveHTML$view()
html_elements()
Extract HTML elements from the current page.
LiveHTML$html_elements(css, xpath)
css, xpath
CSS selector or xpath expression.
click()
Simulate a click on an HTML element.
LiveHTML$click(css, n_clicks = 1)
css
CSS selector or xpath expression.
n_clicks
Number of clicks
get_scroll_position()
Get the current scroll position.
LiveHTML$get_scroll_position()
scroll_into_view()
Scroll selected element into view.
LiveHTML$scroll_into_view(css)
css
CSS selector or xpath expression.
scroll_to()
Scroll to specified location
LiveHTML$scroll_to(top = 0, left = 0)
top, left
Number of pixels from top/left respectively.
scroll_by()
Scroll by the specified amount
LiveHTML$scroll_by(top = 0, left = 0)
top, left
Number of pixels to scroll up/down and left/right respectively.
type()
Type text in the selected element
LiveHTML$type(css, text)
css
CSS selector or xpath expression.
text
A single string containing the text to type.
press()
Simulate pressing a single key (including special keys).
LiveHTML$press(css, key_code, modifiers = character())
css
CSS selector or xpath expression. Set to NULL
key_code
Name of key. You can see a complete list of known keys at https://pptr.dev/api/puppeteer.keyinput/.
modifiers
A character vector of modifiers. Must be one or more of "Shift", "Control", "Alt", or "Meta".
clone()
The objects of this class are cloneable with this method.
LiveHTML$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run:
# To retrieve data for this paginated site, we need to repeatedly push
# the "Load More" button
sess <- read_html_live("https://www.bodybuilding.com/exercises/finder")
sess$view()

sess %>% html_elements(".ExResult-row") %>% length()
sess$click(".ExLoadMore-btn")
sess %>% html_elements(".ExResult-row") %>% length()
sess$click(".ExLoadMore-btn")
sess %>% html_elements(".ExResult-row") %>% length()

## End(Not run)
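The typing, key-press and scrolling methods can be combined in the same way as $click(). The helper below is a hypothetical sketch: search_live(), its arguments, and the 1000 pixel scroll distance are assumptions made for illustration, and a real script would probably need to pause while results load. It only defines the function, because running it requires Chrome and a live site.

```r
library(rvest)

# Hypothetical helper: the selectors and the scroll distance are
# placeholders, not values taken from any real site
search_live <- function(url, box_css, query, result_css) {
  sess <- read_html_live(url)
  sess$type(box_css, query)      # type the query into the search box
  sess$press(box_css, "Enter")   # submit it with the Enter key
  sess$scroll_by(top = 1000)     # scroll down to reveal more results
  sess %>% html_elements(result_css)
}
```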
read_html() works by performing an HTTP request then parsing the HTML received using the xml2 package. This is "static" scraping because it operates only on the raw HTML file. While this works for most sites, in some cases you will need to use read_html_live() if the parts of the page you want to scrape are dynamically generated with JavaScript.
Generally, we recommend using read_html() if it works, as it will be faster and more robust, as it has fewer external dependencies (i.e. it doesn't rely on the Chrome web browser installed on your computer).
read_html(x, encoding = "", ..., options = c("RECOVER", "NOERROR", "NOBLANKS"))
x |
Usually a string representing a URL. See |
encoding |
Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default. |
... |
Additional arguments passed on to methods. |
options |
Set parsing options for the libxml2 parser. Zero or more of
|
# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>%
  html_element("h2") %>%
  html_text2()
title

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>%
  html_element("h2") %>%
  html_attr("data-id") %>%
  readr::parse_integer()
episode
read_html() operates on the HTML source code downloaded from the server. This works for most websites but can fail if the site uses JavaScript to generate the HTML. read_html_live() provides an alternative interface that runs a live web browser (Chrome) in the background. This allows you to access elements of the HTML page that are generated dynamically by JavaScript and to interact with the live page by clicking on buttons or typing in forms.
Behind the scenes, this function uses the chromote package, which requires that you have a copy of Google Chrome installed on your machine.
read_html_live(url)
url |
Website url to read from. |
read_html_live() returns an R6 LiveHTML object. You can interact with this object using the usual rvest functions, or call its methods, like $click(), $scroll_to(), and $type() to interact with the live page like a human would.
## Not run:
# When we retrieve the raw HTML for this site, it doesn't contain the
# data we're interested in:
static <- read_html("https://www.forbes.com/top-colleges/")
static %>% html_elements(".TopColleges2023_tableRow__BYOSU")

# Instead, we need to run the site in a real web browser, causing it to
# download a JSON file and then dynamically generate the html:

sess <- read_html_live("https://www.forbes.com/top-colleges/")
sess$view()
rows <- sess %>% html_elements(".TopColleges2023_tableRow__BYOSU")
rows %>% html_element(".TopColleges2023_organizationName__J1lEV") %>% html_text()
rows %>% html_element(".grant-aid") %>% html_text()

## End(Not run)
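The advice to prefer read_html() when it works can be expressed as a fallback. The helper scrape_elements() below is hypothetical, a sketch of one possible approach: it tries the static parser first and only starts a live browser when the selector matches nothing. The usage lines read a local temporary file, so only the static branch runs.

```r
library(rvest)

# Hypothetical helper: try static scraping first, fall back to a live browser
scrape_elements <- function(x, css) {
  found <- read_html(x) %>% html_elements(css)
  if (length(found) > 0) {
    return(found)
  }
  # Nothing matched in the raw HTML, so retry in a live browser
  # (this branch needs Chrome and the chromote package)
  read_html_live(x) %>% html_elements(css)
}

# Usage with a local file, which only exercises the static branch
path <- tempfile(fileext = ".html")
writeLines("<html><body><p>Hello</p></body></html>", path)
static_hits <- scrape_elements(path, "p")
html_text(static_hits)
```

An empty result is only a rough signal: a selector with a typo would also trigger the slower live branch, so the check is best treated as a convenience rather than a guarantee.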
This set of functions allows you to simulate a user interacting with a website, using forms and navigating from page to page.
Create a session with session(url)
Navigate to a specified url with session_jump_to(), or follow a link on the page with session_follow_link().
Submit an html_form with session_submit().
View the history with session_history() and navigate back and forward with session_back() and session_forward().
Extract page contents with html_element() and html_elements(), or get the complete HTML document with read_html().
Inspect the HTTP response with httr::cookies(), httr::headers(), and httr::status_code().
session(url, ...)
is.session(x)
session_jump_to(x, url, ...)
session_follow_link(x, i, css, xpath, ...)
session_back(x)
session_forward(x)
session_history(x)
session_submit(x, form, submit = NULL, ...)
url |
A URL, either relative or absolute, to navigate to. |
... |
Any additional httr config to use throughout the session. |
x |
A session. |
i |
An integer to select the ith link or a string to match the first link containing that text (case sensitive). |
css, xpath |
Elements to select. Supply one of |
form |
An html_form to submit |
submit |
Which button should be used to submit the form?
|
s <- session("http://hadley.nz")
s %>%
  session_jump_to("hadley.jpg") %>%
  session_jump_to("/") %>%
  session_history()

s %>%
  session_jump_to("hadley.jpg") %>%
  session_back() %>%
  session_history()

s %>%
  session_follow_link(css = "p a") %>%
  html_elements("p")
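The examples above do not show session_submit(). The helper below is a hypothetical sketch of submitting a form within a session: submit_first_form() and its arguments are made up for illustration, and it assumes the first form on the page contains the fields named in values. It only defines the function, since calling it performs real HTTP requests.

```r
library(rvest)

# Hypothetical helper: assumes the first form on the page contains
# the fields named in `values` (a named list)
submit_first_form <- function(url, values) {
  s <- session(url)
  form <- html_form(s)[[1]]
  form <- html_form_set(form, !!!values)
  # Returns the session after navigating to the form's response
  session_submit(s, form)
}

# Possible call (not run here because it needs network access):
# submit_first_form("http://www.google.com", list(q = "web scraping"))
```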