---
title: "Web scraping 101"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Web scraping 101}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(comment = "#>", collapse = TRUE)
```

This vignette introduces you to the basics of web scraping with rvest. You'll first learn the basics of HTML and how to use CSS selectors to refer to specific elements, then you'll learn how to use rvest functions to get data out of HTML and into R.

```{r}
library(rvest)
```

## HTML basics

HTML stands for "HyperText Markup Language" and looks like this:

``` {.html}
<p>Some text &amp; <b>some bold text.</b></p>
```

HTML has a hierarchical structure formed by **elements**, which consist of a start tag (e.g. `<tag>`), optional attributes, an end tag (e.g. `</tag>`), and contents (everything between the start and end tags).

Elements are named after their tags, such as `<p>` (paragraph) and `<b>` (bold). Most elements can have contents between their start and end tags, and those contents can be text or more elements. For example, the following HTML contains a paragraph of text with one word in bold:

``` {.html}
<p>
  Hi! My <b>name</b> is Hadley.
</p>
```

The **children** of a node are only its elements, so the `<p>` element above has one child, the `<b>` element. The `<b>` element has no children, but it does have contents (the text "name").

Some elements, like `<img>`, can't have children. These elements depend solely on attributes for their behavior.

### Attributes

Tags can have named **attributes** which look like `name1='value1' name2='value2'`. Two of the most important attributes are `id` and `class`, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page. These are often useful when scraping data off a page.

## Reading HTML with rvest

You'll usually start the scraping process with `read_html()`. This returns an `xml_document`[^2] object which you'll then manipulate using rvest functions:

[^2]: This class comes from the [xml2](https://xml2.r-lib.org) package. xml2 is a low-level package that rvest builds on top of.

```{r}
html <- read_html("http://rvest.tidyverse.org/")
class(html)
```

For examples and experimentation, rvest also includes a function that lets you create an `xml_document` from literal HTML:

```{r}
html <- minimal_html("
  <p>This is a paragraph</p>
")
```

## CSS selectors

CSS selectors define patterns for locating HTML elements, which makes them a concise way to describe the elements you want to extract. The most important selectors are:

- `p`: selects all `<p>` elements.
- `.title`: selects all elements with `class` "title".
- `p.special`: selects all `<p>` elements with `class` "special".
- `#title`: selects the element with the `id` attribute that equals "title". Id attributes must be unique within a document, so this will only ever select a single element.

If you want to learn more CSS selectors I recommend starting with the fun [CSS Diner](https://flukeout.github.io/) tutorial and then referring to the [MDN web docs](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors).

Let's try out the most important selectors with a simple example:

```{r}
html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")
```

In rvest you can extract a single element with `html_element()` or all matching elements with `html_elements()`. Both functions take a document[^3] and a CSS selector:

[^3]: Or another element, more on that shortly.

```{r}
html %>% html_element("h1")
html %>% html_elements("p")
html %>% html_elements(".important")
html %>% html_elements("#first")
```

Selectors can also be combined in various ways using **combinators**. The most important combinator is " ", the **descendant** combinator: `p a` selects all `<a>` elements that are descendants of a `<p>` element.

If you don't know exactly what selector you need, I highly recommend using [SelectorGadget](https://rvest.tidyverse.org/articles/selectorgadget.html), which lets you automatically generate the selector you need by supplying positive and negative examples in the browser.

## Extracting data

Now that you've got the elements you care about, you'll need to get data out of them. You'll usually get the data from either the text contents or an attribute. But, sometimes (if you're lucky!), the data you need will be in an HTML table.

### Text

Use `html_text2()` to extract the plain text contents of an HTML element:

```{r}
html <- minimal_html("
  <p>
  This is
  a
  paragraph.</p><p>This is another paragraph.

  It has two sentences.</p>
")
```

`html_text2()` gives you what you expect: two paragraphs of text separated by a blank line.

```{r}
html %>% html_element("body") %>% html_text2() %>% cat()
```

Whereas `html_text()` returns the garbled raw underlying text:

```{r}
html %>% html_element("body") %>% html_text() %>% cat()
```

### Attributes

Attributes are used to record the destination of links (the `href` attribute of `<a>` elements) and the source of images (the `src` attribute of the `<img>` element):

```{r}
# Illustrative markup: the link and image URLs are placeholders
html <- minimal_html("
  <p><a href='https://example.com/cats'>cats</a></p>
  <img src='https://example.com/cat.png' width='100' height='200'>
")
```

The value of an attribute can be retrieved with `html_attr()`:

```{r}
html %>% html_elements("a") %>% html_attr("href")
html %>% html_elements("img") %>% html_attr("src")
```

Note that `html_attr()` always returns a string, so you may need to post-process with `as.integer()`/`readr::parse_integer()` or similar.

```{r}
html %>% html_elements("img") %>% html_attr("width")
html %>% html_elements("img") %>% html_attr("width") %>% as.integer()
```

### Tables

HTML tables are composed of four main elements: `<table>`, `<tr>` (table row), `<th>` (table heading), and `<td>` (table data).
Here's a simple HTML table with two columns and three rows:
```{r}
html <- minimal_html("
|
---|