There are times when we need to scrape data from a website to use in our analyses. In a perfect world, the provider of the data would offer a csv or json download, but let’s face it… we do not live in a perfect world, and data is often posted to the internet by people who do not care about its subsequent use by others (otherwise they would have made it easy to use rather than just displaying it).
Here is a quick tutorial on one method I use to do this.
For this example, I’m going to use the rvest and dplyr libraries to scrape a USDA page for the county FIPS codes1 of all the counties in Virginia.
if( !("tidyverse" %in% installed.packages()) ) {
  install.packages("tidyverse")
}
if( !("rvest" %in% installed.packages()) ) {
  install.packages("rvest")
}
library( rvest )
Loading required package: xml2
library( tidyverse )
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x readr::guess_encoding() masks rvest::guess_encoding()
x dplyr::lag() masks stats::lag()
x purrr::pluck() masks rvest::pluck()
To start, grab the URL of the page you intend to scrape. This one has a pretty straightforward structure.
[Screenshot: USDA website]
url <- "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697"
page <- read_html(url)
page
{html_document}
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
[1] <head>\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n< ...
[2] <body style="">\t\t \t\t \r\n<a href="#actualContent" accesskey ...
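While developing a scraper it is polite to hit the server only once and work from a local copy on subsequent runs. A minimal sketch of that idea (the cache filename here is my own choice, not part of the original workflow):

```r
library( rvest )

url <- "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697"

# Cache the page locally so repeated runs don't re-request it.
cache_file <- "usda_fips_page.html"   # hypothetical local filename
if( !file.exists(cache_file) ) {
  download.file( url, cache_file, quiet = TRUE )
}

# read_html() accepts a file path as well as a URL.
page <- read_html( cache_file )
```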
Now we can use some tidy methods to get at the components. We will go through the nodes, grab the “body” component, then show the children of the body.
page %>%
html_node("body") %>%
html_children()
{xml_nodeset (2)}
[1] <a href="#actualContent" accesskey="c" style="position:absolute;color:#9B ...
[2] <table id="wptheme_pageArea" summary="This table is used for page layout" ...
The page has been broken up into components that we can search. To figure out which parts we are going to grab data from, we need to look at the raw code of the document. Your browser can show this, and you’ll have to look through the components manually. Here is what this page looks like.
[Screenshot: HTML content of the page]
Here you notice that the table containing the data we are interested in has a class equal to data. We can grab all the table components from the page and look to see which one we want.
page %>%
xml_find_all(".//table")
{xml_nodeset (26)}
[1] <table id="wptheme_pageArea" summary="This table is used for page layout ...
[2] <table class="layoutRow" cellpadding="0" cellspacing="0" width="100%"><t ...
[3] <table summary="This table is used for page layout" class="themetable" b ...
[4] <table class="layoutRow" cellpadding="0" cellspacing="0" width="100%"><t ...
[5] <table summary="This table is used for page layout" width="100%" border= ...
[6] <table class="layoutRow" cellpadding="0" cellspacing="0" width="100%"><t ...
[7] <table summary="This table is used for page layout" class="themetable" b ...
[8] <table summary="This table is used for page layout" width="100%" border= ...
[9] <table class="layoutRow" cellpadding="0" cellspacing="0" width="100%"><t ...
[10] <table summary="This table is used for page layout" class="themetable" b ...
[11] <table summary="This table is used for page layout" width="100%" border= ...
[12] <table summary="This table is used for page layout" width="100%" border= ...
[13] <table summary="This table is used for page layout" width="100%" border= ...
[14] <table summary="This table is used for page layout" width="100%" border= ...
[15] <table summary="This table is used for page layout" width="100%" border= ...
[16] <table summary="This table is used for page layout" width="100%" border= ...
[17] <table summary="This table is used for page layout" width="100%" border= ...
[18] <table summary="This table is used for page layout" class="themetable" b ...
[19] <table summary="This table is used for page layout" class="themetable" b ...
[20] <table summary="This table is used for page layout" width="100%" border= ...
...
There are 26 total tables on this page! But it looks like the first 20 do not contain the table of interest. Here are the last ones.
page %>%
xml_find_all(".//table") %>%
tail()
{xml_nodeset (6)}
[1] <table border="0" cellpadding="6" cellspacing="1" class="data"><tbody>\n< ...
[2] <table class="layoutRow" cellpadding="0" cellspacing="0" width="100%"><tr ...
[3] <table summary="This table is used for page layout" class="themetable" bo ...
[4] <table summary="This table is used for page layout" class="themetable" bo ...
[5] <table summary="This table is used for page layout" width="100%" border=" ...
[6] <table summary="This table is used for page layout" class="themetable" bo ...
And we see that the data table is the one with class="data", which differentiates it from the rest. We can use this to pull out just the table with the attribute class = "data" using XML search techniques. The syntax is a bit odd, as it is based upon XPath search syntax, but you can get the notion.
page %>%
xml_find_all(".//table") %>%
xml_find_all("//table[contains(@class, 'data')]")
{xml_nodeset (1)}
[1] <table border="0" cellpadding="6" cellspacing="1" class="data"><tbody>\n< ...
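If the XPath feels clunky, rvest can do the same selection with a CSS selector. Assuming only one table on the page carries the data class, html_node() with "table.data" should land on the same node:

```r
library( rvest )

url <- "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697"
page <- read_html( url )

# Equivalent selection using a CSS selector instead of XPath:
# "table.data" matches <table> elements with class "data".
page %>%
  html_node( "table.data" )
```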
Now we can turn that table into a data.frame and format the columns appropriately2.
page %>%
xml_find_all(".//table") %>%
xml_find_all("//table[contains(@class, 'data')]") %>%
html_table( fill = TRUE ) %>%
.[[1]] %>%
mutate( Name = factor(Name), State = factor(State) ) -> raw_data
summary( raw_data )
FIPS Name State
Min. : 1001 Washington: 32 TX : 254
1st Qu.:19046 Jefferson : 26 GA : 159
Median :30044 Franklin : 25 VA : 136
Mean :31573 Jackson : 24 KY : 120
3rd Qu.:46136 Lincoln : 24 MO : 115
Max. :78030 Madison : 20 KS : 105
(Other) :3081 (Other):2343
PERFECT!! Now, to pull the parts that are relevant to us, the ones from Virginia.
raw_data %>%
filter( State == "VA" ) %>%
select(Name, FIPS) %>%
droplevels()
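If you want to keep the result around for later analyses, you can write it out with readr (the output filename here is just an example):

```r
library( tidyverse )

# Assumes raw_data was built as above, with Name, State, and FIPS columns.
raw_data %>%
  filter( State == "VA" ) %>%
  select( Name, FIPS ) %>%
  droplevels() -> va_fips

# Save for later use (filename is arbitrary).
write_csv( va_fips, "va_county_fips.csv" )
```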
And there you go.
Federal Information Processing Standards (FIPS), now known as Federal Information Processing Series, are numeric codes assigned by the National Institute of Standards and Technology (NIST). Typically, FIPS codes deal with US states and counties. US states are identified by a 2-digit number, while US counties are identified by a 3-digit number. For example, the FIPS code 51159 represents Richmond County (159) in Virginia (51).↩︎
The .[[1]] part is how we grab just the first element from the list of tables that html_table() returns.↩︎
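An equivalent, arguably more readable, way to grab that first element is purrr::pluck() (note from the conflicts message above that purrr’s pluck() masks rvest’s, hence the explicit namespace):

```r
library( rvest )
library( tidyverse )

url <- "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697"
page <- read_html( url )

# pluck(1) pulls the first table from the list, just like .[[1]] does.
page %>%
  xml_find_all( "//table[contains(@class, 'data')]" ) %>%
  html_table( fill = TRUE ) %>%
  purrr::pluck( 1 ) %>%
  head()
```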