Samuel Workman

Professor, Data & Statistical Consultant, West Virginian

The word “confluence” means the juncture or merging of things, usually rivers. I grew up in West Virginia near the confluence of Gauley River (world-class whitewater for you rafters) and Meadow River (a smallmouth bass fishing haven). The word confluence is also an apt description of how the intellectual currents of my work converge. My work sits at the confluence of statistics, data science, and public policy. Were these rivers, data science would undoubtedly be the smallest. Though it’s fashionable to call oneself a “data scientist” these days, a data scientist I am not. I DO, however, use many of the tools, techniques, and processes of a data scientist.

I study public policy, and as such, don’t have the luxury of data that is of consistent form and measurement (e.g., dollars and cents for you economists). So, one day I am working with machine learning techniques for classifying text, the next I am working on measurement models for these classifications, and thereafter I may be building an extreme events model for some facet related to the data. All this means that I do a LOT of what data scientists, particularly those of the Tidyverse persuasion, call “data wrangling” - cleaning, organizing, and transforming data for analysis.

I often forget commands for various things that I need to do with data. RStudio provides a great set of Cheatsheets for various packages and processes of data science and statistics. They come as pdf posters like you’d see at a conference. The problem I have is that I am often working from The Lodge in West Virginia, where an internet connection can be a challenge with Gauley River’s canyon roaring in the background. This means I am often searching for and downloading the cheatsheets I think I may need while I am there. Since the cheatsheets are updated from time to time, I run the risk of having a dated one stored locally. This all led me to think about writing a short script that would access all RStudio Cheatsheets and download them locally. On to the code…

Load the Useful Libraries

The tidyverse is the workhorse of these types of operations, at least for those using R. Specifically, rvest is the primary tool for reading and parsing html. Finally, stringr provides consistent ways of dealing with text strings, which is useful when you are scraping lists of urls.

library(tidyverse) 
library(rvest)
library(stringr)

Important Tidbits

  • In the initial scrape, str_subset("\\.pdf") tells R to return only the links to pdfs. Otherwise, you get the links for the entire repository, including development files.
  • map(html_node, "#raw-url") tells R to look for the url associated with the download button for each cheatsheet. You identify this tag using Google’s Selector Gadget; search for it to find examples and how to identify tags.
  • purrr::walk2 applies the download.file function to each of the generated raw links. The “.” tells it to use the data from the previous command (the raw urls). basename(.) tells it to name the downloaded file after the basename of the url (e.g., “purrr.pdf”).
  • Depending on your pdf reader, you may need to add mode = "wb". Otherwise the files may appear blank, corrupt, or not render properly. See the documentation for download.file() for more information.
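Putting those pieces together, the scrape can be sketched roughly as follows. This is a reconstruction from the steps just described, not the post’s verbatim code: the repository url, the link-fixing str_c steps, and the pdf_links helper are my assumptions, and GitHub’s markup (including the “#raw-url” selector) may change over time.

```r
library(tidyverse)  # attaches purrr, stringr, and friends
library(rvest)

# Select only the repository links that point at pdfs (the cheatsheets),
# dropping links to development files.
pdf_links <- function(page) {
  page %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    str_subset("\\.pdf")
}

# The full run needs a network connection, so it is left commented out.
# The repository url is an assumption about where the cheatsheets live.
# page <- read_html("https://github.com/rstudio/cheatsheets")
# raw_list <- pdf_links(page) %>%
#   str_c("https://github.com", .) %>%   # make the relative links absolute
#   map(read_html) %>%                   # visit each cheatsheet's page
#   map(html_node, "#raw-url") %>%       # find the download button
#   map_chr(html_attr, "href") %>%       # pull the raw download url
#   str_c("https://github.com", .)       # make those links absolute too
#
# # Download each pdf, naming it after the url's basename (e.g., "purrr.pdf").
# walk2(raw_list, basename(raw_list), download.file, mode = "wb")
```

Factoring the link selection into pdf_links makes that step easy to check on any parsed html before running the full, network-bound pipeline.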

Old School

Note that walk2 and many of these other great functions from the tidyverse are not necessary per se. The for-loop below is an old-school implementation, in base R, of what walk2 is doing above. Here, download.file is applied to each url in raw_list (url is just a tag for each element of the object raw_list).

for (url in raw_list){
  download.file(url, destfile = basename(url), mode = "wb")
}

Wrapping Up

So, this is not the most efficient code for scraping where there is a need for “button-clicking,” but it gets the job done with minimal packages and no knowledge of json or other languages. Now that we’re done, I can also say that the Cheatsheets are available at RStudio’s resource page. There, they would be an easier scrape, and of course you can just click and download them. I often point students to them, as they are a great resource in teaching data analysis and statistics.
