Scraping Impact Factor data from the Web using httr and regex in R

A couple of days ago, I found a website listing the Impact Factor data of many scientific journals in HTML tables (http://www.citefactor.org). Unfortunately, the website did not offer a one-click download of the Impact Factor tables, and the data were scattered across several HTML pages. I therefore wrote a short R script that scrapes the data and saves them in a convenient format. The httr package was used for retrieving the HTML code, and the information was extracted using R regular expressions.

baseAddr <- "http://www.citefactor.org/journal-impact-factor-list-2014_"
extenAddr <- ".html"
sitePages <- c("0-A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
               "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")

We can retrieve the content of each web page by looping over the sitePages vector and calling the GET function from the httr package:

library(httr)

for (page in sitePages) {
  queryAddr <- paste(baseAddr, page, extenAddr, sep = "")
  sourceHTML <- GET(queryAddr)
  sourceHTML <- toString(sourceHTML)
  # ... the parsing steps described below run here, once per page ...
}
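
As a side note, the idiomatic way to pull the body out of an httr response is the content() accessor rather than toString(), since it handles character decoding explicitly. A small equivalent sketch, assuming the pages are served as UTF-8:

sourceHTML <- GET(queryAddr)
sourceHTML <- content(sourceHTML, as = "text", encoding = "UTF-8")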

sourceHTML is now a character object containing the raw HTML as a single string. We can use regular expressions to extract the information we are looking for: the full journal name and the 2013/14 Impact Factor. The following R functions turned out to be very useful for this project:

  • regexpr(pattern, text, ...). Searches for the first occurrence of pattern in text and returns its position as an integer (or -1 if there is no match). The result carries an attribute, "match.length", giving the length of the matched portion of text.
  • substr(text, start, stop). Returns the substring of text beginning at position start and ending at position stop.
  • gsub(pattern, replacement, text). Replaces every occurrence of pattern in text with the replacement string.

If we don't want special characters in the pattern to be interpreted as regular expression syntax, we can pass fixed = TRUE to any of these functions so that the pattern is matched literally. For more info, check the R documentation about regular expressions (?regex). A toy example is sketched below.
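
To make these functions concrete, here is a minimal toy example; the string x is made up for illustration and is not part of the scraping script:

x <- "<b>Hello</b> world"

m <- regexpr("Hello", x, fixed = TRUE)          # m is 4; attr(m, "match.length") is 5
substr(x, m, m + attr(m, "match.length") - 1)   # "Hello"
gsub("<b>", "", x, fixed = TRUE)                # "Hello</b> world"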

I started by extracting the HTML code corresponding to the table of interest. Conveniently, the table includes a <CAPTION> tag that I used to trim away the undesired HTML code preceding it.

tabStart <- regexpr("<CAPTION>Impact Factor 2014</CAPTION>", sourceHTML, fixed = TRUE)
tabEnd <- regexpr("</TABLE>", sourceHTML, fixed = TRUE)
# keep everything between the end of the caption and the closing </TABLE>
tabHTML <- substr(sourceHTML, tabStart + attr(tabStart, "match.length"), tabEnd - 1)

Each table row is enclosed in a <TR>…</TR> tag pair. We can explode the table row-wise by splitting on the closing tag with strsplit, as follows. Since strsplit returns a list, unlist is used to flatten the result into a character vector of HTML chunks.

tabChunks <- unlist(strsplit(tabHTML, "</TR>", fixed = TRUE))

I then polished each HTML chunk in the tabChunks vector to remove formatting tags and unwanted characters such as newlines.

chunk <- tabChunks[i] # where i is an integer between 1 and length(tabChunks)

chunk <- gsub("<b>", "", chunk, fixed = TRUE)
chunk <- gsub("</b>", "", chunk, fixed = TRUE)
chunk <- gsub("\n", "", chunk, fixed = TRUE)

I want to retrieve the strings in the second and fourth columns of the HTML table row. As each cell is delimited by a <TD> tag, I can use strsplit again and then pick out the strings at the desired positions.

tmp_entries <- unlist(strsplit(chunk, "</TD>", fixed = TRUE))
jTitle <- gsub("<TD DIR=LTR ALIGN=LEFT>", "", tmp_entries[2], fixed = TRUE)
jTitle <- toupper(jTitle)
jIF <- gsub("<TD DIR=LTR ALIGN=LEFT>", "", tmp_entries[4], fixed = TRUE)

The script uses the rbind function to append the information extracted from each row to the same data frame (not shown above). After looping over all HTML pages and all rows of each table, the script ends up with a complete data frame that can easily be saved as a CSV file for storage and further use.
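
Since that assembly step is not shown above, here is a minimal sketch of how it might look for a single page; the results data frame, its column names, and the output file name are hypothetical:

# Hypothetical assembly step: start from an empty data frame and add one
# row per parsed <TR> chunk.
results <- data.frame(Journal = character(0), ImpactFactor = character(0),
                      stringsAsFactors = FALSE)

for (i in seq_along(tabChunks)) {
  chunk <- gsub("<b>", "", tabChunks[i], fixed = TRUE)
  chunk <- gsub("</b>", "", chunk, fixed = TRUE)
  chunk <- gsub("\n", "", chunk, fixed = TRUE)
  tmp_entries <- unlist(strsplit(chunk, "</TD>", fixed = TRUE))
  if (length(tmp_entries) < 4) next  # skip the header row and malformed chunks
  jTitle <- toupper(gsub("<TD DIR=LTR ALIGN=LEFT>", "", tmp_entries[2], fixed = TRUE))
  jIF <- gsub("<TD DIR=LTR ALIGN=LEFT>", "", tmp_entries[4], fixed = TRUE)
  results <- rbind(results, data.frame(Journal = jTitle, ImpactFactor = jIF,
                                       stringsAsFactors = FALSE))
}

# once all pages have been processed, write the final table to disk
write.csv(results, file = "impact_factors_2014.csv", row.names = FALSE)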

You can find the source code of this project at the following address: https://github.com/dami82/datasci/blob/master/IF_scraping.R


About the Author

Damiano
Postdoc Research Fellow at Northwestern University (Chicago)
