Monday, July 6, 2015

Extracting JSON data in R

Increasingly, data on the internet is presented in JSON format. For the uninitiated, here's what a JSON file looks like:

{
  "people": [
    {"name": "John Smith", "timings": [1,2,3]},
    {"name": "Adam Yu", "timings": [3,2,1]},
    {"name": "Frank Pin", "timings": [7,4,3]}
  ]
}

It's more or less a data file. The curly braces enclose objects made up of named fields (e.g. we know that we have the name and timings of each person), while the square brackets enclose ordered, unnamed arrays (e.g. Frank's first timing is 7, his second timing is 4, and his third timing is 3).
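Once parsed into R (with the RJSONIO package introduced below), this toy file becomes a nested list of lists. A minimal sketch, feeding the JSON in as a string:

library(RJSONIO)
toy <- fromJSON('{"people":[{"name":"John Smith","timings":[1,2,3]},
                            {"name":"Adam Yu","timings":[3,2,1]},
                            {"name":"Frank Pin","timings":[7,4,3]}]}')
toy$people[[3]]$name        # "Frank Pin"
toy$people[[3]]$timings[1]  # 7 -- Frank's first timing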

How can R analyze this data appropriately? Let's take a look at a real-life example: the salaries of public servants at the City of Chicago. R has three major packages for parsing JSON data: RJSONIO, rjson, and jsonlite. The differences between them are minor, especially for simple JSON files. Here, we will use the RJSONIO package.
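For the curious, the other two packages expose essentially the same entry point. Here is a quick sketch with jsonlite, shown only for comparison (RawDataJL is just an illustrative name); the rest of this post sticks with RJSONIO:

# install.packages("jsonlite")
# Namespacing the call (jsonlite::) avoids clashing with RJSONIO's own fromJSON()
RawDataJL <- jsonlite::fromJSON("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.json?accessType=DOWNLOAD",
                                simplifyVector = FALSE) # FALSE keeps the nested-list structure rather than auto-simplifying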


First, let's install and load the required package, and then download the data.


install.packages("RJSONIO")
library(RJSONIO)
# fromJSON() accepts a URL (or file name) directly and returns a nested list
RawData <- fromJSON("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.json?accessType=DOWNLOAD")

Should you have difficulty downloading the City of Chicago data, here's a link. (Note: the information in the link is correct as of July 2015.)
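Before going further, a quick sketch to confirm the download worked and to see the top-level structure:

names(RawData)       # the top level should contain "meta" and "data"
length(RawData$data) # one entry per row, i.e. per employee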

You should also display the JSON file in a web browser for easier viewing. Once you've done this, you'll notice that what you're looking for is in the data section. Hence

Data <- RawData$data

gives you most of what you need. I say most, because things like variable names aren't there. Here's an example of how you would extract a given variable:

employeeNames <- sapply(Data, function(x) x[[9]]) # element 9 of each record holds the employee's name

What sapply does is take each observation in Data (which is a list of lists), extract the 9th element of that observation, and collect the results into employeeNames. Because sapply simplifies its output, employeeNames ends up as a vector with one entry per employee.
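Equivalently, you can pass the extraction operator "[[" to sapply directly and skip the anonymous function:

employeeNames <- sapply(Data, "[[", 9) # same result as the anonymous-function version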

Just to check that things are working properly, we can run
head(employeeNames) 
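How do you know that element 9 is the employee name? The column metadata in the meta section spells it out (the same path is used further down to name the data frame). A quick sketch:

columnNames <- sapply(RawData$meta$view$columns, function(col) col$name)
columnNames # the 9th entry should correspond to the names we just extracted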
In order to extract all variables efficiently, we need to define two functions. 
install.packages('gdata')
library(gdata) # necessary for trim
grabinfo <- function(var) {
  print(paste("Variable", var, sep=" ")) # for aesthetics: tells you which variable is being processed
  sapply(Data, function(x) returnData(x, var)) # 'Data' is hardcoded here; each of its records is passed in turn as 'x'
}
returnData <- function(x, var) {
  # returns element 'var' of record 'x' (trimmed of whitespace), or NA if it is missing
  if (!is.null(x[[var]])) {
    return(trim(x[[var]]))
  } else {
    return(NA)
  }
}
df <- data.frame(sapply(1:length(Data[[1]]), grabinfo), stringsAsFactors=FALSE)
# Runs grabinfo() once for each variable in Data:
# grabinfo(1), grabinfo(2), ..., up to the number of fields in a record (12 for this dataset).
# grabinfo(1) then uses sapply (again) to collect the values of variable 1 for all observations,
# likewise for grabinfo(2), and so on.
#
# In doing so, the entire dataset is transformed into a data frame.
Finally, some technicalities:
head(df) # Just checking that things are working again
names(df) <- sapply(1:12,function(x) RawData[['meta']][['view']][['columns']][[x]][['name']]) # this dataset has 12 columns; their names live in the meta section
df$`Employee Annual Salary` <- as.numeric(df$`Employee Annual Salary`) # change the variable to numeric
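One caveat: if the salary strings carry a leading dollar sign or thousands separators (which depends on how the portal exports them), as.numeric() alone will return NAs. A hedged cleanup sketch:

# strip "$" and "," before converting, in case the raw strings contain them
df$`Employee Annual Salary` <- as.numeric(gsub("[$,]", "", df$`Employee Annual Salary`))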
Of course, this is a very simple case. What if your JSON contains nested objects (objects within objects)? You may wish to consult this excellent guide (this blog post took many good points from it). What if you're actually downloading a JavaScript file containing JSON? Check this out.
