Sunday, July 12, 2015

Linear regression

Let's try regression on the auto dataset:
regfit.full <- regsubsets(price~., auto)

It doesn't work because, by default, an exhaustive search over all subsets of predictors is performed.
With this many predictors that means a huge number of candidate models, so regsubsets refuses to run unless you set really.big=TRUE.
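To get a feel for why, you can count how many columns the full design matrix would have (a quick sketch, assuming auto is already loaded; make, a text variable, expands into one dummy column per distinct value):

dim(model.matrix(price ~ ., auto))
# rows with missing values are dropped automatically; the column count is what matters here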

But let's just make things simpler by carving out a subset of variables:
autosmall <- auto[,c(2,3,5,6,7,8,9,10,11)]
regfit.full <- regsubsets(price ~ . , autosmall) # the . indicates we want all variables to be considered.
summary(regfit.full)

regfit.fwd <- regsubsets(price~., autosmall, method="forward") # forward stepwise selection
regfit.bwd <- regsubsets(price~., autosmall, method="backward") # backward stepwise selection
coef(regfit.fwd,3) # gives the coefficients for regfit.fwd for the optimal 3-variable model

Recall that we actually need to split our dataset into a training set and test set. This is how we can do it:
set.seed(1)
train <- sample(c(TRUE, FALSE),nrow(autosmall),rep=TRUE)
# if we wanted 75% of observations to be in training set, we can do the below:
# train <- sample(c(TRUE, FALSE),nrow(autosmall),rep=TRUE, prob = c(0.75, 0.25))
test <- (!train)

# another way of allocating the training/test sets:
# train <- sample(1:nrow(autosmall),  nrow(autosmall)/2)
# test <- (-train)
# autosmall.test <- autosmall[test, ]
# (notice that test contains all negative numbers, so all "train" observations are excluded)

So far, to keep things simple, we haven't done any validation (or regularization). Let's start with the validation-set approach, fitting the models on the training observations only:
regfit.full <- regsubsets(price~., autosmall[train,])
reg.summary <- summary(regfit.full) # for convenience
names(reg.summary)

For model selection, it'll be useful to plot how RSS (and other measures such as adjusted R-sq) varies with number of variables:

par(mfrow = c(1,2)) # sets a 1 row by 2 column plot
plot(reg.summary$rss,xlab="# of Var", ylab="RSS", type="l")
(m1 <- which.min(reg.summary$rss))
points(m1,reg.summary$rss[m1], col="red",cex=2,pch=20)
# cex controls the size of the plotted point (pch=20 is a filled dot)

plot(reg.summary$adjr2,xlab="# of Var", ylab="Adj R2", type="l")
(m2 <- which.max(reg.summary$adjr2))
points(m2,reg.summary$adjr2[m2], col="red",cex=2,pch=20)

plot(regfit.full, scale="r2")
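reg.summary also contains Cp and BIC; for these, lower is better, so we mark the minimum instead. A sketch along the same lines as the plots above:

plot(reg.summary$cp, xlab="# of Var", ylab="Cp", type="l")
points(which.min(reg.summary$cp), min(reg.summary$cp), col="red", cex=2, pch=20)
plot(reg.summary$bic, xlab="# of Var", ylab="BIC", type="l")
points(which.min(reg.summary$bic), min(reg.summary$bic), col="red", cex=2, pch=20)
plot(regfit.full, scale="bic") # the built-in regsubsets plot also accepts scale="bic", "Cp" or "adjr2"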

Now we need to see MSE on the test set. First off, use model.matrix to generate the "X" matrix for the test set:

test.mat <- model.matrix(price~., data=autosmall[test,])

val.errors = rep(NA,8) # creates an empty vector to hold the 8 test MSEs
for (i in 1:8) {
  coefi <- coef(regfit.full, id=i) # takes the coefficients from the optimal model with i variables
  pred <- test.mat[,names(coefi)] %*% coefi # generate predicted values of each observation
  val.errors[i] <- mean((autosmall$price[test] - pred)^2) # generate MSE for i-th model
}
val.errors
#[1]  7781131  9856769 10060398  9574022  9421108  9470527  9269711  8942316
(v <- which.min(val.errors))
#[1] 1
# so we choose model with 1 variable (weight only)
coef(regfit.full, v)
#(Intercept)      weight 
#-426.910662    2.251572
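To put that error on the dollar scale of price, take the square root of the chosen model's MSE:

sqrt(val.errors[v])
# about 2789, i.e. a root-mean-squared prediction error of roughly $2,800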

We can also write our own predict function for regsubsets objects, which will come in useful for k-fold cross-validation:
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) # extracts the formula from the call that created the object
  # object$call stores the original regsubsets() call;
  # its second element is the formula (here, price ~ .)
  mat <- model.matrix(form, newdata) # generates the required X matrix using the formula
  coefi <- coef(object, id=id) # the X variable COEFFICIENTS of the optimal model
  xvars <- names(coefi) # these are the X variables of the optimal model
  mat[,xvars] %*% coefi # generates predicted values (by looking at the relevant columns only)
}

# 10-fold cross validation
k = 10
set.seed(1)
folds = sample(1:k, nrow(autosmall), replace=TRUE) 
# if an observation is labelled 'i', it will be part of the i-th fold.
# e.g. an observation that is labelled 3 will be left out of the 3rd round of cross validation
cv.errors = matrix(NA,k,8, dimnames=list(NULL, paste(1:8)))
# sets up a k by 8 matrix filled with NA's. 
# columns are named 1 to 8
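table(folds) # optional sanity check: shows how many observations fall into each of the k folds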

# armed with our new predict() method, we can:
for (j in 1:k) {
  # obtain the required coefficients
  best.fit <- regsubsets(price ~ . , data=autosmall[folds != j,])
  # note: if you have more than 8 variables you will need to specify nvmax in the above command
  for (i in 1:8) {
    # now, calculate MSE using the i-th fold
    pred <- predict(best.fit, autosmall[folds == j,],id=i)
    # since best.fit is of class regsubsets, the predict command that was created will work on it.
    cv.errors[j,i] <- mean((autosmall$price[folds == j] - pred)^2)
  }
}
(mean.cv.errors <- apply(cv.errors,2,mean))

Let's plot the cross-validated MSE against the number of variables:
par(mfrow=c(1,1))
plot(mean.cv.errors, type='b')
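Once cross-validation has picked a model size, the natural last step (a sketch of the usual recipe) is to refit best subset selection on the full dataset and report the coefficients of the model of that size:

best.size <- which.min(mean.cv.errors)
reg.best <- regsubsets(price ~ ., data=autosmall)
coef(reg.best, best.size)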

Overview of Machine Learning Techniques in R

Whenever we do something as simple as a Google search, we benefit from machine learning. Demand is high for those with machine learning skills - a search of LinkedIn reveals 9,369 results for "machine learning".

This post gives an overview of packages (and functions within those packages) one can use to conduct machine learning techniques. A lot of this material is drawn from the fourth edition of the textbook Introduction to Statistical Learning.

  1. Linear regression
Library: leaps
Functions: regsubsets
  2. Decision trees
  3. Support Vector Machines
  4. Unsupervised learning
Subsequent posts will elaborate on these in more detail.

Apply, lapply, sapply, tapply in R

How can we use R to efficiently calculate key statistics such as the mean?
Let's play with the auto dataset to see how.

apply(cbind(auto$price,auto$length,auto$weight),2,mean) # finds simple means

apply(auto,2,mean) would not work because some columns are non-numeric.
Instead, one can try apply(auto[,c(2,3)],2,mean).

If you wanted to find out the number of distinct values each variable takes, this would be simple:
apply(auto, 2, function (x) length(unique(x)))

The second argument (2) tells R to work column-by-column. Had 1 been specified, R would have worked row-by-row. You can also use the sapply function, which by default returns a vector:

sapply(auto, function(x) length(unique(x)))

If we wanted the output in terms of a list, we could use lapply:
lapply(auto, function(x) length(unique(x)))

What if we wanted disaggregated statistics? For example, we might want to know what is the mean price of foreign cars, as well as the mean price of domestic cars. We can use

tapply(auto$price, auto$foreign, mean)

This is similar to Excel's PivotTable. The by function is also acceptable:

by(auto$price, auto$foreign, mean)

What if the variable has NA values? For example, by(auto$rep78, auto$foreign, mean) returns us the unglamorous

auto$foreign: Domestic
[1] NA
-----------------
auto$foreign: Foreign
[1] NA

This is because rep78 contains missing values. We can easily handle this:

tapply(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
or
by(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))

And what if we wanted to restrict attention to cars with prices above $10,000? A tempting one-liner is
tapply(auto$rep78, auto$foreign, function(x) mean(x[auto$price > 10000], na.rm=TRUE))
but be careful: inside the function, x contains only that group's rep78 values, so indexing it with the full-length vector auto$price > 10000 does not line up row by row.

A safer way is to subset the data first:
m <- subset(auto, !is.na(auto$rep78) & auto$price > 10000)
by(m$rep78, m$foreign, mean)

Useful resources for learning R


Monday, July 6, 2015

Extracting JSON data in R

Increasingly, data on the internet is presented in JSON format. For the uninitiated, here's what a JSON file looks like:

{
  "people": [
    {"name": "John Smith", "timings": [1,2,3]},
    {"name": "Adam Yu", "timings": [3,2,1]},
    {"name": "Frank Pin", "timings": [7,4,3]}
  ]
}

It's more or less a data file. The curly braces indicate objects with named fields (e.g. we know the name and timings of each person), while the square brackets indicate ordered, unnamed arrays (e.g. Frank's first timing is 7, his second timing is 4, and his third timing is 3).

How can R analyze the data appropriately? Let's take a look at a real life example: the salaries of public servants at the City of Chicago. R has three major packages which can analyze JSON data.

These are RJSONIO, rjson, and jsonlite. The differences are minor, especially for simple JSON files.
Here, we will use the RJSONIO package.


First, let's download and install the required packages and load the required data. 


install.packages("RJSONIO")
library(RJSONIO)
RawData <- fromJSON("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.json?accessType=DOWNLOAD")
Should you have difficulty downloading the city of Chicago data, here's a link. (Note: information in the link is correct as of July 2015)
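If you'd like to see fromJSON in action on a tiny input first, here is a minimal sketch using the example JSON from the top of this post (the string is recognised as JSON text rather than a file path):

library(RJSONIO)
txt <- '{"people":[
  {"name":"John Smith", "timings":[1,2,3]},
  {"name":"Adam Yu", "timings":[3,2,1]},
  {"name":"Frank Pin", "timings":[7,4,3]}
]}'
parsed <- fromJSON(txt)
parsed$people[[3]]$name    # "Frank Pin"
parsed$people[[3]]$timings # 7 4 3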

You should actually display the JSON file in a web browser for easier viewing. Once you've done this, you'll notice that what you're looking for is in the data section. Hence

Data <- RawData$data

gives you most of what you need. I say most, because things like variable names aren't there. Here's an example of how you would extract a given variable:

employeeNames <- sapply(Data, function(x) x[[9]])

What sapply does is take each observation in Data (a list of lists), extract the 9th element of each, and collect the results into a vector, which we save as employeeNames.

Just to check that things are working properly, we can run
head(employeeNames)

In order to extract all variables efficiently, we need to define two functions.
install.packages('gdata')
library(gdata) # necessary for trim
grabinfo  <- function(var) {
  print(paste("Variable", var, sep=" ")) # for aesthetics: tells you which variables have been processed
  sapply(Data, function(x) returnData(x,var)) # the dataset "Data" is hardcoded here and is input as parameter "x"
}
returnData <- function(x, var) {
  if (!is.null( x[[var]] )) {
    return ( trim( x[[var]] ))
  } else {
    return(NA)
  }
}
df <- data.frame(sapply(1:length(Data[[1]]),grabinfo), stringsAsFactors=FALSE)
# Performs grabinfo for each variable in Data
# you run grabinfo(1), grabinfo(2), ... grabinfo(12).
# grabinfo(1) then uses sapply (again) to get values of Variable1 for all observations
# likewise with grabinfo(2), and so on
#
# in doing so, the entire dataset is transformed into a dataframe
Finally, some technicalities:
head(df) # Just checking that things are working again
names(df) <- sapply(1:12,function(x) RawData[['meta']][['view']][['columns']][[x]][['name']])
df$`Employee Annual Salary` <- as.numeric(df$`Employee Annual Salary`) # change the variable to numeric
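A quick sanity check (a sketch) that the conversion worked:

summary(df$`Employee Annual Salary`)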
Of course, this is a very simple case. What if our JSON contains nested objects (objects within objects)? You may wish to consult this excellent guide (this blogpost took many good points from there). What if you're actually downloading a Javascript file containing JSON? Check this out.

Sunday, July 5, 2015

Linear Regression in R

Let's load the auto dataset and learn regression techniques in more detail. (For those who have not read previous posts, this is a dataset containing information about 74 cars.)

Suppose we want to find the relationship between the price and weight of a car. One way to do this is

priceweight <- lm(auto$price ~ auto$weight)

Alternatively,

priceweight <- lm(price ~ weight, data=auto)

will work as well.

The command summary(priceweight) will then display details of the regression such as coefficient estimates and p-values.

We can then visualize the data by
plot(auto$weight,auto$price)
abline(priceweight)

Notice that for plot, the first named variable takes the x axis, while the other variable takes the y axis.

The function abline draws a straight line through the plot, using the coefficients produced by the regression. You can always produce other lines if you want. For example,

abline(a=0,b=1,col='blue')

produces a straight line with an intercept of 0 and a slope of 1. In other words, this would be the model if price and weight had a 1:1 relationship (clearly not supported by the data). Notice that specifying a blue color helps to differentiate this line from the best-fit line.

If we wanted to, we could emphasize the best fit line more:
abline(priceweight, lwd = 5)

Here lwd stands for "line width". The line will be 5 units thick (the default is 1, so we have increased the width drastically).

Squared terms

It might be reasonable to think that price is a non-linear function of weight. How can we see whether, for example, price is a quadratic function of weight?

An intuitive guess may be 

priceweight2 <- lm(price ~ weight + weight^2, data=auto)

But it doesn't quite produce the results we want:

Call:
lm(formula = price ~ weight + weight^2, data = auto)

Coefficients:
(Intercept)       weight  
     -6.707        2.044  

What you actually wanted to do was this:

priceweight2 <- lm(price ~ weight + I(weight^2), data=auto)

The I() means "inhibit". Without I(), ^ would be interpreted as a formula operator; wrapping the term in I() forces ^ to be treated as an ordinary arithmetic operator. If you are confused, think of it this way: ^ has a special meaning when it appears in a formula. Using I() suppresses this special meaning and forces R to treat it in the normal way (exponentiation).
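An equivalent way to add polynomial terms is poly(); with raw=TRUE it fits the same raw quadratic as the I() version. A sketch (the name priceweight2.poly is just for illustration):

priceweight2.poly <- lm(price ~ poly(weight, 2, raw=TRUE), data=auto)
summary(priceweight2.poly) # same fit as price ~ weight + I(weight^2), only the coefficient labels differ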

With this in mind, we can do standard F tests to determine if model fitness significantly improves with the addition of a squared term.

anova(priceweight,priceweight2)

Analysis of Variance Table

Model 1: price ~ weight
Model 2: price ~ weight + I(weight^2)
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1     72 450831459                                  
2     71 384779934  1  66051525 12.188 0.0008313 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model fitness significantly improves. Since the null hypothesis is rejected, the test therefore lends support to the inclusion of a squared term. The test, however, is by no means definitive - a lot more things need to be considered. For example, you could run
plot(priceweight)
plot(priceweight2)

and end up with the following plots:

Linear case:

Quadratic case:


It looks like in the quadratic case, the residuals are much less correlated with the fitted values. This also suggests that a quadratic model is better.

Interaction Variables

Suppose we hypothesized that the relationship between price and weight is different for foreign cars than for domestic cars.
Then we can use 

priceweight <- lm(price ~ weight + weight:foreign, data=auto)

We might even think that being foreign has some effect, so let's throw the variable in: 

priceweight <- lm(price ~ weight + weight:foreign + foreign, data=auto)

But hey, all these can be simplified:

priceweight <- lm(price ~ weight*foreign, data=auto)

Notice that weight*foreign is simply shorthand for weight + weight:foreign + foreign.

Logistic Regression

Logistic regression is an instance of generalized linear models (GLM). Notice that the variable foreign takes on only two values: domestic and foreign. Suppose (this is for the sake of giving an example), we thought that price was a predictor of a car being foreign. We can then run

nonlinear <- glm(foreign~price,data=auto,family=binomial)
summary(nonlinear)

The argument family=binomial ensures that a logistic regression is run (there are many kinds of GLMs, so we want to pick the right one). Naturally, one may be unsure whether "Domestic" is 0 or 1, but one can easily verify this with 

contrasts(auto$foreign)

which tells us that R treats "Domestic" as being equal to 0 and "Foreign" as being equal to 1.

predictedprob <- predict(nonlinear, type="response")

auto$predicted <- rep("Domestic", dim(auto)[1])
auto$predicted[predictedprob>=0.5] <- "Foreign"

table(auto$predicted,auto$foreign)

which gives us the confusion matrix

           Domestic Foreign
  Domestic       52      22
  Foreign         0       0

The last row may not appear on your screen, but I added it for clarity. Notice that price is a very poor predictor of whether a car is domestic or foreign. Our model predicts that all cars are domestically produced, resulting in a large number of errors. We either need to find better predictors, or tweak the threshold above which we predict a car is of foreign origin.
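For instance, here is a sketch of lowering the cutoff (0.3 is an arbitrary value chosen purely for illustration):

auto$predicted <- rep("Domestic", dim(auto)[1])
auto$predicted[predictedprob >= 0.3] <- "Foreign" # lower cutoff than 0.5
table(auto$predicted, auto$foreign)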

Saturday, July 4, 2015

Scatterplots with ggplot2

While the standard R installation comes with basic scatterplot functions, one often needs more advanced functions. To this end, one very popular package is ggplot2, where gg refers to "grammar of graphics".

library(ggplot2)
library(choroplethr)
data(df_state_demographics)
names(df_state_demographics)

outputs

[1] "region"            "total_population"  "percent_white"     "percent_black"  
[5] "percent_asian"     "percent_hispanic"  "per_capita_income" "median_rent"    
[9] "median_age"    

Suppose we want to see the relationship between per capita income and median rent in each state. A simple way of doing this would be 

ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
    geom_point(shape=1)

The first argument tells ggplot which dataset to use. We then specify the x and y variables within aes. Think of aes as creating an "aesthetic", a mapping that says which variables go where.
At a bare minimum, ggplot also needs to be told how to draw the data, and this must be added as a separate layer. Hence the command "+ geom_point(shape=1)", which draws each observation as a point (shape=1 gives hollow circles). The output follows:



If we want a linear regression line, we can tack on another layer: 

ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
    geom_point(shape=1) +
    geom_smooth(method=lm)

which gives us



What if we don't want confidence intervals? Then we can try

ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
    geom_point(shape=1) +
    geom_smooth(method=lm, se=FALSE)

 and...


Finally, what if we want to do LOESS? Just omit the method argument to geom_smooth (with no method specified, it defaults to a LOESS fit for a dataset this small).

ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
    geom_point(shape=1) +
    geom_smooth()


Of course, we would want to make things fancier. For example, we might want to add a title. To make things simple, let's save our base plot as base_plot:

base_plot <- ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
    geom_point(shape=1) +
    geom_smooth()



How can we:
1. Add a title?

base_plot + ggtitle("State per capita income and median rent")

2. Add labels?

base_plot + xlab("Per capita income ($)") + ylab("Median rent ($)")

Of course, this is just the tip of the iceberg. You may wish to see this excellent tutorial (part of this blogpost was drawn from there).

Plotting maps with choroplethr

According to the below graphic from the Sacramento Bee, Latinos have overtaken whites to become the most populous ethnic group.


Let's use the choroplethr package to examine the racial distribution of California counties.

Let's start with the preliminaries: installing and loading the package, as well as loading the relevant datasets. We'd also need to load the ggplot2 library, which will come in useful for plotting the maps.

install.packages("choroplethr")
library(choroplethr)
data(df_state_demographics)
data(df_county_demographics)
library(ggplot2)

Thereafter, we create the individual maps. Notice that each map is created by adding "layers" on top of each other. For example, choro_white consists of three layers:
1. county population data of whites (county_choropleth accesses df_county_demographics$value)
2. a graph title,
3. a geographical map (coord_map()), which is part of ggplot2.


df_county_demographics$value = df_county_demographics$percent_white
choro_white = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) +
  ggtitle("California Counties\n Percent White") +
  coord_map()

df_county_demographics$value = df_county_demographics$percent_hispanic
choro_hispanic = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) + 
  ggtitle("California Counties\n Percent Hispanic") + 
  coord_map()

df_county_demographics$value = df_county_demographics$percent_asian
choro_asian = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) + 
  ggtitle("California Counties\n Percent Asian") + 
  coord_map()

Finally, we need to display the maps we've produced:

library(gridExtra)
grid.arrange(choro_white, choro_hispanic, choro_asian, nrow=3)

The output:



As is expected, the southern counties have more Hispanics.

Thursday, July 2, 2015

R for Stata users (Part 3)

In part 1 and part 2, we learned how to use R to load datasets, perform simple manipulations, and conduct linear regression.

Here, we will learn how to save work, draw scatterplots, handle dates and times, extract substrings, and encode variables. Make sure that you've opened RStudio and loaded the auto dataset.

Task #11: Save work

Once you've loaded and analyzed the auto dataset, you'll be wondering how to save it. For example, you may need to pass a CSV or Stata file to your colleague. Fortunately, R has lots of capabilities in these areas.


  • CSV File

The following command will write a CSV file to RStudio's current working directory:

write.csv(auto, "edited_auto.csv")

If you don't know what the current working directory is, type

getwd()

(short for get working directory). You can then find your file in that directory.
Alternatively, type

write.csv(auto, "C:/Documents/edited_auto.csv")

if you have a folder called Documents and want to place it there.

Important: In R, use forward slashes instead of backward slashes when writing directory paths.
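If you must use backslashes, double them, since a single backslash starts an escape sequence in R strings. Both of the following refer to the same file:

write.csv(auto, "C:/Documents/edited_auto.csv")
write.csv(auto, "C:\\Documents\\edited_auto.csv")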

One remark is that if I wanted to have a tab separated file, I could use the more general "write.table":

write.table(auto, "C:/Documents/edited_auto.csv", sep="\t")

where \t denotes the tab symbol.

  • Excel file
To write an Excel file, you need something called a "library". Think of a library as an add-on. It first needs to be installed

install.packages("xlsx")

where xlsx is the specific library we are instructing R to download. Thereafter, we need to load the package:

library(xlsx)

and we can then write our Excel file.

write.xlsx(auto, "C:/auto.xlsx")

The first parameter (auto) is the dataframe we want R to save, and the second parameter (C:/auto.xlsx) is where we want to save it. Notice again the use of a forward slash.


  • Stata

Of course, as a veteran Stata user, you may want to save something directly into a .dta file. You'll need a different library, called "foreign".

install.packages("foreign")
library(foreign)
write.dta(auto,"auto.dta")

and remember to locate auto.dta in the working directory.


  • SAS
If your collaborators use SAS, not to worry, the foreign package also has SAS capabilities.

write.foreign(auto, "auto.txt", "auto.sas", package="SAS")

The first parameter tells R the dataset to use, and the second names the data file to output. The third names a file that will contain code for interpreting that data file. What kind of code? That's the job of the final parameter, which specifies that SAS code should be produced.


Task #12: Draw scatterplots

Plotting graphs in R is pretty much an art. There are many specialized libraries which allow you to draw beautiful graphs. However, for the beginner, it's best to use the simple plot function.

Let's say we've loaded the dataset and want to do a scatterplot of price against weight. We can type

plot(auto$price,auto$weight)

The scatterplot appears on the bottom right corner.



However, it's a bare scatterplot. Let's make it better looking. First off, we can label the x and y axes more appropriately:

plot(auto$price,auto$weight, xlab="Price",ylab="Weight")

We can also add a title:

plot(auto$price,auto$weight, xlab="Price",ylab="Weight", main="Price vs. Weight")

For those who want more advanced graphing, take a look at the ggplot2 and lattice libraries.


Task #13: Handle dates and times

This is a little difficult to teach using the auto dataset (since there are no dates), but you can find a good tutorial here.

Task #14: Obtain substrings

Take a look at the variable "make". Notice that it takes on values such as "AMC Concord", "AMC Pacer", "Buick Century", and so on. Let's say that the first word is the car manufacturer's name, and the second word refers to the model. How can we separate the two?

Libraries come to the rescue again.

install.packages("stringr")
library(stringr)
auto$manufacturer <- str_split_fixed(auto$make, " ",2)[,1]
auto$model <- str_split_fixed(auto$make, " ",2)[,2]

str_split_fixed is a function within the stringr library.
The first parameter tells us which dataframe column we want to split, and " " is the instruction to look for a space. The last parameter (2) means that we are expecting two columns. Finally, [,1] and [,2] refer to the first and second column respectively.

At this point, it's useful to check that you've got the desired results.
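For example (a quick sketch):

head(auto[, c("make", "manufacturer", "model")])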


Task #15: Encoding variables.

Look at the variable "foreign". In Stata, there is the "encode" command, which will allow you to code the different string values of foreign into numeric variables.

Here, you can do

auto$foreign_dummy <- as.numeric(auto$foreign)

and you will see 1's and 2's.
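If you would rather have a 0/1 dummy (0 = Domestic, 1 = Foreign), a simple sketch (the name foreign_dummy01 is just for illustration) is:

auto$foreign_dummy01 <- as.numeric(auto$foreign) - 1
# or, equivalently:
auto$foreign_dummy01 <- ifelse(auto$foreign == "Foreign", 1, 0)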