# R by examples

Step-by-step R

## Sunday, March 20, 2016

### Data cleaning in R

Microsoft has a good guide on using R to prepare data. It covers both basic and advanced techniques, and there is a relevant video on the topic as well.
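As a small taste of what data cleaning involves, here is a minimal sketch: fixing column types, trimming stray whitespace, and imputing missing values. (The little data frame below is made up for illustration; it is not taken from the guide.)

```r
# a small made-up data frame with some common problems
df <- data.frame(
  price = c(" 4099", "4749", NA, "5199 "),   # numbers stored as text, with spaces and an NA
  make  = c("AMC ", "amc", "Buick", " buick"),
  stringsAsFactors = FALSE
)

df$price <- as.numeric(trimws(df$price))  # strip spaces, convert to numeric
df$make  <- tolower(trimws(df$make))      # standardize text values
df$price[is.na(df$price)] <- median(df$price, na.rm = TRUE)  # impute missing prices

str(df)
```

After these steps every price is numeric and non-missing, and the make labels are consistent.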

## Wednesday, March 9, 2016

### Monte Carlo simulation in R

Here's how to do Monte Carlo simulation to demonstrate that traditional (OLS) regression estimates are unbiased and consistent.
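A minimal sketch of such a simulation looks like this (the true parameters below, intercept 1 and slope 2, are made up for illustration):

```r
set.seed(123)
n <- 100       # sample size in each simulated dataset
reps <- 2000   # number of simulated datasets
beta.hat <- numeric(reps)

for (r in 1:reps) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)          # true model: beta0 = 1, beta1 = 2
  beta.hat[r] <- coef(lm(y ~ x))[2]  # store the OLS slope estimate
}

mean(beta.hat)  # averages out close to the true slope of 2
```

Unbiasedness shows up as the average of beta.hat being close to 2; re-running with a larger n shrinks the spread of the estimates around 2, illustrating consistency.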

### Simplifying Multiple Regression with Frisch-Waugh-Lovell

Suppose I want to do regression and have two independent variables (let's call them X1 and X2). Is it possible for me to find out the effect of X1 on my dependent variable (Y), controlling for X2 by just using simple linear regression? (Answer: Yes).

More generally (and more realistically), say we have 10,000 variables but our computer only has enough memory to run a regression with 6,000 variables. (Having 10,000 variables would not be unusual these days given the size of modern datasets; some of the variables could capture seasonality and other easily observable factors.) Can we still run a regression controlling for all explanatory variables, despite our limited memory? Again, the answer is yes, thanks to the Frisch-Waugh-Lovell theorem.

The video below uses R to illustrate how the Frisch-Waugh-Lovell theorem is used (a simple example is given on purpose, but it generalizes easily).

Wondering how the Frisch-Waugh-Lovell theorem works? Here's the proof of the theorem.
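The idea can be sketched in R with made-up data: regress Y on X2 and X1 on X2, then regress the residuals of Y on the residuals of X1. The slope from that simple regression is exactly the coefficient on X1 from the full multiple regression.

```r
set.seed(1)
n <- 200
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)             # X1 is correlated with X2
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)  # made-up true model

b.full <- coef(lm(y ~ x1 + x2))["x1"]  # coefficient on X1, controlling for X2

# Frisch-Waugh-Lovell: partial X2 out of both y and x1, then use simple regression
ry  <- resid(lm(y ~ x2))
rx1 <- resid(lm(x1 ~ x2))
b.fwl <- coef(lm(ry ~ rx1))["rx1"]

all.equal(unname(b.full), unname(b.fwl))  # TRUE
```

This is also why the memory-constrained case works: we never had to run the full regression to recover the coefficient on x1.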


## Sunday, July 12, 2015

### Linear regression

Let's try regression on the auto dataset:

```r
library(leaps)  # regsubsets() lives in the leaps package
regfit.full <- regsubsets(price ~ ., auto)
```

It doesn't work because, by default, an exhaustive search over all subsets of variables is performed. This requires examining a huge number of combinations, so you have to set really.big=TRUE if you really want it.

But let's just make things simpler by carving out a subset of variables:

```r
autosmall <- auto[, c(2,3,5,6,7,8,9,10,11)]
regfit.full <- regsubsets(price ~ ., autosmall)  # the . indicates we want all remaining variables considered
summary(regfit.full)
regfit.fwd <- regsubsets(price ~ ., autosmall, method = "forward")   # forward stepwise selection
regfit.bwd <- regsubsets(price ~ ., autosmall, method = "backward")  # backward stepwise selection
coef(regfit.fwd, 3)  # coefficients of the optimal 3-variable model from forward selection
```

Recall that we actually need to split our dataset into a training set and test set. This is how we can do it:

```r
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(autosmall), rep = TRUE)
# if we wanted roughly 75% of observations in the training set:
# train <- sample(c(TRUE, FALSE), nrow(autosmall), rep = TRUE, prob = c(0.75, 0.25))
test <- (!train)
# another way of allocating the training/test sets:
# train <- sample(1:nrow(autosmall), nrow(autosmall)/2)
# test <- (-train)
# autosmall.test <- autosmall[test,]
# (test contains negative indices, so all "train" observations are excluded)
```

To keep things simple, we didn't do cross-validation (or regularization) above. Let's work towards that now, starting by running best subset selection on the training set only:

```r
regfit.full <- regsubsets(price ~ ., autosmall[train,])
reg.summary <- summary(regfit.full)  # for convenience
names(reg.summary)
```

For model selection, it'll be useful to plot how RSS (and other measures such as adjusted R-squared) vary with the number of variables:

```r
par(mfrow = c(1,2))  # sets up a 1-row by 2-column plot
plot(reg.summary$rss, xlab = "# of Var", ylab = "RSS", type = "l")
(m1 <- which.min(reg.summary$rss))
points(m1, reg.summary$rss[m1], col = "red", cex = 2, pch = 20)  # cex sets the dot size
plot(reg.summary$adjr2, xlab = "# of Var", ylab = "Adj R2", type = "l")
(m2 <- which.max(reg.summary$adjr2))
points(m2, reg.summary$adjr2[m2], col = "red", cex = 2, pch = 20)
plot(regfit.full, scale = "r2")
```

Now we need to see MSE on the test set. First off, use model.matrix to generate the "X" matrix for the test set:

```r
test.mat <- model.matrix(price ~ ., data = autosmall[test,])
val.errors <- rep(NA, 8)  # empty vector to hold the test MSEs
for (i in 1:8) {
  coefi <- coef(regfit.full, id = i)          # coefficients of the optimal i-variable model
  pred <- test.mat[, names(coefi)] %*% coefi  # predicted values for each test observation
  val.errors[i] <- mean((autosmall$price[test] - pred)^2)  # test MSE of the i-th model
}
val.errors
#[1]  7781131  9856769 10060398  9574022  9421108  9470527  9269711  8942316
(v <- which.min(val.errors))
#[1] 1
coef(regfit.full, v)  # so we choose the model with one variable (weight only)
#(Intercept)      weight
#-426.910662    2.251572
```

We can also write a predict method for regsubsets objects, which will come in useful for k-fold cross-validation:

```r
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])  # the formula is the second item in the object$call list
  mat <- model.matrix(form, newdata)    # builds the required X matrix from the formula
  coefi <- coef(object, id = id)        # coefficients of the optimal id-variable model
  xvars <- names(coefi)                 # the X variables of that model
  mat[, xvars] %*% coefi                # predicted values, using the relevant columns only
}
```

```r
k <- 10  # 10-fold cross-validation
set.seed(1)
folds <- sample(1:k, nrow(autosmall), replace = TRUE)
# an observation labelled j belongs to the j-th fold,
# i.e. it is left out during the j-th round of cross-validation
cv.errors <- matrix(NA, k, 8, dimnames = list(NULL, paste(1:8)))
# a k-by-8 matrix of NAs; columns are named 1 to 8

# armed with our new predict() method, we can:
for (j in 1:k) {
  # fit the best models on everything except the j-th fold
  best.fit <- regsubsets(price ~ ., data = autosmall[folds != j,])
  # note: with more than 8 variables, specify nvmax in the call above
  for (i in 1:8) {
    # calculate MSE on the held-out j-th fold; since best.fit is of class
    # regsubsets, the predict() method we just wrote is dispatched here
    pred <- predict(best.fit, autosmall[folds == j,], id = i)
    cv.errors[j,i] <- mean((autosmall$price[folds == j] - pred)^2)
  }
}
(mean.cv.errors <- apply(cv.errors, 2, mean))
```

Let's plot the cross-validated MSE against the number of variables:

```r
par(mfrow = c(1,1))
plot(mean.cv.errors, type = 'b')
```


### Overview of Machine Learning Techniques in R

Whenever we do something as simple as a Google search, we benefit from machine learning. Demand is high for those with machine learning skills - a search of LinkedIn reveals 9,369 results for "machine learning".

This post gives an overview of packages (and functions within those packages) one can use to conduct machine learning techniques. A lot of this material is drawn from the fourth edition of the textbook Introduction to Statistical Learning.


- Linear regression (library: leaps; function: regsubsets)
- Decision trees
- Support Vector Machines
- Unsupervised learning

Subsequent posts will elaborate on these in more detail.

### Apply, lapply, sapply, tapply in R

How can we use R to efficiently calculate key statistics such as the mean?

Let's play with the auto dataset to see how.

```r
apply(cbind(auto$price, auto$length, auto$weight), 2, mean)  # finds simple means
```

apply(auto, 2, mean) would not work because some columns are non-numeric. Instead, one can try apply(auto[,c(2,3)], 2, mean).

If you wanted to find out the number of values each variable took, this would be simple:

```r
apply(auto, 2, function(x) length(unique(x)))
```

The second argument (2) tells R to work column-by-column. Had 1 been specified, R would have worked row-by-row. You can also use the sapply function, which by default returns a vector:

```r
sapply(auto, function(x) length(unique(x)))
```

If we wanted the output in terms of a list, we could use lapply:

```r
lapply(auto, function(x) length(unique(x)))
```

What if we wanted disaggregated statistics? For example, we might want to know what is the mean price of foreign cars, as well as the mean price of domestic cars. We can use

```r
tapply(auto$price, auto$foreign, mean)
```

Which would be similar to Excel's PivotTable. The by function is also acceptable:

```r
by(auto$price, auto$foreign, mean)
```

What if the variable has NA values? For example, by(auto$rep78, auto$foreign, mean) returns the unglamorous

```
auto$foreign: Domestic
[1] NA
-----------------
auto$foreign: Foreign
[1] NA
```

This is because rep78 contains missing values. We can easily handle this:

```r
tapply(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
```

or

```r
by(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
```

And what if we wanted to restrict attention to cars with prices above $10,000? We subset both the values and the grouping variable, so their lengths still match:

```r
tapply(auto$rep78[auto$price > 10000], auto$foreign[auto$price > 10000],
       function(x) mean(x, na.rm=TRUE))
```

or

```r
by(auto$rep78[auto$price > 10000], auto$foreign[auto$price > 10000],
   function(x) mean(x, na.rm=TRUE))
```

Yet another way is to subset the data first:

```r
m <- subset(auto, !is.na(auto$rep78) & auto$price > 10000)
by(m$rep78, m$foreign, mean)
```


### Useful resources for learning R

- R-bloggers is a collection of R blogs.
- UCLA has a comprehensive guide.
- Try tryr.
