Sunday, March 20, 2016

Data cleaning in R

Microsoft has a good guide on using R to prepare data. It covers both basic and advanced techniques. Another relevant video is below:

Wednesday, March 9, 2016

Monte Carlo simulation in R

Here's how to use Monte Carlo simulation to demonstrate that ordinary least squares (OLS) regression estimates are unbiased and consistent.
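As a rough sketch of the idea (not necessarily the approach taken in the linked material), the snippet below repeatedly draws data from a known model, fits OLS each time, and looks at the distribution of the slope estimates. The data-generating process (y = 1 + 2x + e with standard normal x and e), the sample sizes, and the helper names `ols_slope` and `simulate` are all assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the run is reproducible

# Assumed (hypothetical) data-generating process: y = 1 + 2*x + e
beta0 <- 1
beta1 <- 2

# Draw one sample of size n and return the OLS slope estimate
ols_slope <- function(n) {
  x <- rnorm(n)
  e <- rnorm(n)                 # errors have mean zero and are independent of x
  y <- beta0 + beta1 * x + e
  unname(coef(lm(y ~ x))[2])    # second coefficient is the slope on x
}

# Repeat the experiment `reps` times for a given sample size
simulate <- function(n, reps = 1000) replicate(reps, ols_slope(n))

small <- simulate(n = 20)
large <- simulate(n = 2000)

# Unbiasedness: the average estimate should be close to beta1 even when n is small
print(mean(small))

# Consistency: the spread of the estimates around beta1 should shrink as n grows
print(sd(small))
print(sd(large))
```

If the estimator is unbiased, `mean(small)` should sit near 2 despite the small sample; if it is consistent, `sd(large)` should be much smaller than `sd(small)`.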

Simplifying Multiple Regression with the Frisch-Waugh-Lovell Theorem

Suppose I want to run a regression with two independent variables (call them X1 and X2). Can I find the effect of X1 on my dependent variable (Y), controlling for X2, using only simple linear regressions? (Answer: yes.)

More generally (and more realistically), say we have 10000 variables but our computer only has enough memory to run a regression with 6000 variables. (Having 10000 variables would not be unusual these days given large datasets; some of the variables could also capture seasonality and other easily observable factors.) Can we still run a regression controlling for all explanatory variables, despite our limited memory? Again, the answer is yes, thanks to the Frisch-Waugh-Lovell theorem.
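The two-variable case can be sketched in R as follows. This is a minimal illustration on simulated data, not the example from the video; the coefficients, sample size, and variable names are assumptions. The idea is to partial X2 out of both Y and X1, then regress the residuals of Y on the residuals of X1.

```r
set.seed(1)  # fixed seed so the run is reproducible

# Simulated (hypothetical) data in which x1 and x2 are correlated
n  <- 500
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Benchmark: the full multiple regression
full <- lm(y ~ x1 + x2)

# Frisch-Waugh-Lovell using only simple regressions:
# 1. remove the part of y explained by x2
y_res  <- resid(lm(y ~ x2))
# 2. remove the part of x1 explained by x2
x1_res <- resid(lm(x1 ~ x2))
# 3. regress residual on residual (both have mean zero, so no intercept is needed)
fwl <- lm(y_res ~ x1_res - 1)

b_full <- unname(coef(full)["x1"])
b_fwl  <- unname(coef(fwl)["x1_res"])

print(c(full = b_full, fwl = b_fwl))
print(all.equal(b_full, b_fwl))
```

The same three steps apply when x2 is replaced by a whole block of control variables, which is what makes it possible to split a large regression into smaller ones.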

The video below uses R to illustrate how the Frisch-Waugh-Lovell theorem is applied (the example is deliberately simple, but it generalizes easily).

Wondering why the Frisch-Waugh-Lovell theorem works? Here's the proof of the theorem.
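For readers who want the gist without following the link, here is a brief sketch of the standard argument. The notation (X_1, X_2, M_2, b_1, b_2) is my own and may differ from the linked proof; it assumes X_2'X_2 and X_1'M_2X_1 are invertible.

```latex
% Fitted OLS regression of y on both blocks of regressors:
%   y = X_1 b_1 + X_2 b_2 + \hat{e}, with X_1'\hat{e} = 0 and X_2'\hat{e} = 0.
% Residual-maker matrix for X_2 (symmetric and idempotent, with M_2 X_2 = 0):
\[
  M_2 = I - X_2 (X_2' X_2)^{-1} X_2'
\]
% Premultiply the fitted equation by M_2. Since M_2 X_2 = 0 and
% M_2 \hat{e} = \hat{e} (because X_2'\hat{e} = 0):
\[
  M_2 y = M_2 X_1 b_1 + \hat{e}
\]
% Premultiply by X_1' M_2 and use X_1' M_2 \hat{e} = X_1'\hat{e} = 0:
\[
  X_1' M_2 y = X_1' M_2 X_1 \, b_1
  \quad\Longrightarrow\quad
  b_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y
\]
% This is exactly the OLS coefficient from regressing M_2 y on M_2 X_1,
% that is, the residuals of y on the residuals of X_1 after partialling out X_2.
```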