Suppose we want to find the relationship between the price and weight of a car. One way to do this is

priceweight <- lm(auto$price ~ auto$weight)

Alternatively,

priceweight <- lm(price ~ weight, data=auto)

will work as well.

The command summary(priceweight) will then display details of the regression such as coefficient estimates and p-values.
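If we only need the numbers, the estimates and the full coefficient table can also be extracted directly (a small sketch using the model fitted above):

coef(priceweight)                    # just the coefficient estimates
summary(priceweight)$coefficients    # estimates, standard errors, t values and p-values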

We can then visualize the data by

plot(auto$weight,auto$price)

abline(priceweight)

Notice that for plot, the first argument is plotted on the x-axis, while the second is plotted on the y-axis.

The function abline draws a straight line through the plot, using the coefficients produced by the regression. You can always produce other lines if you want. For example,

abline(a=0,b=1,col='blue')

produces a straight line with an intercept of 0 and a slope of 1. In other words, this would be the model if price and weight had a 1:1 relationship (clearly not supported by the data). Notice that specifying a blue color helps to differentiate this line from the best-fit line.

If we wanted to, we could emphasize the best fit line more:

abline(priceweight, lwd = 5)

Here lwd stands for line width. The line will be 5 units thick (the default is 1 unit, so we have increased the width drastically).
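Putting the pieces together, a minimal sketch of the whole plot might look like this (the axis labels and legend position are just illustrative choices):

plot(auto$weight, auto$price, xlab = "Weight", ylab = "Price")  # scatterplot: weight on x, price on y
abline(priceweight, lwd = 5)                                    # thick best-fit line from the regression
abline(a = 0, b = 1, col = "blue")                              # hypothetical 1:1 line for comparison
legend("topleft", legend = c("best fit", "1:1 line"), col = c("black", "blue"), lwd = c(5, 1))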

## Squared terms

It might be reasonable to think that price is a non-linear function of weight. How can we see whether, for example, price is a quadratic function of weight?

An intuitive guess may be

priceweight2 <- lm(price ~ weight + weight^2, data=auto)

But it doesn't quite produce the results we want:

Call:
lm(formula = price ~ weight + weight^2, data = auto)

Coefficients:
(Intercept)       weight
     -6.707        2.044

Notice that the squared term has silently disappeared from the output. What you actually wanted to do was this:

priceweight2 <- lm(price ~ weight + I(weight^2), data=auto)

The I() means "inhibit". In this case, without I(), ^ would be interpreted as a formula operator. Using I() allows ^ to be interpreted as an arithmetic operator instead. If you are confused, think of it this way: ^ has a special meaning when it appears in a formula. Using I() suppresses this special meaning and forces R to treat it in the normal way (exponentiation).
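To see the formula meaning of ^ in action: in a formula, (weight + foreign)^2 expands into main effects plus their interaction rather than squaring anything. As an alternative to I(), a raw polynomial can also be specified with poly(). A sketch (the model name priceweight2_alt is just illustrative):

attr(terms(~ (weight + foreign)^2), "term.labels")   # "weight" "foreign" "weight:foreign"
priceweight2_alt <- lm(price ~ poly(weight, 2, raw = TRUE), data = auto)   # fits the same model as weight + I(weight^2)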

With this in mind, we can do a standard F test to determine whether model fit significantly improves with the addition of the squared term.

anova(priceweight,priceweight2)

Analysis of Variance Table

Model 1: price ~ weight

Model 2: price ~ weight + I(weight^2)

  Res.Df       RSS Df Sum of Sq      F    Pr(>F)
1     72 450831459
2     71 384779934  1  66051525 12.188 0.0008313 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
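As a sanity check, the F statistic in the table can be reproduced by hand from the two residual sums of squares:

rss1 <- 450831459                  # RSS of the linear model (72 residual df)
rss2 <- 384779934                  # RSS of the quadratic model (71 residual df)
((rss1 - rss2) / 1) / (rss2 / 71)  # about 12.188, matching the F value above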

Model fit improves significantly: the null hypothesis is rejected, which lends support to the inclusion of a squared term. The test, however, is by no means definitive - many other things need to be considered. For example, you could run

plot(priceweight)

plot(priceweight2)

and end up with the following plots:

*Linear case:*

*Quadratic case:*

It looks like in the quadratic case, the residuals are much less correlated with the fitted values. This also suggests that a quadratic model is better.
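As an aside, calling plot() on an lm object actually cycles through several diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage). A small sketch to see all four at once:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(priceweight2)
par(mfrow = c(1, 1))   # restore the default single-plot layout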


## Interaction Variables

Suppose we hypothesized that the relationship between price and weight was different for foreign cars as compared to domestic cars.

Then we can use

priceweight <- lm(price ~ weight + weight:foreign, data=auto)

We might even think that being foreign has some effect, so let's throw the variable in:

priceweight <- lm(price ~ weight + weight:foreign + foreign, data=auto)

But hey, all these can be simplified:

priceweight <- lm(price ~ weight*foreign, data=auto)

Notice that weight*foreign is simply shorthand for weight + weight:foreign + foreign.
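As a quick check that the shorthand really is equivalent, both formulas should produce identical coefficients (a sketch; the names long and short are just illustrative):

long <- lm(price ~ weight + weight:foreign + foreign, data = auto)
short <- lm(price ~ weight*foreign, data = auto)
all.equal(coef(long), coef(short))   # TRUE: the two specifications are the same model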

## Logistic Regression

Logistic regression is an instance of a generalized linear model (GLM). Notice that the variable foreign takes on only two values: Domestic and Foreign. Suppose (purely for the sake of example) we thought that price was a predictor of a car being foreign. We can then run

nonlinear <- glm(foreign~price,data=auto,family=binomial)

summary(nonlinear)

The argument family=binomial ensures that a logistic regression is run (there are many kinds of GLMs, so we want to pick the right one). Naturally, one may be unsure whether "Domestic" is 0 or 1, but one can easily verify this with

contrasts(auto$foreign)

which tells us that R treats "Domestic" as being equal to 0 and "Foreign" as being equal to 1.
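Under this coding, the fitted values are probabilities that a car is foreign. As a sketch, the logistic transform can also be applied by hand (the price of 6000 is just a hypothetical value):

b <- coef(nonlinear)        # intercept and price coefficient
eta <- b[1] + b[2]*6000     # linear predictor at a hypothetical price of 6000
1/(1 + exp(-eta))           # logistic transform gives P(Foreign)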

We can then predict "Foreign" whenever the fitted probability is at least 0.5, and compare the predictions with the actual values:

predictedprob <- predict(nonlinear, type="response")   # fitted probabilities P(Foreign)

auto$predicted <- rep("Domestic", nrow(auto))          # start by predicting "Domestic" for every car

auto$predicted[predictedprob >= 0.5] <- "Foreign"      # switch to "Foreign" when the probability reaches 0.5

table(auto$predicted, auto$foreign)                    # rows: predicted, columns: actual

which gives us the confusion matrix

          Domestic Foreign
Domestic        52      22
Foreign          0       0

The last row may not appear on your screen, but I added it for clarity. Notice that price is a very poor predictor of whether a car is domestic or foreign: our model predicts that all cars are domestically produced, resulting in a large number of errors. We either need to find better predictors, or tweak the threshold above which we predict a car is of foreign origin.
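As a sketch of the second option, here is the same prediction rule with a hypothetical threshold of 0.3 instead of 0.5:

auto$predicted <- ifelse(predictedprob >= 0.3, "Foreign", "Domestic")   # lower, hypothetical cutoff
table(auto$predicted, auto$foreign)                                     # rebuild the confusion matrix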
