Suppose we want to find the relationship between the price and weight of a car. One way to do this is

priceweight <- lm(auto$price ~ auto$weight)

Alternatively,

priceweight <- lm(price ~ weight, data=auto)

will work as well.

The command summary(priceweight) will then display details of the regression such as coefficient estimates and p-values.
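If we only need the numbers, the estimates and the full coefficient table can also be extracted directly (a small sketch using the model fitted above):

coef(priceweight)                    # just the coefficient estimates
summary(priceweight)$coefficients    # estimates, standard errors, t values and p-values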

We can then visualize the data by

plot(auto$weight,auto$price)

abline(priceweight)

Notice that for plot, the first argument is plotted on the x-axis, while the second is plotted on the y-axis.

The function abline draws a straight line through the plot, using the coefficients produced by the regression. You can always produce other lines if you want. For example,

abline(a=0,b=1,col='blue')

produces a straight line with an intercept of 0 and a slope of 1. In other words, this would be the model if price and weight had a 1:1 relationship (clearly not supported by the data). Notice that specifying a blue color helps to differentiate this line from the best-fit line.

If we wanted to, we could emphasize the best fit line more:

abline(priceweight, lwd = 5)

Here lwd stands for line width. The line will be 5 units thick (the default is 1 unit, so we have increased the width drastically).
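Putting the pieces together, a minimal sketch of the whole plot might look like this (the axis labels and legend position are just illustrative choices):

plot(auto$weight, auto$price, xlab = "Weight", ylab = "Price")  # scatterplot: weight on x, price on y
abline(priceweight, lwd = 5)                                    # thick best-fit line from the regression
abline(a = 0, b = 1, col = "blue")                              # hypothetical 1:1 line for comparison
legend("topleft", legend = c("best fit", "1:1 line"), col = c("black", "blue"), lwd = c(5, 1))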

## Squared terms

It might be reasonable to think that price is a non-linear function of weight. How can we see whether, for example, price is a quadratic function of weight?

An intuitive guess may be

priceweight2 <- lm(price ~ weight + weight^2, data=auto)

But it doesn't quite produce the results we want:

Call:
lm(formula = price ~ weight + weight^2, data = auto)

Coefficients:
(Intercept)       weight
     -6.707        2.044

Notice that the squared term has silently disappeared from the output. What you actually wanted to do was this:

priceweight2 <- lm(price ~ weight + I(weight^2), data=auto)

The I() means "inhibit". In this case, without I(), ^ would be interpreted as a formula operator. Using I() allows ^ to be interpreted as an arithmetic operator instead. If you are confused, think of it this way: ^ has a special meaning when it appears in a formula. Using I() suppresses this special meaning and forces R to treat it in the normal way (exponentiation).
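To see the formula meaning of ^ in action: in a formula, (weight + foreign)^2 expands into main effects plus their interaction rather than squaring anything. As an alternative to I(), a raw polynomial can also be specified with poly(). A sketch (the model name priceweight2_alt is just illustrative):

attr(terms(~ (weight + foreign)^2), "term.labels")   # "weight" "foreign" "weight:foreign"
priceweight2_alt <- lm(price ~ poly(weight, 2, raw = TRUE), data = auto)   # fits the same model as weight + I(weight^2)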

With this in mind, we can do a standard F test to determine whether model fit significantly improves with the addition of the squared term.

anova(priceweight,priceweight2)

Analysis of Variance Table

Model 1: price ~ weight

Model 2: price ~ weight + I(weight^2)

  Res.Df       RSS Df Sum of Sq      F    Pr(>F)
1     72 450831459
2     71 384779934  1  66051525 12.188 0.0008313 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
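As a sanity check, the F statistic in the table can be reproduced by hand from the two residual sums of squares:

rss1 <- 450831459                  # RSS of the linear model (72 residual df)
rss2 <- 384779934                  # RSS of the quadratic model (71 residual df)
((rss1 - rss2) / 1) / (rss2 / 71)  # about 12.188, matching the F value above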

Model fit improves significantly: the null hypothesis is rejected, which lends support to the inclusion of a squared term. The test, however, is by no means definitive - many other things need to be considered. For example, you could run

plot(priceweight)

plot(priceweight2)

and end up with the following plots:

*Linear case:*

*Quadratic case:*

It looks like in the quadratic case, the residuals are much less correlated with the fitted values. This also suggests that a quadratic model is better.
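As an aside, calling plot() on an lm object actually cycles through several diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage). A small sketch to see all four at once:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(priceweight2)
par(mfrow = c(1, 1))   # restore the default single-plot layout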


## Interaction Variables

Suppose we hypothesized that the relationship between price and weight was different for foreign cars as compared to domestic cars.

Then we can use

priceweight <- lm(price ~ weight + weight:foreign, data=auto)

We might even think that being foreign has some effect, so let's throw the variable in:

priceweight <- lm(price ~ weight + weight:foreign + foreign, data=auto)

But hey, all these can be simplified:

priceweight <- lm(price ~ weight*foreign, data=auto)

Notice that weight*foreign is simply shorthand for weight + weight:foreign + foreign.
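As a quick check that the shorthand really is equivalent, both formulas should produce identical coefficients (a sketch; the names long and short are just illustrative):

long <- lm(price ~ weight + weight:foreign + foreign, data = auto)
short <- lm(price ~ weight*foreign, data = auto)
all.equal(coef(long), coef(short))   # TRUE: the two specifications are the same model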

## Logistic Regression

Logistic regression is an instance of a generalized linear model (GLM). Notice that the variable foreign takes on only two values: Domestic and Foreign. Suppose (purely for the sake of example) we thought that price was a predictor of a car being foreign. We can then run

nonlinear <- glm(foreign~price,data=auto,family=binomial)

summary(nonlinear)

The argument family=binomial ensures that a logistic regression is run (there are many kinds of GLMs, so we want to pick the right one). Naturally, one may be unsure whether "Domestic" is 0 or 1, but one can easily verify this with

contrasts(auto$foreign)

which tells us that R treats "Domestic" as being equal to 0 and "Foreign" as being equal to 1.
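Under this coding, the fitted values are probabilities that a car is foreign. As a sketch, the logistic transform can also be applied by hand (the price of 6000 is just a hypothetical value):

b <- coef(nonlinear)        # intercept and price coefficient
eta <- b[1] + b[2]*6000     # linear predictor at a hypothetical price of 6000
1/(1 + exp(-eta))           # logistic transform gives P(Foreign)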

We can then predict "Foreign" whenever the fitted probability is at least 0.5, and compare the predictions with the actual values:

predictedprob <- predict(nonlinear, type="response")   # fitted probabilities P(Foreign)

auto$predicted <- rep("Domestic", nrow(auto))          # start by predicting "Domestic" for every car

auto$predicted[predictedprob >= 0.5] <- "Foreign"      # switch to "Foreign" when the probability reaches 0.5

table(auto$predicted, auto$foreign)                    # rows: predicted, columns: actual

which gives us the confusion matrix

          Domestic Foreign
Domestic        52      22
Foreign          0       0

The last row may not appear on your screen, but I added it for clarity. Notice that price is a very poor predictor of whether a car is domestic or foreign: our model predicts that all cars are domestically produced, resulting in a large number of errors. We either need to find better predictors, or tweak the threshold above which we predict a car is of foreign origin.
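As a sketch of the second option, here is the same prediction rule with a hypothetical threshold of 0.3 instead of 0.5:

auto$predicted <- ifelse(predictedprob >= 0.3, "Foreign", "Domestic")   # lower, hypothetical cutoff
table(auto$predicted, auto$foreign)                                     # rebuild the confusion matrix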
