In this part, we will learn about sorting, generating/dropping new variables, linear regression, and post-estimation commands.
If you haven't loaded the auto dataset, load it.
Task #6: Sort
While Stata has "sort", R has "order".
After you've loaded the dataset, enter
auto[order(auto$price),]
(don't forget the comma just before the closing bracket!)
Notice that the dataset is now sorted by price, in ascending order:
              make price mpg rep78 ...
34    Merc. Zephyr  3291  20     3 ...
14  Chev. Chevette  3299  29     3 ...
18     Chev. Monza  3667  24     2 ...
68  Toyota Corolla  3748  31     5 ...
66          Subaru  3798  35     5 ...
3       AMC Spirit  3799  22    NA ...
If we want to sort it in descending order, we can insert "-" before auto$price. In other words, issue the following command:
auto[order(-auto$price),]
You should see that the Cad. Seville appears at the top.
What if we wanted to sort by headroom and then rep78? That's simple:
auto[order(auto$headroom, auto$rep78),]
If you wanted to put NA values first, use
auto[order(auto$headroom, auto$rep78, na.last=FALSE),]
If you wanted to save the sorted results as an object, use "<-", which is the assignment operator:
autosort <- auto[order(auto$headroom, auto$rep78),]
Simply put, the above command creates a new object called autosort, which is assigned the sorted auto dataset.
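One caveat about the minus-sign trick: it only works on numeric columns. Here is a minimal sketch on a small made-up data frame that I've called toy (it borrows a few rows from the output above, but it is not the real auto dataset). It uses order()'s decreasing argument to sort a text column in descending order, and mixes ascending and descending keys:

```r
# Toy stand-in for the auto dataset (hypothetical; only four rows)
toy <- data.frame(
  make  = c("Merc. Zephyr", "Chev. Chevette", "Chev. Monza", "AMC Spirit"),
  price = c(3291, 3299, 3667, 3799),
  rep78 = c(3, 3, 2, NA),
  stringsAsFactors = FALSE
)

# Descending by price: the minus sign works because price is numeric
by_price_desc <- toy[order(-toy$price), ]

# Descending by make: "-" fails on text columns, so use decreasing = TRUE
by_make_desc <- toy[order(toy$make, decreasing = TRUE), ]

# Ascending rep78, ties broken by descending price; NA goes last by default
by_rep_price <- toy[order(toy$rep78, -toy$price), ]

print(by_price_desc$make)
print(by_make_desc$make)
print(by_rep_price$make)
```

Note that decreasing = TRUE flips every sort key at once, which is why the last example falls back on the minus sign for the numeric tie-breaker.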
Task #7: Generate new variables
In Stata, you had "generate" or "gen". Creating new variables in R is equally simple. If we wanted to take logs of prices, the assignment operator comes in handy again:
auto$logprice <- log(auto$price)
A new variable called logprice has been created, and is now part of the dataframe.
You can verify that this is the case by typing:
auto
What if we wanted to generate a dummy variable (or "indicator variable") called highprice, equal to 1 when the price of the car was $8000 or higher, and 0 otherwise? Then the commands
auto$highprice <- 0
auto$highprice[auto$price >= 8000] <- 1
would do the trick.
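The same dummy can also be built in one step, because a comparison in R returns TRUE/FALSE values that convert to 1/0. The sketch below uses a made-up data frame called toy (the prices are invented for illustration, and one is deliberately missing) to show where the two approaches differ: the two-step version leaves a 0 where the price is missing, while the one-step version gives NA.

```r
# Hypothetical prices, including one missing value
toy <- data.frame(price = c(3291, 8129, 15906, NA))

# Two-step version from the text: start at 0, then overwrite with 1
toy$highprice <- 0
toy$highprice[toy$price >= 8000] <- 1

# One-step alternative: convert the TRUE/FALSE comparison to 1/0
toy$highprice2 <- as.numeric(toy$price >= 8000)

print(toy)
```

Which behaviour you want for missing prices is a judgment call; the point is only that the two versions do not treat NA the same way.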
Task #8: Dropping variables and observations
Compared to Stata, you may need one more command to drop a variable in R, but it's still trivial. Let's say I want to drop the newly created variable logprice:
drops <- "logprice"
auto <- auto[,!(names(auto) %in% drops)]
The second command essentially replaces the auto dataset with a "new" dataset containing all variables in auto except logprice. Remember the comma!
What if I wanted to drop two variables? Use c(), which combines (concatenates) values into a vector:
drops <- c("headroom","trunk")
auto <- auto[,!(names(auto) %in% drops)]
and you will find that both headroom and trunk have been dropped.
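There are other ways to drop variables that you may find easier to read. Here is a minimal sketch on a made-up data frame called toy (the values are invented). It shows assigning NULL to a column, using subset() with a minus sign, and one pitfall of the bracket approach: if only a single column survives, R hands back a plain vector unless you add drop = FALSE.

```r
# Hypothetical two-row stand-in for the auto dataset
toy <- data.frame(price = c(3291, 3299), headroom = c(3.0, 2.5),
                  trunk = c(15, 9), mpg = c(20, 29))

# Drop a single column by assigning NULL to it
one_dropped <- toy
one_dropped$trunk <- NULL

# Drop several columns with subset(); the minus sign means "all except these"
two_dropped <- subset(toy, select = -c(headroom, trunk))

# Pitfall: when only one column is left, [ , ] returns a vector, not a dataframe
drops <- c("headroom", "trunk", "mpg")
as_vector <- toy[, !(names(toy) %in% drops)]
as_frame  <- toy[, !(names(toy) %in% drops), drop = FALSE]

print(names(one_dropped))
print(names(two_dropped))
print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
```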
What if I wanted to drop certain observations? To drop all cars priced at $10000 or above (that is, to keep only those priced below $10000), I could simply run:
auto <- auto[auto$price < 10000,]
Again, don't forget the comma towards the end. Experienced programmers will have figured out why we need the comma, and why it is placed where it is. For those who are less sure, type
auto[3,1]
and notice the output is
[1] AMC Spirit
In other words, R is giving us the value in the third row and first column of the dataframe (think of the dataframe as the dataset). So the value before the comma indicates the row, and the value after the comma indicates the column. If we wanted all columns of the third observation (i.e. the entire third row), we would simply type
auto[3,]
omitting all numbers after the comma. It should now be clear why, in previous commands, the commas were placed where they were placed.
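To make the row/column logic concrete, here is a small sketch on a made-up three-row data frame called toy ("Car A" and "Car B" and their prices are placeholders I invented; only the third row mirrors the example above). Whatever goes before the comma picks rows, and whatever goes after it picks columns.

```r
# Hypothetical stand-in: only the third row mirrors the auto example
toy <- data.frame(
  make  = c("Car A", "Car B", "AMC Spirit"),
  price = c(4099, 4749, 3799),
  stringsAsFactors = FALSE
)

cell  <- toy[3, 1]                 # one cell: row 3, column 1
row3  <- toy[3, ]                  # entire third row (all columns)
col2  <- toy[, 2]                  # entire second column (all rows), as a vector
cheap <- toy[toy$price < 4000, ]   # TRUE/FALSE values before the comma pick rows

print(cell)
print(row3)
print(col2)
print(cheap)
```

The last line is the same pattern used to drop observations: the condition produces one TRUE/FALSE per row, and only the TRUE rows are kept.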
Remark: The beauty of R as compared to Stata is that one can create new datasets on the fly and keep both new and old datasets in memory at the same time. One can therefore do analyses on several datasets while keeping only one window open. For example, auto_lowprice <- auto[auto$price < 5000,] creates a new dataset containing low-priced autos (below $5000). I can analyse both auto_lowprice and auto in the same window.
Task #9: Linear regression
In Stata, you have "reg" or "regress". R's analogous command is lm, short for "linear model".
Let's drop and load the auto dataset again, to ensure that we're on the same page. If you are unsure how to do this, use rm(auto) to remove the dataset from memory, and then load it by following the instructions in part 1.
Having loaded the original auto dataset, we can now run a linear regression on the entire sample. Suppose we want to regress price on weight and length (i.e. price is the dependent variable, and weight and length are the independent variables).
The command
lm(price ~ weight + length, auto)
spits out the following output:
Call:
lm(formula = price ~ weight + length, data = auto)

Coefficients:
(Intercept)       weight       length
  10386.541        4.699      -97.960
The syntax goes like this:
lm(dependent var. ~ independent variables, dataset)
Notice that independent variables are separated by a plus ("+") sign.
As an experienced Stata user, you must be screaming now. "Where are my standard errors and p-values?"
That's simple.
summary(lm(price ~ weight + length, auto))
should give you what you want:
Call:
lm(formula = price ~ weight + length, data = auto)

Residuals:
    Min      1Q  Median      3Q     Max
-4432.3 -1601.9  -523.6  1218.2  5771.0

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10386.541   4308.159   2.411   0.0185 *
weight          4.699      1.122   4.187    8e-05 ***
length        -97.960     39.175  -2.501   0.0147 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2416 on 71 degrees of freedom
Multiple R-squared:  0.3476,  Adjusted R-squared:  0.3292
F-statistic: 18.91 on 2 and 71 DF,  p-value: 2.607e-07
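If you want the standard errors or p-values as numbers you can work with, rather than as printed text, they can be pulled out of the summary object. Here is a minimal sketch, assuming R's built-in mtcars dataset as a stand-in for auto (so the variables mpg, wt and hp are not part of this tutorial's data, and the names fit, s and coef_table are mine):

```r
# mtcars ships with R; it stands in for the auto dataset in this sketch
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

# The coefficient table is a matrix with one row per regressor
coef_table <- coef(s)
std_errors <- coef_table[, "Std. Error"]
p_values   <- coef_table[, "Pr(>|t|)"]

# R-squared and adjusted R-squared are stored in the summary as well
r_squared     <- s$r.squared
adj_r_squared <- s$adj.r.squared

print(std_errors)
print(p_values)
print(c(r_squared, adj_r_squared))
```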
Task #10: Post-estimation commands
R, like Stata, has numerous post-estimation commands. With this in mind, it may be useful to actually save the linear model before performing post-estimation commands:
my_model <- lm(price ~ weight + length, auto)
saves the linear model as my_model, following which
summary(my_model)
displays the results (just like we saw previously). Here's a sample of post-estimation commands:
confint(my_model, level = 0.9) # gives 90% confidence intervals for the regression coefficients
fitted(my_model) # gives the predicted value for each observation, based on the estimated regression coefficients
residuals(my_model) # gives the residuals
influence(my_model) # gives the basic quantities which are used in forming a wide variety of diagnostics for checking the quality of regression fits.
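Two more that Stata users tend to reach for are coef() and predict(). The sketch below again assumes R's built-in mtcars dataset as a stand-in for auto; the "new car" with wt = 3.0 and hp = 120 is invented purely for illustration. It predicts the outcome for an observation that is not in the sample, and checks that fitted values plus residuals give back the dependent variable.

```r
# mtcars ships with R; it stands in for the auto dataset in this sketch
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector of estimated coefficients
b <- coef(fit)

# Prediction, with a 90% confidence interval, for a hypothetical new car
new_car <- data.frame(wt = 3.0, hp = 120)
pred <- predict(fit, newdata = new_car, interval = "confidence", level = 0.9)

# Sanity check: fitted values + residuals reproduce the observed mpg
check <- all.equal(unname(fitted(fit) + residuals(fit)), mtcars$mpg)

print(b)
print(pred)
print(check)
```

The columns of pred are named fit, lwr and upr: the point prediction and the lower and upper bounds of the interval.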
You've doubled your R knowledge by being able to do five more tasks. Congratulations!
In part 3, we learn about saving work, scatterplots, handling dates and times, and obtaining substrings.