Tuesday, June 30, 2015

R for Stata users (part 2)

This is Part 2 of "How to convert from Stata to R". (If you haven't read Part 1, here's the link).

In this part, we will learn about sorting, generating/dropping new variables, linear regression, and post estimation commands.

If you haven't loaded the auto dataset, load it.

Task #6: Sort
While Stata has "sort", R has "order".
After you've loaded the dataset, enter

auto[order(auto$price),]

(don't forget the comma at the end!)
Notice that the dataset is sorted in ascending order:

                make price mpg rep78 ...
34      Merc. Zephyr  3291  20     3 ...
14    Chev. Chevette  3299  29     3 ...
18       Chev. Monza  3667  24     2 ...
68    Toyota Corolla  3748  31     5 ...
66            Subaru  3798  35     5 ...
3         AMC Spirit  3799  22    NA ...

If we want to sort it in descending order, we can insert "-" before auto$price. In other words, issue the following command:

auto[order(-auto$price),]
You should see that the Cad. Seville appears at the top.
What if we wanted to sort by headroom and then rep78? That's simple:

auto[order(auto$headroom, auto$rep78),]

If you wanted to put NA values first, use

auto[order(auto$headroom, auto$rep78, na.last=FALSE),]

If you wanted to save the sorted results as an object, use "<-", which is the assignment operator:

autosort <- auto[order(auto$headroom, auto$rep78),]

Simply put, the above command creates a new object called autosort, which is assigned the sorted auto dataset.
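The same multi-key sorting, including the "-" trick for a descending key, can be checked on a small toy data frame (hypothetical values, not from the auto dataset):

```r
# Sort ascending by group, then descending by score within each group.
df <- data.frame(group = c("b", "a", "a", "b"),
                 score = c(10, 30, 20, 40))
sorted <- df[order(df$group, -df$score), ]
sorted$score  # 30 20 40 10
```

Note that "-" only works on numeric keys; for descending character sorts, order() also accepts a decreasing argument.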

Task #7: Generate new variables

In Stata, you had "generate" or "gen". Creating new variables in R is equally simple. If we wanted to take logs of prices, the assignment operator comes in handy again:

auto$logprice <- log(auto$price)

A new variable called logprice has been created, and is now part of the dataframe.

You can verify that this is the case by typing:

names(auto)
What if we wanted to generate a dummy variable (or "indicator variable") called highprice, equal to 1 when the price of the car was $8000 or higher, and 0 otherwise? Then the command

auto$highprice <- 0
auto$highprice[auto$price >= 8000] <- 1

would do the trick.
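As an aside, the two lines above can be collapsed into one, because a logical comparison in R yields TRUE/FALSE values that coerce to 1/0. A minimal sketch with made-up prices:

```r
# Hypothetical prices; TRUE/FALSE coerces to 1/0 under as.numeric().
price <- c(4099, 12990, 7827, 8814)
highprice <- as.numeric(price >= 8000)
highprice  # 0 1 0 1
```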

Task #8: Dropping variables and observations

Compared to Stata, you may need one more command to drop a variable in R, but it's still trivial. Let's say I want to drop the newly created variable logprice:

drops <- "logprice"
auto <- auto[,!(names(auto) %in% drops)]

The second command essentially replaces the auto dataset with a "new" dataset containing all variables in auto except logprice. Remember the comma!

What if I wanted to drop two variables? Use c, which is R's concatenate function:

drops <- c("headroom","trunk")
auto <- auto[,!(names(auto) %in% drops)]

and you will find that both headroom and trunk have been dropped.
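An equivalent way to phrase the same drop, shown here on a toy data frame, is to keep the set difference of the column names; setdiff() is base R:

```r
# setdiff() keeps every column name not listed in drops.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
drops <- c("b", "c")
df <- df[, setdiff(names(df), drops), drop = FALSE]
names(df)  # "a"
```

The drop = FALSE argument prevents R from collapsing a one-column result into a plain vector.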

What if I wanted to drop certain observations? To drop all cars priced above $10000, I could simply run:

auto <- auto[auto$price < 10000,]

Again, don't forget the comma towards the end. Experienced programmers will have figured out why we need the comma, and why it is placed where it is. For those who are less sure, type

auto[3,1]
and notice the output is

[1] AMC Spirit

In other words, R is giving us the value of the third row and first column of the dataframe (think of the dataframe as the dataset). So the value before the comma indicates the row, and the value after the comma indicates the column. If we wanted all columns of the third observation (i.e. the entire third row), we would simply type

auto[3,]

omitting all numbers after the comma. It should now be clear why, in the previous commands, the commas were placed where they were.
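These indexing rules are easy to confirm on a toy data frame (hypothetical values, not the auto dataset):

```r
# Row/column indexing on a small data frame.
df <- data.frame(make = c("x", "y", "z"), price = c(1, 2, 3),
                 stringsAsFactors = FALSE)
df[3, 1]       # third row, first column: "z"
df[3, ]        # entire third row
df[, "price"]  # entire price column
```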

Remark: The beauty of R as compared to Stata is that one can create new datasets on the fly and keep both new and old datasets in memory at the same time. One can therefore do analyses on several datasets while keeping only one window open.

For example,

auto_lowprice <- auto[auto$price < 5000,]

creates a new dataset containing low-priced autos (below $5000). I can analyse both auto_lowprice and auto in the same window.
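On a toy data frame, the same on-the-fly subsetting looks like this; both objects stay in memory side by side:

```r
# Keep only rows with price below 5000; the original df is untouched.
df <- data.frame(price = c(3000, 6000, 4500), mpg = c(22, 18, 30))
low <- df[df$price < 5000, ]
nrow(low)  # 2
nrow(df)   # 3
```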

Task #9: Linear regression

In Stata, you have "reg" or "regress". R's analogous command is "lm", short for "linear model".

Let's drop and load the auto dataset again, to ensure that we're on the same page. If you are unsure how to do this, use rm(auto) to remove the dataset from memory, and then load it by following the instructions in part 1.

Having loaded the original auto dataset, we can now run a linear regression on the entire sample. Suppose we want to regress price on weight and length (i.e. price is the dependent variable, and weight and length are independent variables).

The command

lm(price ~ weight + length, auto)

spits out the following output:

Call:
lm(formula = price ~ weight + length, data = auto)

Coefficients:
(Intercept)       weight       length  
  10386.541        4.699      -97.960  

The syntax goes like this:
lm(dependent var. ~ independent variables, dataset)

Notice that independent variables are separated by a plus ("+") sign.

As an experienced Stata user, you must be screaming now. "Where are my standard errors and p-values?"

That's simple.

summary(lm(price ~ weight + length, auto))

should give you what you want:

Call:
lm(formula = price ~ weight + length, data = auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-4432.3 -1601.9  -523.6  1218.2  5771.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10386.541   4308.159   2.411   0.0185 *  
weight          4.699      1.122   4.187    8e-05 ***
length        -97.960     39.175  -2.501   0.0147 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2416 on 71 degrees of freedom
Multiple R-squared:  0.3476, Adjusted R-squared:  0.3292 
F-statistic: 18.91 on 2 and 71 DF,  p-value: 2.607e-07
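Unlike Stata's printed output, every number in the summary can also be extracted programmatically. A minimal sketch on simulated data (the names x, z, y and the true slope of 2 are made up for illustration):

```r
# Simulate data, fit a model, then pull numbers out of summary().
set.seed(1)
x <- rnorm(50)
z <- rnorm(50)
y <- 2 * x - z + rnorm(50)
fit <- lm(y ~ x + z)
s <- summary(fit)
coef(s)                   # coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef(s)["x", "Estimate"]  # the slope on x (close to 2)
s$r.squared               # R-squared as a plain number
```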

Task #10: Post estimation commands

R, like Stata, has numerous post-estimation commands. With this in mind, it may be useful to actually save the linear model before performing post-estimation commands:

my_model <- lm(price ~ weight + length, auto)

saves the linear model as my_model, following which

summary(my_model)

displays the results (just like we saw previously). Here's a sample of post-estimation commands:

confint(my_model, level = 0.9) # gives a 90% confidence interval of regression coefficients
fitted(my_model) # gives the predicted values of each observation, based on the estimated regression coefficients
residuals(my_model) # gives the residuals 
influence(my_model) # gives the basic quantities used in forming a wide variety of regression diagnostics
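Another common post-estimation step, the analogue of Stata's "predict" with new data, is R's predict(). A sketch on a hypothetical mini-dataset (the numbers are invented):

```r
# Fit on made-up data, then predict price for a car not in the dataset.
dat <- data.frame(weight = c(2000, 2500, 3000, 4000),
                  length = c(150, 165, 180, 200),
                  price  = c(4000, 5200, 6100, 9000))
m <- lm(price ~ weight + length, dat)
predict(m, newdata = data.frame(weight = 2700, length = 170))
```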

You've doubled your R knowledge by being able to do five more tasks. Congratulations!

In part 3, we learn about saving work, scatterplots, handling dates and times, and obtaining substrings.
