Tuesday, June 30, 2015

R for Stata users (Part 1)

Stata is a popular data analysis package among economists and epidemiologists, to name just two disciplines.

Stata's advantage is that it has a wide variety of pre-programmed functions. These functions are not only easy to learn, but also easy to use. However, anything that is not pre-programmed (e.g. implementing your own algorithm) is difficult in Stata. It is much less difficult in R.

How, then, can a veteran Stata user convert to R? Here's a useful guide to mastering the basics.

If you haven't already, download the R program from the R website. Next, install RStudio.

What is RStudio?

You may have heard that the R interface is very plain, with pretty much only a command line to enter your commands (see below):


In contrast, the RStudio interface is much more intuitive:

Source: rprogramming.net

Looks just as good as Stata - in fact, probably better. So after you've finished downloading and installing it, your screen should be something like this:



Task #1: Import and open a Stata dataset

Let's take Stata's famous auto dataset, which you may have heard of. (To access this in Stata, simply type "sysuse auto.dta".)

Method #1: 
Open up Stata and load the auto dataset. Then save the auto dataset as a CSV file. Preferably name it auto.csv. If you don't have access to Stata now, here's a link to a CSV file.

Then go to RStudio and click Tools > Import Dataset > From Text File...

Below is a screenshot if you're not sure what to click.


You'll then encounter this dialog box.


Make any necessary changes (e.g. change "Heading" to Yes). Then click on "Import" on the bottom right hand corner. Your screen should look like this:


Look at the bottom left corner of the above screen. Notice how RStudio automatically generates the required command (just like Stata)? Also, RStudio automatically presents the data in the top left hand corner of the screen.
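
The generated command is usually a read.csv call, so you can also type it yourself. Here is a minimal sketch, assuming a tiny made-up CSV written to a temporary file in place of auto.csv (the rows, the path and the name auto_demo are illustrative only):

```r
# Write a tiny made-up CSV so the example runs without auto.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("make,price,mpg",
             "Car A,4000,22",
             "Car B,5000,17"), csv_path)

# header = TRUE treats the first row as variable names
auto_demo <- read.csv(csv_path, header = TRUE)

str(auto_demo)   # variable names and storage types
head(auto_demo)  # first few rows
```

With the real file, you would replace csv_path with the location of your own auto.csv.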

Method #2:
Install the package 'foreign', and then use it to import a .dta file directly.

I'll leave this as an exercise, but will give some hints. You should start off with the following commands:

install.packages('foreign')
library(foreign)

Google if you're unsure how to continue!
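
If you get stuck, here is one possible way to continue. It is a sketch, not the only answer. Since you may not have auto.dta at hand, it first writes a small made-up data frame to a temporary .dta file with write.dta and then reads it back with read.dta (the values and the names demo and auto_demo are hypothetical):

```r
library(foreign)

# Made-up stand-in for the auto data so the example runs anywhere
demo <- data.frame(price = c(4000, 5000, 3800),
                   mpg   = c(22, 17, 25))

# Save it in Stata format to a temporary location
dta_path <- tempfile(fileext = ".dta")
write.dta(demo, dta_path)

# Import the .dta file directly into R
auto_demo <- read.dta(dta_path)
print(auto_demo)
```

Note that read.dta only reads files saved in Stata version 5 to 12 format. For files from newer versions of Stata, you may need to save in an older format or try another package such as haven.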

(Another technical note: Since I am using RStudio here, I let blue Courier text denote commands and red text denote R's output. This is the opposite of other blog posts.)


Task #2: Create an R Script (aka 'do-file')

Recall how Stata had do-files? In R, you have R scripts. Just like do-files, these are simply text files containing commands.

Click File > New File > R Script



and you'll notice that an "Untitled1" file covers the auto dataset. Now we can start writing our do-file (okay, from now onwards I'll say R Script).

Task #3: Generate summary statistics

In the top left corner (the editor, not the Console window), type in summary(auto) and click Run (near the top right hand corner of the editor). If you can't find it, press Ctrl+Enter and the code will run.

You'll observe some summary statistics in the Console screen. Some questions naturally arise.

Q: What if I only want one variable?
A: Try using the dollar sign. For example, summary(auto$price)

Q: But when I type "codebook" in Stata, it gives much more useful information such as number of missing values, etc. How can I generate these statistics?
A: Try the following commands:

install.packages('pastecs')
library(pastecs)
stat.desc(auto)

or if you prefer to look at a single variable (e.g. price), replace the last line with

stat.desc(auto$price)
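
If you would rather not install a package, base R can produce some of the same codebook-style facts. A minimal sketch, assuming a few made-up rows in place of the real auto data (auto_demo and its values are hypothetical):

```r
# Made-up rows standing in for the auto data; NA marks a missing value
auto_demo <- data.frame(
  price   = c(4000, 5000, 3800, 6200),
  rep78   = c(3, NA, 4, NA),
  foreign = c("Domestic", "Domestic", "Foreign", "Foreign")
)

# Number of missing values per variable
missing_counts <- colSums(is.na(auto_demo))
print(missing_counts)

# Storage type of each variable
print(sapply(auto_demo, class))

# Number of distinct non-missing values per variable
print(sapply(auto_demo, function(x) length(unique(na.omit(x)))))
```

On the real dataset you would pass auto instead of auto_demo.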


Task #4: Generate correlation matrices

For preliminary data analysis, you may wish to generate correlation matrices. (In Stata, these are produced with corr or pwcorr.)

For R, the analogous function is cor.

One may be tempted to type in cor(auto). However, RStudio displays the following error:

Error in cor(auto) : 'x' must be numeric

The non-numeric variables are preventing R from displaying the correlations. You can rectify this by issuing the following command instead:

cor(auto[sapply(auto, is.numeric)])

and the output should start like:

                  price        mpg rep78   headroom      trunk
price         1.0000000 -0.4685967    NA  0.1145056  0.3143316
mpg          -0.4685967  1.0000000    NA -0.4138030 -0.5815850
rep78                NA         NA     1         NA         NA
headroom      0.1145056 -0.4138030    NA  1.0000000  0.6620111
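
The NA entries for rep78 appear because that variable has missing values, and by default cor returns NA for any pair that involves one. The use argument changes this: "pairwise.complete.obs" is similar in spirit to Stata's pwcorr, and "complete.obs" drops every row with a missing value, as corr does. A small sketch with made-up rows (auto_demo is hypothetical, not the real data):

```r
# Made-up rows; rep78 has one missing value, make is non-numeric
auto_demo <- data.frame(
  price = c(4000, 5000, 3800, 6200, 4500),
  mpg   = c(22, 17, 25, 15, 20),
  rep78 = c(3, NA, 4, 2, 3),
  make  = c("A", "B", "C", "D", "E")
)

# Keep only the numeric variables, as in the command above
num_vars <- auto_demo[sapply(auto_demo, is.numeric)]

# Default behaviour: pairs involving rep78 come out as NA
cor_default <- cor(num_vars)
print(cor_default)

# Pairwise deletion: each pair uses the rows where both values are present
cor_pairwise <- cor(num_vars, use = "pairwise.complete.obs")
print(cor_pairwise)
```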

What does sapply do? You would normally type "help sapply" in Stata, but the command is slightly different in R:

? sapply

The help file will load in the bottom right hand corner. Take a good look at it.

You may want to annotate your file at this point. For example, write a comment to remind yourself about what sapply does. In R, comments start with the pound sign ("#"). RStudio will display comments in green, just like Stata.
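
As an illustration, here is how a commented version of the sapply step might look. This is only a sketch on a made-up data frame (auto_demo is hypothetical):

```r
# Made-up rows standing in for the auto data
auto_demo <- data.frame(
  make  = c("Car A", "Car B"),
  price = c(4000, 5000),
  mpg   = c(22, 17)
)

# sapply applies a function to each column of a data frame
# and simplifies the result to a vector when it can
is_num <- sapply(auto_demo, is.numeric)
print(is_num)

# A logical vector inside [ ] keeps only the columns marked TRUE
numeric_only <- auto_demo[is_num]
print(names(numeric_only))
```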

Task #5: List observations fulfilling certain criteria

You must be itching to find R's version of Stata's list command. It's just a different word:

To list the entire dataset:
print(auto)

To list only cars with gear ratio above 3.8:
print(subset(auto, subset = gear_ratio>3.8))

To list only foreign cars:
print(subset(auto, subset = foreign=="Foreign"))

To list only cars with gear ratio above 3.8 and mpg above 30:
print(subset(auto, subset = gear_ratio>3.8 & mpg > 30))
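
In Stata you would often list only a few variables, e.g. "list make mpg if gear_ratio > 3.8". The select argument of subset does the same job, and bracket indexing is a common alternative. A minimal sketch on made-up rows (auto_demo and its values are hypothetical):

```r
# Made-up rows standing in for the auto data
auto_demo <- data.frame(
  make       = c("Car A", "Car B", "Car C", "Car D"),
  mpg        = c(22, 35, 31, 18),
  gear_ratio = c(3.58, 3.90, 3.70, 2.90),
  foreign    = c("Domestic", "Foreign", "Foreign", "Domestic")
)

# Rows with gear ratio above 3.8, showing only make and mpg
picked <- subset(auto_demo, gear_ratio > 3.8, select = c(make, mpg))
print(picked)

# Same rows and columns with bracket indexing
picked2 <- auto_demo[auto_demo$gear_ratio > 3.8, c("make", "mpg")]
print(picked2)
```

One difference to keep in mind: when the condition is NA for some row, subset drops that row, while bracket indexing returns a row of NAs.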

You've learned five crucial tasks. Part 2 will teach you five more key tasks!
