Thursday, July 2, 2015

R for Stata users (Part 3)

In part 1 and part 2, we learned how to use R to load datasets, perform simple manipulations, and conduct linear regression.

Here, we will learn how to save work, draw scatterplots, handle dates and times, extract substrings, and encode variables. Make sure that you've opened RStudio and loaded the auto dataset.

Task #11: Save work

Once you've loaded and analyzed the auto dataset, you'll be wondering how to save it. For example, you may need to pass a CSV or Stata file to your colleague. Fortunately, R has lots of capabilities in these areas.


  • CSV file

The following command will write a CSV file to R's current working directory:

write.csv(auto, "edited_auto.csv")

If you don't know what the current working directory is, type

getwd()

(short for get working directory). You can then find your file in that directory.
Alternatively, type

write.csv(auto, "C:/Documents/edited_auto.csv")

if you have a folder called Documents on your C: drive and want to place the file there.

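Important: In R, use forward slashes instead of backslashes when writing directory paths, even on Windows. Inside an R string a single backslash starts an escape sequence, so a backslash in a path would have to be doubled (\\).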
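
As a small sketch of the point about slashes (base R only, reusing the example path from above), file.path builds a path from its parts, and a doubled backslash can be converted to a forward slash:

```r
# Build a path from its parts; file.path() joins them with forward slashes.
csv_path <- file.path("C:", "Documents", "edited_auto.csv")
print(csv_path)

# A backslash inside an R string must be doubled, because "\" starts an escape.
escaped_path <- "C:\\Documents\\edited_auto.csv"

# Convert the backslashes to forward slashes; fixed = TRUE matches "\" literally.
converted_path <- gsub("\\", "/", escaped_path, fixed = TRUE)
print(converted_path)
```

Both variables end up holding the same forward-slash path, which can be passed to write.csv.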

If you want a tab-separated file instead, use the more general write.table function:

write.table(auto, "C:/Documents/edited_auto.txt", sep="\t")

where \t denotes the tab symbol.
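
To check that a file was written as intended, it can help to read it back in. The sketch below is only an illustration: auto_demo is a hypothetical three-row stand-in for the auto dataset (with illustrative values), and the files go to a temporary folder rather than C:/Documents. It also passes row.names = FALSE, which stops R from adding its own column of row numbers to the file.

```r
# Hypothetical stand-in for the auto dataset, so the example runs on its own.
auto_demo <- data.frame(
  make   = c("AMC Concord", "AMC Pacer", "Buick Century"),
  price  = c(4099, 4749, 4816),
  weight = c(2930, 3350, 3250),
  stringsAsFactors = FALSE
)

# Write to a temporary folder (an assumption made to keep the example tidy).
csv_file <- file.path(tempdir(), "edited_auto.csv")
tsv_file <- file.path(tempdir(), "edited_auto.txt")

# row.names = FALSE drops the 1, 2, 3, ... row labels that R writes by default.
write.csv(auto_demo, csv_file, row.names = FALSE)
write.table(auto_demo, tsv_file, sep = "\t", row.names = FALSE)

# Read both files back and inspect the result.
csv_back <- read.csv(csv_file, stringsAsFactors = FALSE)
tsv_back <- read.delim(tsv_file, stringsAsFactors = FALSE)
str(csv_back)
str(tsv_back)
```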

  • Excel file
To write an Excel file, you need an add-on, which R calls a "package" (installed packages live in a "library", which is why the command that loads one is called library). A package first needs to be installed:

install.packages("xlsx")

where xlsx is the specific package we are instructing R to download. Note that the xlsx package relies on Java, so Java must be installed on your machine. Thereafter, we need to load the package:

library(xlsx)

and we can then write our Excel file.

write.xlsx(auto, "C:/auto.xlsx")

The first parameter (auto) is the dataframe we want R to save, and the second parameter ("C:/auto.xlsx") is where we want to save it. Notice again the use of a forward slash.


  • Stata

Of course, as a veteran Stata user, you may want to save something directly into a .dta file. You'll need a different package, called "foreign". It usually ships with R, so the install step below may be unnecessary, but it does no harm.

install.packages("foreign")
library(foreign)
write.dta(auto,"auto.dta")

Since no folder was given, auto.dta is written to the working directory, so look for it there.
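
A quick way to confirm that the .dta file is usable is to read it back with read.dta from the same package. This is a minimal sketch, assuming the foreign package is available; auto_demo is a hypothetical stand-in for auto with illustrative values, and the file goes to a temporary folder.

```r
library(foreign)

# Hypothetical stand-in for the auto dataset, so the example runs on its own.
auto_demo <- data.frame(
  make  = c("AMC Concord", "AMC Pacer", "Buick Century"),
  price = c(4099, 4749, 4816),
  stringsAsFactors = FALSE
)

# Write the Stata file to a temporary folder, then read it straight back.
dta_file <- file.path(tempdir(), "auto.dta")
write.dta(auto_demo, dta_file)
auto_back <- read.dta(dta_file)

# The two dataframes should hold the same rows and columns.
print(auto_back)
```

One caveat: write.dta produces an older .dta format. Recent versions of Stata can still open such files, but it is worth testing with your colleague's version before relying on it.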


  • SAS
If your collaborators use SAS, don't worry: the foreign package also has SAS capabilities.

write.foreign(auto, "auto.txt", "auto.sas", package="SAS")

The first parameter tells R which dataset to use, and the second names the data file to write. The third names a second file, which will contain code that reads the data file. What kind of code? That's the job of the final parameter, which says that SAS code should be produced.
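
To see what the two output files look like, the sketch below writes them to a temporary folder and prints the generated SAS code. It assumes the foreign package is available and uses auto_demo, a hypothetical stand-in for auto with illustrative values.

```r
library(foreign)

# Hypothetical stand-in for the auto dataset, so the example runs on its own.
auto_demo <- data.frame(
  make  = c("AMC Concord", "AMC Pacer", "Buick Century"),
  price = c(4099, 4749, 4816),
  stringsAsFactors = FALSE
)

# One file holds the raw data, the other holds the SAS program that reads it.
data_file <- file.path(tempdir(), "auto.txt")
code_file <- file.path(tempdir(), "auto.sas")
write.foreign(auto_demo, data_file, code_file, package = "SAS")

# Print the generated SAS program to the console.
cat(readLines(code_file), sep = "\n")
```

Your collaborator needs both files: they run the .sas program, which in turn reads the .txt data file.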


Task #12: Draw scatterplots

Plotting graphs in R is pretty much an art. There are many specialized packages which allow you to draw beautiful graphs. However, for the beginner, it's best to use the simple plot function.

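Let's say we've loaded the dataset and want a scatterplot with price on the horizontal axis and weight on the vertical axis. The first argument to plot is the x variable and the second is the y variable, so we can type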

plot(auto$price,auto$weight)

The scatterplot appears in the Plots pane, at the bottom right corner of the RStudio window.



However, it's a bare scatterplot. Let's make it better looking. First off, we can label the x and y axes more appropriately:

plot(auto$price,auto$weight, xlab="Price",ylab="Weight")

We can also add a title:

plot(auto$price,auto$weight, xlab="Price",ylab="Weight", main="Price vs. Weight")
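
If you want the graph as a file rather than only in the Plots pane, one option is to open a graphics device, draw the plot, and close the device. The sketch below is an illustration under a few assumptions: auto_demo is a hypothetical stand-in for auto with illustrative values, the output goes to a temporary folder, and the file name is made up. It also uses the formula form weight ~ price, which draws the same picture as plot(price, weight).

```r
# Hypothetical stand-in for the auto dataset, so the example runs on its own.
auto_demo <- data.frame(
  price  = c(4099, 4749, 4816),
  weight = c(2930, 3350, 3250)
)

# Open a PDF device; everything drawn until dev.off() goes into this file.
plot_file <- file.path(tempdir(), "price_weight.pdf")
pdf(plot_file)

# weight ~ price reads as "weight on the y-axis, price on the x-axis".
plot(weight ~ price, data = auto_demo,
     xlab = "Price", ylab = "Weight", main = "Price vs. Weight")

# Close the device so the file is completely written.
dev.off()
```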

For those who want more advanced graphing, take a look at the ggplot2 and lattice packages.


Task #13: Handle dates and times

This is a little difficult to teach using the auto dataset (since there are no dates), but you can find a good tutorial here.
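
To give a flavour of the basics, here is a minimal sketch using base R's Date and POSIXct classes. The values are made up for illustration and are not columns of the auto dataset.

```r
# Text in year-month-day order converts directly to a Date.
post_date <- as.Date("2015-07-02")

# Other layouts need a format string: %m = month, %d = day, %Y = 4-digit year.
us_style <- as.Date("07/02/2015", format = "%m/%d/%Y")

# Subtracting two dates gives the number of days between them.
days_since_new_year <- as.numeric(post_date - as.Date("2015-01-01"))
print(days_since_new_year)

# Adding a number to a date moves it forward by that many days.
one_week_later <- post_date + 7
print(one_week_later)

# Date-times use POSIXct; format() pulls out pieces such as the time of day.
stamp <- as.POSIXct("2015-07-02 14:30:00", tz = "UTC")
print(format(stamp, "%H:%M"))
```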

Task #14: Obtain substrings

Take a look at the variable "make". Notice that it takes on values such as "AMC Concord", "AMC Pacer", "Buick Century", and so on. Let's say that the first word is the car manufacturer's name, and the second word refers to the model. How can we separate the two?

Packages come to the rescue again.

install.packages("stringr")
library(stringr)
auto$manufacturer <- str_split_fixed(auto$make, " ",2)[,1]
auto$model <- str_split_fixed(auto$make, " ",2)[,2]

str_split_fixed is a function in the stringr package.
The first parameter is the dataframe column we want to split, and " " tells the function to split at a space. The last parameter (2) means that we want exactly two pieces, so the split happens at the first space only: a three-word make keeps its last two words together as the model. The result is a matrix with two columns, and [,1] and [,2] pick out the first and second column respectively.

At this point, it's useful to check that you've got the desired results.
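
One way to check is to rebuild the two columns by a different route and compare. The sketch below swaps in base R's sub function in place of stringr, so it runs without any extra package. It uses auto_demo, a hypothetical stand-in with a few illustrative makes, including one with three words and one with a single word.

```r
# Hypothetical stand-in for the make column, so the example runs on its own.
auto_demo <- data.frame(
  make = c("AMC Concord", "Buick Century", "Olds Cutlass Supreme", "Subaru"),
  stringsAsFactors = FALSE
)

# Manufacturer: delete everything from the first space onwards.
auto_demo$manufacturer <- sub(" .*$", "", auto_demo$make)

# Model: delete the first word and the space after it (if there is one).
auto_demo$model <- sub("^[^ ]+ ?", "", auto_demo$make)

# Eyeball the three columns side by side.
print(auto_demo)
```

On your real data, head(auto[, c("make", "manufacturer", "model")]) shows the first few rows of the same three columns.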


Task #15: Encode variables

Look at the variable "foreign". In Stata, there is the "encode" command, which turns the different string values of foreign into numeric codes.

Here, you can do

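auto$foreign_dummy <- as.numeric(factor(auto$foreign))

and you will see 1's and 2's (1 for Domestic, 2 for Foreign). Wrapping the column in factor() makes this work even if foreign was loaded as plain text rather than as a factor; calling as.numeric directly on text would give NA's.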
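
To see where the codes come from, and how to get a 0/1 dummy instead, here is a small sketch. The vector foreign_text is a made-up stand-in for the foreign column.

```r
# Made-up stand-in for the foreign column, stored as plain text.
foreign_text <- c("Domestic", "Foreign", "Domestic", "Foreign")

# factor() collects the distinct values as "levels", sorted alphabetically.
foreign_factor <- factor(foreign_text)
print(levels(foreign_factor))

# as.numeric() on a factor returns each value's position among the levels.
foreign_code <- as.numeric(foreign_factor)
print(foreign_code)

# For a 0/1 dummy, test for the value directly and convert TRUE/FALSE to 1/0.
foreign_dummy <- as.numeric(foreign_text == "Foreign")
print(foreign_dummy)
```

The levels are listed in the same order as the codes, so levels(foreign_factor) acts as the lookup table that Stata would store as a value label.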

