Let's try best subset selection on the auto dataset:
regfit.full <- regsubsets(price~., auto)
It doesn't work: by default, regsubsets performs an exhaustive search over every candidate subset of predictors, which is enormous here, so it refuses to run unless you set really.big=TRUE.
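If you genuinely wanted the exhaustive search, a sketch might look like the following (regfit.huge is just an illustrative name; the search may be very slow, and nvmax caps the largest model size considered):
regfit.huge <- regsubsets(price ~ ., data = auto, really.big = TRUE, nvmax = 10)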
But let's just make things simpler by carving out a subset of variables:
autosmall <- auto[,c(2,3,5,6,7,8,9,10,11)]
regfit.full <- regsubsets(price ~ . , autosmall) # the . indicates we want all variables to be considered.
summary(regfit.full)
regfit.fwd <- regsubsets(price~., autosmall, method="forward") # forward stepwise selection
regfit.bwd <- regsubsets(price~., autosmall, method="backward") # backward stepwise selection
coef(regfit.fwd,3) # gives the coefficients for regfit.fwd for the optimal 3-variable model
Recall that we actually need to split our dataset into a training set and test set. This is how we can do it:
set.seed(1)
train <- sample(c(TRUE, FALSE),nrow(autosmall),rep=TRUE)
# if we wanted 75% of observations to be in training set, we can do the below:
# train <- sample(c(TRUE, FALSE),nrow(autosmall),rep=TRUE, prob = c(0.75, 0.25))
test <- (!train)
# another way of allocating the training/test sets:
# train <- sample(1:nrow(autosmall), nrow(autosmall)/2)
# test <- (-train)
# autosmall.test <- autosmall[test]
# (notice that test contains all negative numbers, so all "train" observations are excluded)
So far we haven't done any model selection on held-out data (nor any regularization). Let's fix that, first with the validation-set approach and then with k-fold cross-validation. Start by running best subset selection on the training observations only:
regfit.full <- regsubsets(price~., autosmall[train,])
reg.summary <- summary(regfit.full) # for convenience
names(reg.summary)
For model selection, it'll be useful to plot how RSS (and other measures such as adjusted R-sq) varies with number of variables:
par(mfrow = c(1,2)) # sets a 1 row by 2 column plot
plot(reg.summary$rss,xlab="# of Var", ylab="RSS", type="l")
(m1 <- which.min(reg.summary$rss))
points(m1,reg.summary$rss[m1], col="red",cex=2,pch=20)
# cex controls the size of the plotted point
plot(reg.summary$adjr2,xlab="# of Var", ylab="Adj R2", type="l")
(m2 <- which.max(reg.summary$adjr2))
points(m2,reg.summary$adjr2[m2], col="red",cex=2,pch=20)
plot(regfit.full, scale="r2")
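The same plot method accepts other measures of fit; for example:
plot(regfit.full, scale="adjr2")
plot(regfit.full, scale="bic")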
Now we need to see MSE on the test set. First off, use model.matrix to generate the "X" matrix for the test set:
test.mat <- model.matrix(price~., data=autosmall[test,])
val.errors = rep(NA,8) # empty vector to hold the test MSE for each model size
for (i in 1:8) {
coefi <- coef(regfit.full, id=i) # takes the coefficients from the optimal model with i variables
pred <- test.mat[,names(coefi)] %*% coefi # generate predicted values of each observation
val.errors[i] <- mean((autosmall$price[test] - pred)^2) # generate MSE for i-th model
}
val.errors
#[1] 7781131 9856769 10060398 9574022 9421108 9470527 9269711 8942316
(v <- which.min(val.errors))
#[1] 1
# so we choose model with 1 variable (weight only)
coef(regfit.full, v)
#(Intercept) weight
#-426.910662 2.251572
We can also wrap this prediction step in a function, which will come in useful for k-fold cross-validation:
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]]) # takes the second item from object$call list
# call is part of regfit.full
# the second item within the list is the relevant variables
mat <- model.matrix(form, newdata) # generates the required X matrix using the formula
coefi <- coef(object, id=id) # the X variable COEFFICIENTS of the optimal model
xvars <- names(coefi) # these are the X variables of the optimal model
mat[,xvars] %*% coefi # generates predicted values (by looking at the relevant columns only)
}
# 10-fold cross validation
k = 10
set.seed(1)
folds = sample(1:k, nrow(autosmall), replace=TRUE)
# if an observation is labelled 'i', it will be part of the i-th fold.
# e.g. an observation that is labelled 3 will be left out of the 3rd round of cross validation
cv.errors = matrix(NA,k,8, dimnames=list(NULL, paste(1:8)))
# sets up a k by 8 matrix filled with NA's.
# columns are named 1 to 8
# armed with our new predict() method, we can:
for (j in 1:k) {
# obtain the required coefficients
best.fit <- regsubsets(price ~ . , data=autosmall[folds != j,])
# note: if you have more than 8 variables you will need to specify nvmax in the above command
for (i in 1:8) {
# now, calculate MSE using the i-th fold
pred <- predict(best.fit, autosmall[folds == j,],id=i)
# since best.fit is of class regsubsets, the predict command that was created will work on it.
cv.errors[j,i] <- mean((autosmall$price[folds == j] - pred)^2)
}
}
(mean.cv.errors <- apply(cv.errors,2,mean))
Let's plot the cross-validation MSE against the number of variables.
par(mfrow=c(1,1))
plot(mean.cv.errors, type='b')
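Having identified the model size with the lowest cross-validated MSE, a natural last step (a small sketch mirroring the validation-set code above) is to refit best subset selection on the full dataset and report the coefficients for that size:
best.size <- which.min(mean.cv.errors)
reg.best <- regsubsets(price ~ ., data = autosmall)
coef(reg.best, id = best.size)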
Sunday, July 12, 2015
Overview of Machine Learning Techniques in R
Whenever we do something as simple as a Google search, we benefit from machine learning. Demand is high for those with machine learning skills - a search of LinkedIn reveals 9,369 results for "machine learning".
This post gives an overview of packages (and functions within those packages) one can use to conduct machine learning techniques. A lot of this material is drawn from the fourth edition of the textbook Introduction to Statistical Learning.
- Linear regression
Library: leaps
Functions: regsubsets
- Decision trees
- Support Vector Machines
- Unsupervised learning
Subsequent posts will elaborate on these in more detail.
Apply, lapply, sapply, tapply in R
How can we use R to efficiently calculate key statistics such as the mean?
Let's play with the auto dataset to see how.
apply(cbind(auto$price,auto$length,auto$weight),2,mean) # finds simple means
apply(auto,2,mean) would not work because some columns are non-numeric.
Instead, one can try apply(auto[,c(2,3)],2,mean).
If you wanted to find out the number of values each variable took, this would be simple:
apply(auto, 2, function (x) length(unique(x)))
The second argument (2) tells R to work column-by-column. Had 1 been specified, R would have worked row-by-row. You can also use the sapply function, which by default returns a vector:
sapply(auto, function(x) length(unique(x)))
If we wanted the output in terms of a list, we could use lapply:
lapply(auto, function(x) length(unique(x)))
What if we wanted disaggregated statistics? For example, we might want to know what is the mean price of foreign cars, as well as the mean price of domestic cars. We can use
tapply(auto$price, auto$foreign, mean)
This is similar to Excel's PivotTable. The by function also works:
by(auto$price, auto$foreign, mean)
What if the variable has NA values? For example, by(auto$rep78, auto$foreign, mean) returns the unglamorous
auto$foreign: Domestic
[1] NA
-----------------
auto$foreign: Foreign
[1] NA
This is because rep78 contains missing values. We can easily handle this:
tapply(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
or
by(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
And what if we wanted to restrict attention to cars with prices above $10,000?
tapply(auto$rep78, auto$foreign, function(x) mean(x[auto$price > 10000],na.rm=TRUE))
or
by(auto$rep78, auto$foreign, function(x) mean(x[auto$price > 10000], na.rm=TRUE))
Yet another way is to subset the data first:
m <- subset(auto, !is.na(auto$rep78) & auto$price > 10000)
by(m$rep78, m$foreign, mean)
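For completeness, base R's aggregate() produces the same kind of grouped summary as a data frame; a small sketch (the formula interface drops NAs by default):
aggregate(rep78 ~ foreign, data = subset(auto, price > 10000), FUN = mean)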
Useful resources for learning R
- R-bloggers is a collection of R blogs.
- UCLA has a comprehensive guide
- Try R, Code School's interactive tutorial
Monday, July 6, 2015
Extracting JSON data in R
Increasingly, data on the internet is presented in JSON format. For the uninitiated, here's what a JSON file looks like:
{
"people":[
{"name":"John Smith", "timings":[1,2,3]},
{"name":"Adam Yu", "timings":[3,2,1]},
{"name":"Frank Pin", "timings":[7,4,3]}
]
}
It's more or less a data file. The curly braces indicate objects with named fields (e.g. we know that we have the name and timings of each person), while the square brackets indicate ordered, unnamed arrays (e.g. Frank's first timing is 7, his second timing is 4, and his third timing is 3).
How can R analyze the data appropriately? Let's take a look at a real life example: the salaries of public servants at the City of Chicago. R has three major packages which can analyze JSON data.
These are RJSONIO, rjson, and jsonlite. The differences are minor, especially for simple JSON files.
Here, we will use the RJSONIO package.
First, let's download and install the required packages and load the required data.
install.packages("RJSONIO")
library(RJSONIO)
RawData <- fromJSON("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.json?accessType=DOWNLOAD")
Should you have difficulty downloading the city of Chicago data, here's a link. (Note: information in the link is correct as of July 2015)
You should actually display the JSON file in a web browser for easier viewing. Once you've done this, you'll notice that what you're looking for is in the data section. Hence
Data <- RawData$data
gives you most of what you need. I say most, because things like variable names aren't there. Here's an example of how you would extract a given variable:
employeeNames <- sapply(Data, function(x) x[[9]])
What sapply does is take each observation in Data (which is a list of lists), obtain the 9th element of each list, and save it to employeeNames. sapply returns an array.
Just to check that things are working properly, we can run
head(employeeNames)
In order to extract all variables efficiently, we need to define two functions.
install.packages('gdata')
library(gdata) # necessary for trim
grabinfo <- function(var) {
print(paste("Variable", var, sep=" ")) # for aesthetics: tells you which variables have been processed
sapply(Data, function(x) returnData(x,var)) # the dataset "Data" is hardcoded here and is input as parameter "x"
}
returnData <- function(x, var) {
if (!is.null( x[[var]] )) {
return ( trim( x[[var]] ))
} else {
return(NA)
}
}
df <- data.frame(sapply(1:length(Data[[1]]),grabinfo), stringsAsFactors=FALSE)
# Performs grabinfo for each variable in Data
# you run grabinfo(1), grabinfo(2), ... grabinfo(12).
#
# grabinfo(1) then uses sapply (again) to get values of Variable1 for all observations
# likewise with grabinfo(2), and so on
#
# in doing so, the entire dataset is transformed into a dataframe
Finally, some technicalities:
head(df) # Just checking that things are working again
names(df) <- sapply(1:12,function(x) RawData[['meta']][['view']][['columns']][[x]][['name']])
df$`Employee Annual Salary` <- as.numeric(df$`Employee Annual Salary`) # change the variable to numeric
Of course, this is a very simple case. What if our JSON contains nested objects (objects within objects)? You may wish to consult this excellent guide (this blogpost took many good points from there). What if you're actually downloading a Javascript file containing JSON? Check this out.
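As a closing aside (not part of the original workflow), the jsonlite package mentioned earlier can be used in much the same way; with simplifyVector=FALSE it also returns nested lists, so the same extraction code applies. A minimal sketch, with purely illustrative variable names:
library(jsonlite)
RawData2 <- jsonlite::fromJSON("https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.json?accessType=DOWNLOAD",
simplifyVector = FALSE) # keep the nested-list structure rather than auto-simplifying
Data2 <- RawData2$data
employeeNames2 <- sapply(Data2, function(x) x[[9]]) # same extraction as before
head(employeeNames2)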
Sunday, July 5, 2015
Linear Regression in R
Let's load the auto dataset and learn regression techniques in more detail. (For those who have not read previous posts, this is a dataset containing information about 74 cars.)
Suppose we want to find the relationship between the price and weight of a car. One way to do this is
priceweight <- lm(auto$price ~ auto$weight)
Alternatively,
priceweight <- lm(price ~ weight, data=auto)
will work as well.
The command summary(priceweight) will then display details of the regression such as coefficient estimates and p-values.
We can then visualize the data by
plot(auto$weight,auto$price)
abline(priceweight)
Notice that for plot, the first named variable takes the x axis, while the other variable takes the y axis.
The function abline draws a straight line through the plot, using the coefficients produced by the regression. You can always produce other lines if you want. For example,
abline(a=0,b=1,col='blue')
produces a straight line with an intercept of 0 and a slope of 1. In other words, this would be the model if price and weight had a 1:1 relationship (clearly not supported by the data). Notice that specifying a blue color helps to differentiate this line from the best-fit line.
If we wanted to, we could emphasize the best fit line more:
abline(priceweight, lwd = 5)
Here lwd stands for line width. The line will be 5 units thick (the default is 1 unit, so we have increased the width drastically).
Squared terms
It might be reasonable to think that price is a non-linear function of weight. How can we see whether, for example, price is a quadratic function of weight?
An intuitive guess may be
priceweight2 <- lm(price ~ weight + weight^2, data=auto)
But it doesn't quite produce the results we want:
Call:
lm(formula = price ~ weight + weight^2, data = auto)
Coefficients:
(Intercept) weight
-6.707 2.044
What you actually wanted to do was this:
priceweight2 <- lm(price ~ weight + I(weight^2), data=auto)
The I() means "inhibit". Without I(), ^ would be treated as a formula operator; wrapping it in I() allows ^ to be interpreted as an arithmetic operator. If you are confused, think of it this way: ^ has a special meaning when it appears in a formula. Using I() suppresses this special meaning and forces R to treat it in the normal way (exponentiation).
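Equivalently (a small sketch, not from the original post), you can let poly() construct the polynomial terms; raw=TRUE reproduces the plain weight and weight^2 columns, so the fit matches priceweight2:
priceweight2b <- lm(price ~ poly(weight, 2, raw = TRUE), data = auto)
summary(priceweight2b)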
With this in mind, we can do standard F tests to determine if model fitness significantly improves with the addition of a squared term.
anova(priceweight,priceweight2)
Analysis of Variance Table
Model 1: price ~ weight
Model 2: price ~ weight + I(weight^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 72 450831459
2 71 384779934 1 66051525 12.188 0.0008313 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model fitness significantly improves: the null hypothesis is rejected, so the test lends support to the inclusion of a squared term. The test, however, is by no means definitive - a lot more things need to be considered. For example, you could run
plot(priceweight)
plot(priceweight2)
and end up with the following plots:
Linear case:
Quadratic case:
It looks like in the quadratic case, the residuals are much less correlated with the fitted values. This also suggests that a quadratic model is better.
Interaction Variables
Suppose we hypothesized that the relationship between price and weight is different for foreign cars than for domestic cars.
Then we can use
priceweight <- lm(price ~ weight + weight:foreign, data=auto)
We might even think that being foreign has some effect, so let's throw the variable in:
priceweight <- lm(price ~ weight + weight:foreign + foreign, data=auto)
But hey, all these can be simplified:
priceweight <- lm(price ~ weight*foreign, data=auto)
Notice that weight*foreign is simply shorthand for weight + weight:foreign + foreign.
Logistic Regression
Logistic regression is an instance of generalized linear models (GLM). Notice that the variable foreign takes on only two values: Domestic and Foreign. Suppose (purely for the sake of example) we thought that price was a predictor of a car being foreign. We can then run
nonlinear <- glm(foreign~price,data=auto,family=binomial)
summary(nonlinear)
The argument family=binomial ensures that a logistic regression is run (there are many kinds of GLMs, so we want to pick the right one). Naturally, one may be unsure whether "Domestic" is 0 or 1, but one can easily verify this with
contrasts(auto$foreign)
which tells us that R treats "Domestic" as being equal to 0 and "Foreign" as being equal to 1.
predictedprob <- predict(nonlinear, type="response")
auto$predicted <- rep("Domestic", dim(auto)[1])
auto$predicted[predictedprob>=0.5] <- "Foreign"
table(auto$predicted,auto$foreign)
which gives us the confusion matrix (rows are our predictions, columns are the actual values):
           Domestic Foreign
  Domestic       52      22
  Foreign         0       0
The last row may not appear on your screen, but I added it for clarity. Notice that price is a very poor predictor of whether a car is domestic or foreign: our model predicts that all cars are domestically produced, resulting in a large number of errors. We either need to find better predictors, or tweak the threshold above which we predict a car is of foreign origin.
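As an illustration of tweaking the threshold (the 0.3 cutoff below is arbitrary, and predicted2 is just an illustrative column name), we can lower the bar for predicting "Foreign" and rebuild the confusion matrix:
auto$predicted2 <- rep("Domestic", dim(auto)[1])
auto$predicted2[predictedprob >= 0.3] <- "Foreign"
table(auto$predicted2, auto$foreign)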
Saturday, July 4, 2015
Scatterplots with ggplot2
While the standard R installation comes with basic scatterplot functions, one often needs more advanced functions. To this end, one very popular package is ggplot2, where gg refers to "grammar of graphics".
library(ggplot2)
library(choroplethr)
data(df_state_demographics)
names(df_state_demographics)
outputs
[1] "region" "total_population" "percent_white" "percent_black"
[5] "percent_asian" "percent_hispanic" "per_capita_income" "median_rent"
[9] "median_age"
Suppose we want to see the relationship between per capita income and median rent in each state. A simple way of doing this would be
ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
  geom_point(shape=1)
The first argument tells us the dataset we want to use. We can then specify the x and y variables within aes. Think of aes as creating an "aesthetic", or something which allows you to specify which variables go where.
At a bare minimum, ggplot requires us to say how the observations should be drawn; here, geom_point(shape=1) plots each one as a hollow circle. These instructions must be added on as a separate layer, hence the command "+ geom_point(shape=1)". The output follows:
If we want a linear regression line, we can tack on another layer:
ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
  geom_point(shape=1) +
  geom_smooth(method=lm)
which gives us
What if we don't want confidence intervals? Then we can try
ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
  geom_point(shape=1) +
  geom_smooth(method=lm, se=FALSE)
Finally, what if we want to do LOESS? Just omit the arguments within geom_smooth.
ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
  geom_point(shape=1) +
  geom_smooth()
Of course, we would want to make things fancier. For example, we might want to add a title. To make things simple, let's save our base plot as base_plot:
base_plot <- ggplot(df_state_demographics, aes(x=per_capita_income, y=median_rent)) +
  geom_point(shape=1) +
  geom_smooth()
How can we:
1. Add a title?
base_plot + ggtitle("State per capita income and median rent")
2. Add labels?
base_plot + xlab("Per capita income ($)") + ylab("Median rent ($)")
Of course, this is just the tip of the iceberg. You may wish to see this excellent tutorial (part of this blogpost was drawn from there).
Plotting maps with choroplethr
According to the below graphic from the Sacramento Bee, Latinos have overtaken whites to become the most populous ethnic group.
Let's use the choroplethr package to examine the racial distribution of California counties.
Let's start with the preliminaries: installing and loading the package, as well as loading the relevant datasets. We'd also need to load the ggplot2 library, which will come useful in plotting the maps.
install.packages("choroplethr")
library(choroplethr)
data(df_state_demographics)
data(df_county_demographics)
library(ggplot2)
Thereafter, we create the individual maps. Notice that each map is created by adding "layers" on top of each other. For example, choro_white consists of three layers:
1. county population data of whites (county_choropleth accesses df_county_demographics$value)
2. a graph title,
3. a geographical map (coord_map()), which is part of ggplot2.
df_county_demographics$value = df_county_demographics$percent_white
choro_white = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) +
ggtitle("California Counties\n Percent White") +
coord_map()
df_county_demographics$value = df_county_demographics$percent_hispanic
choro_hispanic = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) +
ggtitle("California Counties\n Percent Hispanic") +
coord_map()
df_county_demographics$value = df_county_demographics$percent_asian
choro_asian = county_choropleth(df_county_demographics, state_zoom="california", num_colors=1) +
ggtitle("California Counties\n Percent Asian") +
coord_map()
Finally, we need to display the maps we've produced:
library(gridExtra)
grid.arrange(choro_white, choro_hispanic, choro_asian, nrow=3)
The output:
As is expected, the southern counties have more Hispanics.
Thursday, July 2, 2015
R for Stata users (Part 3)
In part 1 and part 2, we learned how to use R to load datasets, perform simple manipulations, and conduct linear regression.
Here, we will learn how to save work, draw scatterplots, handle dates and times, extract substrings, and encode variables. Make sure that you've opened RStudio and loaded the auto dataset.
Task #11: Save work
Once you've loaded and analyzed the auto dataset, you'll be wondering how to save it. For example, you may need to pass a CSV or Stata file to your colleague. Fortunately, R has lots of capabilities in these areas.
- CSV File
The following command will write a CSV file to RStudio's current working directory:
write.csv(auto, "edited_auto.csv")
If you don't know what the current working directory is, type
getwd()
(short for get working directory). You can then find your file in that directory.
Alternatively, type
write.csv(auto, "C:/Documents/edited_auto.csv")
if you have a folder called Documents and want to place it there.
Important: In R, use forward slashes instead of backward slashes when writing directory paths.
One remark: if I wanted a tab-separated file, I could use the more general "write.table":
write.table(auto, "C:/Documents/edited_auto.csv", sep="\t")
where \t denotes the tab symbol.
- Excel file
install.packages("xlsx")
where xlsx is the specific library we are instructing R to download. Thereafter, we need to load the package:
library(xlsx)
and we can then write our Excel file.
write.xlsx(auto, "C:/auto.xlsx")
The first parameter (auto) tells R which dataframe to save, and the second parameter (C:/auto.xlsx) tells it where to save the file. Notice again the use of a forward slash.
- Stata
Of course, as a veteran Stata user, you may want to save something directly into a .dta file. You'll need a different library, called "foreign".
install.packages("foreign")
library(foreign)
write.dta(auto,"auto.dta")
and remember to locate auto.dta in the working directory.
- SAS
If your collaborators use SAS, not to worry, the foreign package also has SAS capabilities.
write.foreign(auto, "auto.txt", "auto.sas", package="SAS")
The first parameter tells R the dataset to use and the second tells what data file to output. The third tells us to produce a file containing code to interpret the dataset. What kind of code? That's the job of the final parameter, which tells us it's SAS code that should be produced.
Task #12: Draw scatterplots
Plotting graphs in R is pretty much an art. There are many specialized libraries which allow you to draw beautiful graphs. However, for the beginner, it's best to use the simple plot function.
Let's say we've loaded the dataset and want to do a scatterplot of price against weight. We can type
plot(auto$price,auto$weight)
The scatterplot appears in the bottom-right panel of RStudio.
However, it's a bare scatterplot. Let's make it better looking. First off, we can label the x and y axes more appropriately:
plot(auto$price,auto$weight, xlab="Price",ylab="Weight")
We can also add a title:
plot(auto$price,auto$weight, xlab="Price",ylab="Weight", main="Price vs. Weight")
For those who want more advanced graphing, take a look at the ggplot2 and lattice libraries.
Task #13: Handle dates and times
This is a little difficult to teach using the auto dataset (since there are no dates), but you can find a good tutorial here.
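Since the auto dataset has no date variables, here is a tiny self-contained illustration of base R's date handling (the dates below are made up):
d <- as.Date(c("2015-07-02", "2015-07-06"))
format(d, "%d %b %Y") # reformat for display
diff(d)               # difference in days
weekdays(d)           # day of the week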
Task #14: Obtain substrings
Take a look at the variable "make". Notice that it takes on values such as "AMC Concord", "AMC Pacer", "Buick Century", and so on. Let's say that the first word is the car manufacturer's name, and the second word refers to the model. How can we separate the two?
Libraries come to the rescue again.
install.packages("stringr")
library(stringr)
auto$manufacturer <- str_split_fixed(auto$make, " ",2)[,1]
auto$model <- str_split_fixed(auto$make, " ",2)[,2]
str_split_fixed is a function within the stringr library.
The first parameter tells us which dataframe column we want to split, and " " is the instruction to look for a space. The last parameter (2) means that we are expecting two columns. Finally, [,1] and [,2] refer to the first and second column respectively.
At this point, it's useful to check that you've got the desired results.
Task #15: Encoding variables.
Look at the variable "foreign". In Stata, there is the "encode" command, which will allow you to code the different string values of foreign into numeric variables.
Here, you can do
auto$foreign_dummy <- as.numeric(auto$foreign)
and you will see 1's and 2's.
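If you would rather have a 0/1 dummy (closer to what Stata users often expect), here is a small sketch, assuming foreign takes the value "Foreign" for foreign cars (foreign01 is just an illustrative name):
auto$foreign01 <- as.numeric(auto$foreign == "Foreign")
table(auto$foreign01, auto$foreign) # check the coding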