Sunday, July 12, 2015

Apply, lapply, sapply, tapply in R

How can we use R to efficiently calculate key statistics such as the mean?
Let's play with the auto dataset to see how.

apply(cbind(auto$price,auto$length,auto$weight),2,mean) # finds simple means

apply(auto,2,mean) would not work because some arguments are non-numeric.
Instead, one can try apply(auto[,c(2,3)],2,mean).

If you wanted to find out the number of values each variable took, this would be simple:
apply(auto, 2, function (x) length(unique(x)))

The second argument (2) tells R to work column-by-column. Had 1 been specified, R would have worked row-by-row. You can also use the sapply function, which by default returns a vector:

sapply(auto, function(x) length(unique(x)))

If we wanted the output in terms of a list, we could use lapply:
lapply(auto, function(x) length(unique(x)))

What if we wanted disaggregated statistics? For example, we might want to know what is the mean price of foreign cars, as well as the mean price of domestic cars. We can use

tapply(auto$price, auto$foreign, mean)

Which would be similar to Excel's PivotTable. The by function is also acceptable:

by(auto$price, auto$foreign, mean)

What if the variable has NA values? For example, by(auto$price, auto$foreign, mean) returns us the unglamorous

auto$foreign: Domestic
[1] NA
-----------------
auto$foreign: Foreign
[1] NA

This was because the rep78 contained missing values. We can easily handle this:

tapply(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))
or
by(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))

And what if we wanted to restrict attention to cars with prices above $1000:
tapply(auto$rep78, auto$foreign, function(x) mean(x[auto$price > 10000],na.rm=TRUE))
or
by(auto$rep78, auto$foreign, function(x) mean(x, na.rm=TRUE))

Yet another way is to subset the data first:
m <- subset(auto, !is.na(auto$rep78) & auto$price > 10000)
by(m$rep78, m$foreign, mean)

No comments:

Post a Comment