Saturday, June 13, 2015

Introduction to R for Excel users

You've been using Excel a lot, but are new to R. You're wondering how to transition from Excel to R smoothly. If so, this post is for you.

First, open the R interface and you should reach this screen:


We're going to use a package called ggplot2. (Think of it as an "add-on" to R). In R, you need to load packages through the library function, so type in the following command:
library(ggplot2)
This package contains the dataset mpg that we need to use:
mpg
You should see lots of lines of text after typing the above command. I've copied and pasted the start and end:

    manufacturer                  model displ year cyl      trans drv cty hwy fl      class
1           audi                     a4   1.8 1999   4   auto(l5)   f  18  29  p    compact
2           audi                     a4   1.8 1999   4 manual(m5)   f  21  29  p    compact
3           audi                     a4   2.0 2008   4 manual(m6)   f  20  31  p    compact
...
232   volkswagen                 passat   2.8 1999   6   auto(l5)   f  16  26  p    midsize
233   volkswagen                 passat   2.8 1999   6 manual(m5)   f  18  26  p    midsize
234   volkswagen                 passat   3.6 2008   6   auto(s6)   f  17  26  p    midsize 

As you can see, there are 234 observations of popular cars from 1999 to 2008.

Sorting

Obviously, the variable "year" contains information about the year of a specific model. Let's say we want to see the latest models first.
How can we tell R to sort the data accordingly? We can use the order command:
mpg[order(mpg$year)]
Notice that mpg is again a reference to the mpg dataset, and by using the square brackets ([]) I'm saying I want to access certain list members, or arrange them in a specific way. the function order is key here: it tells R to sort the data. Finally, mpg$year says, "Sort according to the year variable in the mpg dataset".
R processes the command and outputs:

    manufacturer                  model displ year cyl      trans drv cty hwy fl      class
1           audi                     a4   1.8 1999   4   auto(l5)   f  18  29  p    compact
2           audi                     a4   1.8 1999   4 manual(m5)   f  21  29  p    compact
5           audi                     a4   2.8 1999   6   auto(l5)   f  16  26  p    compact
...

Whoops! Notice that the default is to output in an ascending order. To fix this, simply add a minus sign in front of mpg$year:
mpg[order(-mpg$year)]
Now you should get what you want.

Challenge 1: Sort by year, and then displ
Challenge 2: Sort by year, and then manufacturer (this is a little harder because manufacturer is a string variable)

Manipulating variables

The variable cty refers to "miles per gallon for city driving". Likewise, hwy refers to "miles per gallon for highway driving". Let's say I'm sending this dataset to a European friend, and I'd like to convert them into liters per 100 km.

Working through the math, we realize that 1 mile per gallon is 235 liters per 100 km.

It's easy to do the conversion:

mpg$hwy <- mpg$hwy * 235
mpg$cty <- mpg$cty * 235  

If we were driving on the highways 40% of the time, and driving within the city 60% of the time, it would be easy to construct a weighted average:

mpg$wtavglp100k <- mpg$hwy*0.4 + mpg$cty*0.6

lp100k is shorthand for liters per 100 km.

Rename variables

Recall we had done some metric conversion for our European friend. How can we remind ourselves that we've done these conversion?
The following commands should make clear what units the variables are in: 
names(mpg)[8] <- "lp100k_cty"
names(mpg)[9] <- "lp100k_hwy" 

Capturing part of a variable

Let's take a look at the trans variable closely:
summary(mpg$trans)
outputs

auto(av)   auto(l3)   auto(l4)   ...
      5          2         83    ...
auto(s6) manual(m5) manual(m6) 
     16         58         19 

How can we shave off the items in brackets?

There are several methods available. Some methods are short and sweet but rely on a crucial aspect of the dataset we're working on now (e.g. there are only auto and manual cars). Others are longer, but are more generalizable (i.e. the methods can be applied to other datasets as well).

Method 1

mpg$transmission <- substring(mpg$trans,1,4)
mpg$transmission[mpg$transmission=="manu"] <- "manual"

Method 2

use gsub, which substitutes an expression for another.

Note: we need to use double backslash (i.e. \\) because "(" is considered to be a metacharacter. (Think of a metacharacter as a special character).

mpg$trans <- gsub(\\(.*$","",mpg$trans) # best
or
mpg$trans <- gsub(\\(..\\)","",mpg$trans)
or
mpg$trans <- gsub(\\([a-z][a-z0-9]\\)","",mpg$trans)

In the first command, .*$ pick up everything after the opening bracket (. Likewise, ..\\) and ([a-z][a-z0-9]\\) do the same. This document produced by Svetlana Eden is an excellent guide to knowing the finer details of what was done.

Method 3

a <- regexpr("\\(",mpg$trans) # finds the position of (
mpg$trans <- substr(mpg$trans,0,a-1) # drops everything including and after (
rm(a) # removes the variable a, which was intended to be a temporary variable

Using any of these methods, all the text following ( is now deleted. We can verify this by typing mpg$trans or (better) if we want to be doubly sure, type unique(mpg$trans)

No comments:

Post a Comment