Friday, June 12, 2015

Spatial analysis

This post continues our last one about crime in Charlotte, North Carolina, USA. (If you haven't read it, that's fine.)

Suppose you had a database containing information on crime committed in Charlotte, containing the location of each crime report. How can you use R to determine which areas have the most crime?

Some preliminaries:
install.packages('maptools') # no need to run this if it's already installed
install.packages('spatstat') # no need to run this if it's already installed
library(maptools)
library(spatstat)

Then, download all files in this folder (these collectively form what is known as a "shapefile": essentially, Charlotte's boundary recorded as a polygon). Place the files in R's working directory (type getwd() if you are unsure what your working directory is).

If you aren't continuing from our last post, you'll also need to load the crime data. Download this file and place it in your working directory, then run the command below:
mydata <- read.table("CLTCrime.csv",sep=",", header=TRUE)  #change the directory if necessary
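
Before going further, it can't hurt to glance at what was loaded. This is just a quick sanity check; the column names longitude and latitude are assumed here because they are used later in this post.

str(mydata) # column names and types
head(mydata) # first few rows
summary(mydata$longitude) # rough range of the coordinates
summary(mydata$latitude)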

Next, the following commands plot the crime incidents within Charlotte's City Council boundary:

CLTbdry <- readShapePoly("CityCouncil_Boundary.shp") # read the shapefile; it contains a polygon in the shape of Charlotte
bdry <- as(CLTbdry, "owin") # coerce the polygon into an 'owin' (observation window) that spatstat can use as a boundary
myppp <- ppp(mydata$longitude, mydata$latitude, window=bdry) # the ppp function creates a point pattern dataset
# the crime locations come from the longitude and latitude columns, and the boundary is supplied via the third argument (window=bdry)
png(filename="clt_crime.png") # start saving the plot to a png file
plot(myppp) # crime committed in Charlotte
dev.off() # close the png device

Opening clt_crime.png, you can see:

Finally, let's divide Charlotte into quadrats (note: these are different from quadrants; quadrats are essentially rectangular "subplots" of an area).

myQC <- quadratcount(myppp) # by default this uses a 5 x 5 grid of quadrats
myQC # print the count of crimes in each quadrat
plot(myQC) # plot the counts over the map, as below

The output of myQC is as follows:

               x
y               [-81.01,-80.94] (-80.94,-80.87] (-80.87,-80.79] (-80.79,-80.72] (-80.72,-80.65]
  (35.32,35.4]                1               3              16               7               2
  (35.24,35.32]              24              27              67              65              21
  (35.17,35.24]               4              75             126              86              28
  (35.09,35.17]              18              55              34              29               1
  [35.01,35.09]               0               2              23              17               0

How would you read the above table? An example: the "126" in the middle cell indicates that for the quadrat (rectangle) with longitude -80.87 to -80.79 and latitude 35.17 to 35.24, there were a total of 126 crimes reported.

And plot(myQC) gives:

You now have two different ways of telling which areas of Charlotte have more crime than others!
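
By the way, if a 5 x 5 grid feels too coarse (or too fine), quadratcount accepts nx and ny arguments to control the grid. The 10 x 10 grid below is just an illustration:

myQC10 <- quadratcount(myppp, nx=10, ny=10) # a finer 10 x 10 grid of quadrats
myQC10 # counts per quadrat
plot(myQC10) # plot the finer grid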

Don't stop reading: there's more. What if we wanted to generate a "heat map" of crime in Charlotte, telling us where the crime hotspots are?

Crime "heat map" of Charlotte

R allows us to generate such heat maps easily through a simple command:
plot(density(myppp, 0.01))
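The second argument (0.01) is the smoothing bandwidth, which density knows as sigma. A slightly more explicit sketch of the same command (the filename is just an example) saves the heat map to a file, mirroring the earlier png() pattern, and overlays the raw crime points:

png(filename="clt_crime_density.png") # save the heat map to a png file
plot(density(myppp, sigma=0.01), main="Crime density, sigma = 0.01")
plot(myppp, add=TRUE, cex=0.3) # overlay the individual crime reports
dev.off()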
The function density does what is called "kernel density estimation". If you don't really know what kernel density estimation is, this site is a good introduction to the intuition behind KDE. The Wikipedia page is pretty good for technically inclined readers too.

But for those short on time, here's a rough overview of what KDE is.

Let's take a look at the picture below. Assume that crime reports are represented by the red dots (on the flat surface).

Source: Wikipedia

For each grid coordinate, KDE produces an estimate of how many points lie "close" to that coordinate.
For example, there are very few red dots near any edge of the grid, so crime incidence is estimated to be negligible there.
On the other hand, the center of the grid contains lots of red dots, so KDE estimates that crime is very high there. Notice the two high peaks in the center. Essentially, the output of the KDE is the "mountain" surface, and the peaks are where crime incidence is estimated to be highest.

We mentioned that KDE generates an estimate of the number of points that are "close" to the grid coordinate. You must be wondering: what exactly does "close" mean? Whether a point counts as close depends on something called the bandwidth, which has to be specified by the user (you). Roughly speaking, if the bandwidth is 0.01 units, then red dots within 0.01 units of a point will be considered close to that point; if the bandwidth is 1 unit, then red dots within 1 unit of a point will be considered close. (This is of course a simplification, intended only to convey the intuition.)
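
For readers who want a little more precision, here is the standard two-dimensional kernel density estimate, with n crime locations x_1, ..., x_n, bandwidth h, and kernel function K (for example a Gaussian bump):

\hat{f}(x) = \frac{1}{n h^{2}} \sum_{i=1}^{n} K\!\left(\frac{x - x_{i}}{h}\right)

Every report contributes to every location, but K assigns distant reports a very small weight, and the bandwidth h is what decides how quickly that weight falls off.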

Let's increase our bandwidth by ten times (to 0.10) and see what happens:

This neatly illustrates the perils of choosing too high a bandwidth: too many points are considered "close" to any given location on the map, so the heat map provides little information about crime in Charlotte.

Having too low a bandwidth can be dangerous too. To illustrate, let's decrease our bandwidth by a hundred times:

Almost no crime reports were judged to have occurred "close" to any grid point in Charlotte. As a result, one might think that Charlotte is crime-free.
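
If you want to reproduce these comparisons yourself, one way is to plot several bandwidths side by side. The following is a sketch; the smallest sigma is simply an illustrative "too small" value:

png(filename="clt_bandwidths.png", width=900, height=300)
par(mfrow=c(1, 3)) # three panels side by side
plot(density(myppp, sigma=0.1), main="sigma = 0.1 (too smooth)")
plot(density(myppp, sigma=0.01), main="sigma = 0.01")
plot(density(myppp, sigma=0.001), main="sigma = 0.001 (too spiky)")
dev.off()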

How should one select the bandwidth? First, it is important to consider the data itself. If you were to type myppp, R would output

planar point pattern: 731 points
window: polygonal boundary
enclosing rectangle: [-81.00963, -80.65147] x [35.01052, 35.40026] units
*** 38 illegal points stored in attr(,"rejects") ***

Crime locations appear to be recorded in degrees of longitude and latitude. (Trivia: at the equator, one degree of longitude or latitude is equivalent to around 111 kilometers/69 miles. At Charlotte's location, one degree of latitude is still about 111km, but one degree of longitude is only around 90km. For simplicity's sake, let's take one degree to be 100km each way here.)
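
(For the curious, the roughly 90km figure comes from scaling 111km by the cosine of the latitude; a one-line check in R:)

111 * cos(35.2 * pi / 180) # about 90.7km per degree of longitude at roughly 35.2 degrees north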

Another point to consider is that myppp consists of individual crime reports, not crime reports at the county (or neighborhood) level.

Here comes the subjective part. Most people would not be bothered by crime reports (e.g. stolen vehicle) that occur 10km away from where they stay. Thus setting the bandwidth to 0.1 degrees (10km) would mean that when calculating "crime intensity" for a specific area, we'd be including crime that wouldn't bother most people.

On the other hand, if we were to set the bandwidth to 0.001 degrees (0.1km, or around 300 feet), we would effectively be excluding crime that would bother most people. (Technical note: I say effectively because, although crime further than 0.1km away is taken into account by most kernels (think of a kernel as an "estimation technique"), the weight assigned to it will typically be very small.)

Thus the sweet spot would probably be around 0.01 degrees, or 1km. Most people would be significantly bothered by crime that occurs within 1km of where they stay, but wouldn't be too bothered by crime that occurs beyond that (unless, for example, there are many such crime incidents).

Of course, if one wanted to be rigorous (e.g. for producing an academic report), one might actually survey Charlotte's residents and find out how far away a crime has to be before they are not bothered by it.

What if one didn't have much specific information about the dataset (e.g. one is breaking new ground in a certain area, or simply doesn't have expertise in it)?

Rules for choosing an optimal bandwidth have been published in academic journals. However, in many respects choosing the best bandwidth is an art. The less mathematically inclined may wish to consider the following rule of thumb:

- Pick an arbitrary bandwidth.
  - If it seems too low, multiply it by three. If it still seems too low, multiply it by three again, and so on, until it becomes too high.
  - Conversely, if the bandwidth seems too high, divide it by three again and again until it seems too low.
- Pick the diagram that best represents the picture you want to convey (you may wish to adjust the bandwidth further before making your final choice); a sketch of this search follows.
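
Here is one way to carry out that multiply-or-divide-by-three search in R; the starting value of 0.003 is arbitrary. For what it's worth, spatstat also provides automated, cross-validation-based bandwidth selectors such as bw.diggle, whose suggestion is best treated as a starting point to compare against your own judgement.

par(mfrow=c(2, 2)) # four panels
for (s in 0.003 * 3^(0:3)) { # 0.003, 0.009, 0.027, 0.081 degrees
  plot(density(myppp, sigma=s), main=paste("sigma =", s))
}

bw.diggle(myppp) # Diggle's cross-validated bandwidth, for comparison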

I hope this introduction to spatial analysis and kernel density estimation in R was interesting. Thanks for reading this! :)

Acknowledgements: Nick Cox provided useful comments on a question I posed on Stack Overflow. All errors are mine.
