Office Hours 4

September 25/26, 2018

Today’s goals

Recap

So far in R Office Hours (whether in person or by following along on your own at home) we have covered several important skills, which we'll put into practice in the review below.

What’s new

All of what we've done so far has relied on a dataset that already has the variables we want. What about when we want to create new variables based on the ones already in the dataset? We'll use the mutate() function to create new variables.

Next, we'll look at how to chain a series of actions together to make your code more readable and easier to check and edit. This will allow us to put together everything we've learned so far!

Review

There are actually several related datasets about the states which can be linked together. You can read about them here. Today we’ll be using a dataset that’s provided with R, state.x77, which has measures on each of the 50 states from the 1970s. It’s actually provided as a matrix, so we’ll turn it into a dataframe. Since we’re using the “tidyverse” set of functions again, we’ll turn it into the tidyverse version of a dataframe, a “tibble”, which has some better properties when we want to look at it.

library(tidyverse)
states <- as_tibble(state.x77)
states
## # A tibble: 50 x 8
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
##  1       3615   3624        2.1       69.0   15.1      41.3    20  50708
##  2        365   6315        1.5       69.3   11.3      66.7   152 566432
##  3       2212   4530        1.8       70.6    7.8      58.1    15 113417
##  4       2110   3378        1.9       70.7   10.1      39.9    65  51945
##  5      21198   5114        1.1       71.7   10.3      62.6    20 156361
##  6       2541   4884        0.7       72.1    6.8      63.9   166 103766
##  7       3100   5348        1.1       72.5    3.1      56     139   4862
##  8        579   4809        0.9       70.1    6.2      54.6   103   1982
##  9       8277   4815        1.3       70.7   10.7      52.6    11  54090
## 10       4931   4091        2         68.5   13.9      40.6    60  58073
## # ... with 40 more rows

Sometimes datasets come with variables whose names have spaces in them. This makes them very hard to deal with, because every time you refer to the variable, you have to enclose it in backticks. For example, this dataset has a variable named Life Exp, which means that we would have to write states$`Life Exp` to access it. Let’s rename that and the variable HS Grad to have no spaces in their names. (Here and elsewhere, try to write the code yourself using what you’ve previously learned before looking at how I did it!) The glimpse() function is another great way to look at the variables in a dataset.

states <- rename(states,
                 LifeExp = `Life Exp`,
                 HSGrad = `HS Grad`)
glimpse(states)
## Observations: 50
## Variables: 8
## $ Population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277...
## $ Income     <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 481...
## $ Illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1...
## $ LifeExp    <dbl> 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70...
## $ Murder     <dbl> 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 1...
## $ HSGrad     <dbl> 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52....
## $ Frost      <dbl> 20, 152, 15, 65, 20, 166, 139, 103, 11, 60, 0, 126,...
## $ Area       <dbl> 50708, 566432, 113417, 51945, 156361, 103766, 4862,...

Now let’s practice subsetting the dataset by making a new dataset, large_cold_states, that contains only the states with > 100 days of frost (variable Frost) and > 100,000 square miles (Area).

large_cold_states <- filter(states, Frost > 100 & Area > 100000)

Now let’s look at the data by creating a scatterplot of the per capita income (Income) compared to the percentage of illiteracy (Illiteracy). We’ll color the observations by their population (Population). Do this both in the whole data and in the large, cold states.

ggplot(states) + geom_point(aes(x = Income, y = Illiteracy, col = Population))

ggplot(large_cold_states) + geom_point(aes(x = Income, y = Illiteracy, col = Population))

What do we notice? First of all, we have to be careful to notice that the axes have changed, as has the colorbar in the legend. Since income in the large, cold states ranges only from about $3,500 to $6,500 per capita, while in the full dataset it goes as low as about $3,000, the scale of the x-axis has adjusted to fit the data in the plot. There are ways to set the axis limits, of course, but we won't worry about that now.
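If you're curious, one way is with xlim(); here's a minimal sketch (the limits of roughly $3,000 to $6,500 are just read off the full-data plot, so treat them as an assumption you can adjust):

# Sketch: fix the x-axis so this plot shares a scale with the full-data plot
ggplot(large_cold_states) +
  geom_point(aes(x = Income, y = Illiteracy, col = Population)) +
  xlim(3000, 6500)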

What do you notice about the population in these particular states, compared to the states all together?

It looks like these cold states are large in area, but quite sparsely populated!

mutate()

You may have been confused by the scale of the Population variable too… obviously there’s no state with just 500 people in it. These are populations in 1000s, which is helpful in some ways, but often we want the raw number, especially if we’re going to do calculations with it. Let’s create a new variable for the actual population. We’ll do this using the mutate() function, which allows us to create a new variable in the dataset which is a function of one or more of the others.

For example, let's create the variable PopReal, which is Population × 1000.

states <- mutate(states, PopReal = Population * 1000)
states
## # A tibble: 50 x 9
##    Population Income Illiteracy LifeExp Murder HSGrad Frost   Area  PopReal
##         <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>    <dbl>
##  1       3615   3624        2.1    69.0   15.1   41.3    20  50708  3615000
##  2        365   6315        1.5    69.3   11.3   66.7   152 566432   365000
##  3       2212   4530        1.8    70.6    7.8   58.1    15 113417  2212000
##  4       2110   3378        1.9    70.7   10.1   39.9    65  51945  2110000
##  5      21198   5114        1.1    71.7   10.3   62.6    20 156361 21198000
##  6       2541   4884        0.7    72.1    6.8   63.9   166 103766  2541000
##  7       3100   5348        1.1    72.5    3.1   56     139   4862  3100000
##  8        579   4809        0.9    70.1    6.2   54.6   103   1982   579000
##  9       8277   4815        1.3    70.7   10.7   52.6    11  54090  8277000
## 10       4931   4091        2      68.5   13.9   40.6    60  58073  4931000
## # ... with 40 more rows

Notice that the original variable is still there, and the new one has been appended to the end of the dataset.

Try creating a variable PopPerMile, which is the number of people per square mile in each state.

Then make a scatterplot of the population per square mile (y-axis) and the number of days of frost (x-axis). Add a regression line to the graph. Does it look like there’s a relationship between cold days and population density?

Is this what you did?

states <- mutate(states, PopPerMile = PopReal / Area)
ggplot(states, aes(y = PopPerMile, x = Frost)) + geom_point() + geom_smooth(method = "lm")

Besides simple mathematical operations, you can use any number of other functions to create new variables. For example, we often make categories according to quantiles in the data. Let's say we want to classify the states by quartile of per capita income. We can use the function ntile(Income, 4), which will assign the values 1 to 4 to the states, according to which quartile they fall in.
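To see what ntile() does on its own, here's a quick toy call (made-up numbers, not part of the states data):

# ntile() ranks the values and splits them into n groups of (roughly) equal size
ntile(c(10, 20, 30, 40, 50, 60, 70, 80), 4)
## [1] 1 1 2 2 3 3 4 4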

states <- mutate(states, Qtl_Inc = ntile(Income, 4))

We previously used col = <var> to assign colors to observations in geom_point(). With geom_histogram(), we actually want to fill in the bars with color, so we use fill = <var>. You can also use a number of other arguments depending on the type of graph and type of variable, like shape = <var> or size = <var>. Let's check that the ntile() call did what we wanted: we'll make a histogram of the income levels and color the bars according to the quartile. We expect a histogram chopped into four colors.

ggplot(states) + geom_histogram(aes(Income, fill = factor(Qtl_Inc)))

If you’ve forgotten about the ifelse() function, you can look back to week 1 to see how it is used. It takes 3 arguments: a true/false statement, the value that should be assigned if true, the value that should be assigned if false.
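As a quick refresher, here's a toy call (made-up numbers, not from the states data):

# ifelse() checks each element of the condition and returns the matching value
ifelse(c(1, 5, 10) > 3, "big", "small")
## [1] "small" "big"   "big"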

Now try to use another function that works well with mutate() to create new variables. We want to categorize the states into “Cold” states (those with > 100 days of frost) and “Warm” states (all the rest).

Use the ifelse() function to create a new variable Temp with values "Cold" or "Warm" depending on each state's days of frost. Then make a histogram of the Frost variable, with the fill color according to your new variable, to check your work.

Did you do something like this?

states <- mutate(states, Temp = ifelse(Frost > 100, "Cold", "Warm"))
ggplot(states) + geom_histogram(aes(Frost, fill = Temp))

Putting it all together

Often, when you're preparing a dataset for analysis, you go through a lot of steps like these. It gets very tedious to have line after line of states <- function(states, ...). It's also not great practice, because you are writing over the same dataset multiple times, so it's hard to go back and change your code without rerunning the whole thing. To avoid that you might create states1 <-, states2 <-, states3 <-, etc., which is annoying, or you have to come up with better, more descriptive names for each intermediate dataset. Other times you end up nesting a lot of functions inside one another (e.g., make a new variable, perform a calculation on it, save that as a new variable, etc.) all in one line. This can get really hard to read, since there are often a lot of parentheses and you have to read the code from the inside out.

One of the principles of the “tidyverse”, which most of the functions we’ve learned so far are from, is to use “piping” to connect a number of actions.
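Here's a rough sketch of what that looks like, redoing the preparation steps from above as one chain (the %>% pipe hands the result of each step to the next function as its first argument):

# Sketch: the same preparation steps from above, chained together with %>%
states <- as_tibble(state.x77) %>%
  rename(LifeExp = `Life Exp`, HSGrad = `HS Grad`) %>%
  mutate(PopReal = Population * 1000,
         PopPerMile = PopReal / Area,
         Qtl_Inc = ntile(Income, 4),
         Temp = ifelse(Frost > 100, "Cold", "Warm"))

Notice there's only one assignment, and you read the steps top to bottom instead of inside out.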

There’s some description and code that goes along with this example here. Unfortunately it’s in French! (The code is in English, though, so it’s still worth a look even if you don’t speak French.) This is my favorite visualization of the concept: