Office Hours 4

September 25/26, 2018

Today’s goals

Recap

So far in R Office Hours (whether in person or by following along on your own at home) we have covered several important skills, which we'll put into practice in the review below.

What’s new

All of what we've done so far has relied on a dataset that already has the variables we want. What about when we want to create new variables based on the ones already in the dataset? We'll use the mutate() function to create new variables.

Next, we'll look at how to chain a series of actions together to make your code more readable and easier to check and edit. This will allow us to put together everything we've learned so far!

Review

There are actually several related datasets about the states which can be linked together. You can read about them here. Today we’ll be using a dataset that’s provided with R, state.x77, which has measures on each of the 50 states from the 1970s. It’s actually provided as a matrix, so we’ll turn it into a dataframe. Since we’re using the “tidyverse” set of functions again, we’ll turn it into the tidyverse version of a dataframe, a “tibble”, which has some better properties when we want to look at it.

library(tidyverse)
states <- as_tibble(state.x77)
states
## # A tibble: 50 x 8
##    Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
##  1       3615   3624        2.1       69.0   15.1      41.3    20  50708
##  2        365   6315        1.5       69.3   11.3      66.7   152 566432
##  3       2212   4530        1.8       70.6    7.8      58.1    15 113417
##  4       2110   3378        1.9       70.7   10.1      39.9    65  51945
##  5      21198   5114        1.1       71.7   10.3      62.6    20 156361
##  6       2541   4884        0.7       72.1    6.8      63.9   166 103766
##  7       3100   5348        1.1       72.5    3.1      56     139   4862
##  8        579   4809        0.9       70.1    6.2      54.6   103   1982
##  9       8277   4815        1.3       70.7   10.7      52.6    11  54090
## 10       4931   4091        2         68.5   13.9      40.6    60  58073
## # ... with 40 more rows

Sometimes datasets come with variables whose names have spaces in them. This makes them very hard to deal with, because every time you refer to the variable, you have to enclose it in backticks. For example, this dataset has a variable named Life Exp, which means that we would have to write states$`Life Exp` to access it. Let’s rename that and the variable HS Grad to have no spaces in their names. (Here and elsewhere, try to write the code yourself using what you’ve previously learned before looking at how I did it!) The glimpse() function is another great way to look at the variables in a dataset.

states <- rename(states,
                 LifeExp = `Life Exp`,
                 HSGrad = `HS Grad`)
glimpse(states)
## Observations: 50
## Variables: 8
## $ Population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277...
## $ Income     <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 481...
## $ Illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1...
## $ LifeExp    <dbl> 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70...
## $ Murder     <dbl> 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 1...
## $ HSGrad     <dbl> 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52....
## $ Frost      <dbl> 20, 152, 15, 65, 20, 166, 139, 103, 11, 60, 0, 126,...
## $ Area       <dbl> 50708, 566432, 113417, 51945, 156361, 103766, 4862,...

Now let’s practice subsetting the dataset by making a new dataset, large_cold_states, that contains only the states with > 100 days of frost (variable Frost) and > 100,000 square miles (Area).

large_cold_states <- filter(states, Frost > 100 & Area > 100000)

Now let’s look at the data by creating a scatterplot of the per capita income (Income) compared to the percentage of illiteracy (Illiteracy). We’ll color the observations by their population (Population). Do this both in the whole data and in the large, cold states.

ggplot(states) + geom_point(aes(x = Income, y = Illiteracy, col = Population))

ggplot(large_cold_states) + geom_point(aes(x = Income, y = Illiteracy, col = Population))

What do we notice? First of all, we have to be careful to notice that the axes have changed, as has the colorbar in the legend. Since income in the large, cold states ranges only from about $3,500 to $6,500 per capita, while in the full dataset it goes as low as about $3,000, the scale of the x-axis has adjusted to fit the data in the plot. There are ways to set the axis limits, of course, but we won't worry about that now.
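If you're curious, one way is with xlim(); here's a minimal sketch (the limits of roughly $3,000 to $6,500 are just read off the full-data plot, so treat them as an assumption you can adjust):

# Sketch: fix the x-axis so this plot shares a scale with the full-data plot
ggplot(large_cold_states) +
  geom_point(aes(x = Income, y = Illiteracy, col = Population)) +
  xlim(3000, 6500)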

What do you notice about the population in these particular states, compared to the states all together?

It looks like these cold states are large in area, but quite sparsely populated!

mutate()

You may have been confused by the scale of the Population variable too… obviously there’s no state with just 500 people in it. These are populations in 1000s, which is helpful in some ways, but often we want the raw number, especially if we’re going to do calculations with it. Let’s create a new variable for the actual population. We’ll do this using the mutate() function, which allows us to create a new variable in the dataset which is a function of one or more of the others.

For example, let's create the variable PopReal, which is Population × 1000.

states <- mutate(states, PopReal = Population * 1000)
states
## # A tibble: 50 x 9
##    Population Income Illiteracy LifeExp Murder HSGrad Frost   Area  PopReal
##         <dbl>  <dbl>      <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>    <dbl>
##  1       3615   3624        2.1    69.0   15.1   41.3    20  50708  3615000
##  2        365   6315        1.5    69.3   11.3   66.7   152 566432   365000
##  3       2212   4530        1.8    70.6    7.8   58.1    15 113417  2212000
##  4       2110   3378        1.9    70.7   10.1   39.9    65  51945  2110000
##  5      21198   5114        1.1    71.7   10.3   62.6    20 156361 21198000
##  6       2541   4884        0.7    72.1    6.8   63.9   166 103766  2541000
##  7       3100   5348        1.1    72.5    3.1   56     139   4862  3100000
##  8        579   4809        0.9    70.1    6.2   54.6   103   1982   579000
##  9       8277   4815        1.3    70.7   10.7   52.6    11  54090  8277000
## 10       4931   4091        2      68.5   13.9   40.6    60  58073  4931000
## # ... with 40 more rows

Notice that the original variable is still there, and the new one has been appended to the end of the dataset.

Try creating a variable PopPerMile, which is the number of people per square mile in each state.

Then make a scatterplot of the population per square mile (y-axis) and the number of days of frost (x-axis). Add a regression line to the graph. Does it look like there’s a relationship between cold days and population density?

Is this what you did?

states <- mutate(states, PopPerMile = PopReal / Area)
ggplot(states, aes(y = PopPerMile, x = Frost)) + geom_point() + geom_smooth(method = "lm")

Besides simple mathematical operations, you can use any number of other functions to create new variables. For example, we often make categories according to quantiles in the data. Let's say we want to classify the states by quartile of per capita income. We can use the function ntile(Income, 4), which will assign the values 1 to 4 to the states, according to which quartile they fall in.
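To see what ntile() does on its own, here's a quick toy call (made-up numbers, not part of the states data):

# ntile() ranks the values and splits them into n groups of (roughly) equal size
ntile(c(10, 20, 30, 40, 50, 60, 70, 80), 4)
## [1] 1 1 2 2 3 3 4 4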

states <- mutate(states, Qtl_Inc = ntile(Income, 4))

We previously used col = <var> to assign colors to observations in geom_point(). With geom_histogram(), we actually want to fill in the bars with color, so we use fill = <var>. You can also use a number of other arguments depending on the type of graph and type of variable, like shape = <var> or size = <var>. Let's check that the ntile() call did what we wanted: we'll make a histogram of the income levels and color the bars according to the quartile. We expect a histogram chopped into four colors.

ggplot(states) + geom_histogram(aes(Income, fill = factor(Qtl_Inc)))

If you’ve forgotten about the ifelse() function, you can look back to week 1 to see how it is used. It takes 3 arguments: a true/false statement, the value that should be assigned if true, the value that should be assigned if false.
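As a quick refresher, here's a toy call (made-up numbers, not from the states data):

# ifelse() checks each element of the condition and returns the matching value
ifelse(c(1, 5, 10) > 3, "big", "small")
## [1] "small" "big"   "big"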

Now try to use another function that works well with mutate() to create new variables. We want to categorize the states into “Cold” states (those with > 100 days of frost) and “Warm” states (all the rest).

Use the ifelse() function to create a new variable Temp with values "Cold" or "Warm" depending on each state's days of frost. Then make a histogram of the Frost variable, with the fill color according to your new variable, to check your work.

Did you do something like this?

states <- mutate(states, Temp = ifelse(Frost > 100, "Cold", "Warm"))
ggplot(states) + geom_histogram(aes(Frost, fill = Temp))

Putting it all together

Often, when you're preparing a dataset for analysis, you go through a lot of steps like these. It gets very tedious to have line after line of states <- function(states, ...). It's also not great practice, because you are writing over the same dataset multiple times, so it's hard to go back and change your code without rerunning the whole thing. To avoid that you might create states1 <-, states2 <-, states3 <-, etc., which is annoying, or you have to come up with better, more descriptive names for each intermediate dataset. Other times you end up nesting a lot of functions inside one another (e.g., make a new variable, perform a calculation on it, save that as a new variable, etc.) all in one line. This can get really hard to read, since there are often a lot of parentheses and you have to read the code from the inside out.

One of the principles of the “tidyverse”, which most of the functions we’ve learned so far are from, is to use “piping” to connect a number of actions.
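Here's a rough sketch of what that looks like, redoing the preparation steps from above as one chain (the %>% pipe hands the result of each step to the next function as its first argument):

# Sketch: the same preparation steps from above, chained together with %>%
states <- as_tibble(state.x77) %>%
  rename(LifeExp = `Life Exp`, HSGrad = `HS Grad`) %>%
  mutate(PopReal = Population * 1000,
         PopPerMile = PopReal / Area,
         Qtl_Inc = ntile(Income, 4),
         Temp = ifelse(Frost > 100, "Cold", "Warm"))

Notice there's only one assignment, and you read the steps top to bottom instead of inside out.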

There’s some description and code that goes along with this example here. Unfortunately it’s in French! (The code is in English, though, so it’s still worth a look even if you don’t speak French.) This is my favorite visualization of the concept: