Using factors to work with categorical variables

October 2/3, 2018

Motivation and goals

Today’s session comes from R for Data Science, Chapter 15.

Today, we’re going to cover another variable class in R: factors. We’ll review how to create and modify factors.

A (very) short glossary of terms

factor - A type of variable in R that takes on a limited number of different values, i.e., a categorical variable.

level - One of the possible ‘categories’ of a factor. Together, the levels are the full set of values the factor can take.

Set-up

For the past few weeks, we’ve been exploring the world of tidyverse. We’ll continue to use tidyverse, and specifically, some functions from forcats, which is newly built into tidyverse.

library(tidyverse)

We’ll use a subset of the General Social Survey because it has a number of categorical variables. It can be loaded from forcats.

# Load the data
forcats::gss_cat

Looking at the data + a quick review of ggplot & piping

# Take a first look
gss_cat 

For a quick review, take a look at past office hours sessions on ggplot and piping.

How many observations are in the dataset?

How many variables are in the dataset?

How many of them are factor variables?

What are the levels of the factor variables?
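One possible way to answer the first three questions is sketched below with base R helpers. The name n_factors is only used for illustration.

```r
library(forcats)  # provides the gss_cat dataset

nrow(gss_cat)  # number of observations (rows)
ncol(gss_cat)  # number of variables (columns)

# Count the columns that are stored as factors
n_factors <- sum(sapply(gss_cat, is.factor))
n_factors
```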

Like many things in R, there are multiple ways to take a look at the levels of a factor. Here are a few:

levels(gss_cat$race) # Access the levels directly
## [1] "Other"          "Black"          "White"          "Not applicable"
gss_cat %>% count(race) # Count observations in each level 
## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395

Note that each of these commands gives different output: levels() lists every level, including those with no observations, while count() gives you the number of observations in each category but doesn’t include levels with zero counts in its table output.
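If you do want the empty levels in the table, a minimal sketch is to turn off dropping in count(). This assumes a dplyr version that has the .drop argument (0.8.0 or later); the name race_counts is only for illustration.

```r
library(dplyr)
library(forcats)

# Keep levels with zero observations in the count table
race_counts <- gss_cat %>% count(race, .drop = FALSE)
race_counts
```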

Let’s take a look at some visualizations.

ggplot(gss_cat, aes(race)) + 
  geom_bar()

Note that ggplot drops levels that don’t have any values. ggplot has an option to show all levels:

ggplot(gss_cat, aes(race)) +
  geom_bar() + 
  scale_x_discrete(drop = FALSE)

Creating factors

We’ve taken a look at the factors of a dataset, but there may be instances when you need to create the variables yourself.

You could do it this way:

x1 <- c("Dec", "Apr", "Jan", "Mar")

This creates a vector of strings (it’s not a factor).

But there are some risks.

  1. There’s no check on typos.
x_typo <- c("Dec", "Apr", "Jam", "Mar")
x_typo
## [1] "Dec" "Apr" "Jam" "Mar"
  2. It likely doesn’t sort in a useful way.
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"

Above, it simply sorted the months alphabetically, though it would probably be more helpful to be in month order.

Using a factor can address these issues. To create a factor, first create a character vector of the valid levels.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Create a factor by calling factor() and defining the levels option:

f1 <- factor(x1, levels = month_levels)
f1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(f1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

If you omit the levels option, they’ll be taken from the data in alphabetical order.

factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
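If alphabetical order isn’t what you want and you don’t have a predefined set of levels, one option is to use the order in which the values first appear. A minimal base R sketch (f2 is just an illustrative name; forcats also offers fct_inorder() for the same purpose):

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Use the order of first appearance as the level order
f2 <- factor(x1, levels = unique(x1))
levels(f2)
## [1] "Dec" "Apr" "Jan" "Mar"
```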

Values that are not set in the levels assignment will be silently converted to NA.

f_typo <- factor(x_typo, levels = month_levels)
f_typo
## [1] Dec  Apr  <NA> Mar 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
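Because the conversion to NA is silent, it can help to check for unmatched values before building the factor. The sketch below is one way to do that in base R; the name unmatched is only for illustration. (readr::parse_factor() is an alternative that warns about such values.)

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x_typo <- c("Dec", "Apr", "Jam", "Mar")

# Values that do not match any level would become NA
unmatched <- setdiff(x_typo, month_levels)
unmatched
## [1] "Jam"

# Report the problem values instead of losing them silently
if (length(unmatched) > 0) {
  message("Values not in levels: ", paste(unmatched, collapse = ", "))
}
```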

Changing factor order

You can find a list of all of the functions in the forcats package in its online documentation.

When defining levels, their order matters. Taking a look at month_levels that we assigned above, we will see that the order in which we assigned the levels is retained.

It can be useful to change the order of factor levels for visualizations.

We’ll go over summarise and group_by in detail next week, but let’s use them here to quickly construct something we can use for our visualization example.

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

There’s not much of a pattern here that we can see. Let’s reorder the factor levels according to tvhours using fct_reorder from the forcats package.

relig_summary %>% # Call the dataset that you want to use
  # Redefine the levels of the `relig` factor variable
  mutate(relig = fct_reorder(relig, tvhours)) %>% 
  ggplot(aes(tvhours, relig)) + 
    geom_point()

We can see from this example that fct_reorder() reorders the levels of a factor according to the values of another variable.
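To see what happens to the levels themselves, here is a small sketch on made-up data (f, x, and reordered are illustrative names, not part of gss_cat). The data values stay the same; only the level order changes.

```r
library(forcats)

f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)

levels(f)
## [1] "a" "b" "c"

# Levels are sorted by their associated x value, smallest first
reordered <- fct_reorder(f, x)
levels(reordered)
## [1] "b" "c" "a"
```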

Think before reordering – some categorical variables already have a principled order. In the following example, it wouldn’t make sense to reorder income levels by increasing age.

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Since the levels of the rincome variable have a meaningful order, they should probably stay that way.

levels(rincome_summary$rincome)
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

Though we may want to reverse them using fct_rev(), so that people making the most money are highest on the y-axis.

ggplot(rincome_summary, aes(age, fct_rev(rincome))) + geom_point()

It could also make sense to move the “Not applicable” to a spot that makes more sense. For this, we’ll use fct_relevel. Instead of designating a second variable to define the order, fct_relevel takes any levels that you want to move to the front of the list.

levels(fct_relevel(rincome_summary$rincome, "Not applicable"))
##  [1] "Not applicable" "No answer"      "Don't know"     "Refused"       
##  [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
##  [9] "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999" 
## [13] "$4000 to 4999"  "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"
# (Note that this isn't saving it in that order in the dataset, just printing it)
rincome_summary %>%
  mutate(rincome = fct_relevel(rincome, "Not applicable"),
         rincome = fct_rev(rincome)) %>%
  ggplot(aes(age, rincome)) +
  geom_point()
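fct_relevel() can also place levels somewhere other than the front by using its after argument. A short sketch on a made-up factor (f is an illustrative name):

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))

# Default: move the named level to the front
levels(fct_relevel(f, "c"))
## [1] "c" "a" "b" "d"

# after = Inf moves the named level to the end
levels(fct_relevel(f, "a", after = Inf))
## [1] "b" "c" "d" "a"

# after = 2 places it after the second remaining level
levels(fct_relevel(f, "a", after = 2))
## [1] "b" "c" "a" "d"
```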

fct_reorder2() is another helper function to reorder factor levels by the y values associated with the largest x values.
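As a rough illustration on made-up data (df, g, x, and y are hypothetical names), the sketch below orders the groups by their y value at the largest x. With the default settings the group with the highest such y comes first, which is handy for matching a line plot’s legend to the ends of the lines.

```r
library(forcats)

df <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  x = c(1, 2, 1, 2, 1, 2),
  y = c(5, 1, 2, 3, 1, 2)
)

# At the largest x (x = 2), y is 1 for "a", 3 for "b", and 2 for "c"
reordered <- fct_reorder2(df$g, df$x, df$y)
levels(reordered)
## [1] "b" "c" "a"
```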

Another example of reordering:

fct_infreq() can be used to order levels by decreasing frequency, so the most common level comes first.

fct_rev() reverses the order of factor levels.

We can use them in succession without having to create a new variable in between.

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
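A small sketch on a made-up factor (f is an illustrative name) shows what each step does to the level order:

```r
library(forcats)

f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first
levels(fct_infreq(f))
## [1] "z" "y" "x"

# Reversing gives least frequent first
levels(fct_rev(fct_infreq(f)))
## [1] "x" "y" "z"
```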

Changing factor levels

The most general tool to change the values of factor levels is fct_recode().

Let’s take a quick look at partyid. We see that the levels are inconsistent. We can change them to use parallel construction.

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"

fct_collapse() could be another option. It provides a simple way to collapse many levels into a few.

With fct_recode(), you combine groups by assigning multiple old levels to the same new level, as with “Other” below.

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 x 2
##   partyid                   n
##   <fct>                 <int>
## 1 Other                   548
## 2 Republican, strong     2314
## 3 Republican, weak       3032
## 4 Independent, near rep  1791
## 5 Independent            4119
## 6 Independent, near dem  2499
## 7 Democrat, weak         3690
## 8 Democrat, strong       3490
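For comparison, here is a sketch of the fct_collapse() approach, where each new level is given a vector of the old levels it should absorb. The new level names (other, rep, ind, dem) and the name party_collapsed are arbitrary choices for illustration.

```r
library(dplyr)
library(forcats)

party_collapsed <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep   = c("Strong republican", "Not str republican"),
    ind   = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem   = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)

party_collapsed
```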

Using case_when() to create new levels conditionally

case_when() is an alternative to ifelse(). It can be useful when you want to create new variables that rely on existing variables (i.e., with mutate()).

Let’s recode the age variable using case_when().

gss_cat %>% 
  mutate(
    age_cat = case_when(
      age < 40 ~ "youngest",
      age >= 40 & age <= 50 ~ "middle",
      age > 50 ~ "oldest"
    )
  )
## # A tibble: 21,483 x 10
##     year marital    age race  rincome  partyid relig denom tvhours age_cat
##    <int> <fct>    <int> <fct> <fct>    <fct>   <fct> <fct>   <int> <chr>  
##  1  2000 Never m…    26 White $8000 t… Ind,ne… Prot… Sout…      12 younge…
##  2  2000 Divorced    48 White $8000 t… Not st… Prot… Bapt…      NA middle 
##  3  2000 Widowed     67 White Not app… Indepe… Prot… No d…       2 oldest 
##  4  2000 Never m…    39 White Not app… Ind,ne… Orth… Not …       4 younge…
##  5  2000 Divorced    25 White Not app… Not st… None  Not …       1 younge…
##  6  2000 Married     25 White $20000 … Strong… Prot… Sout…      NA younge…
##  7  2000 Never m…    36 White $25000 … Not st… Chri… Not …       3 younge…
##  8  2000 Divorced    44 White $7000 t… Ind,ne… Prot… Luth…      NA middle 
##  9  2000 Married     44 White $25000 … Not st… Prot… Other       0 middle 
## 10  2000 Married     47 White $25000 … Strong… Prot… Sout…       3 middle 
## # ... with 21,473 more rows

A final TRUE condition captures everything that hasn’t met one of the earlier criteria. Here’s another example:

mtcars %>% 
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                 TRUE ~ "other")) %>% 
  head(n = 12) # the 12th observation is "other"!
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb carb_new
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     four
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     four
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      one
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1      one
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2      two
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      one
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     four
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      two
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2      two
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     four
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     four
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3    other

A helpful function to find which values are NA, and keep them that way, is is.na(). So if you know the variable you’re working with has missing values, the first line of your case_when() statement might be: is.na(x) ~ NA_character_, where x is your variable.

One tricky thing with case_when() occurs when you want to create NA values. In the dplyr version used here, every right-hand side of case_when() has to be the same type, so a plain NA (which is logical) doesn’t work and you have to use a typed version of NA instead. In the above example, if we wanted to code all the “other” observations as missing, we would have to give them the value NA_character_ because we’re assigning the other values to be character labels. (We can later turn the whole variable into a factor variable.) If instead they were numbers, we could use NA_real_. Newer versions of dplyr (1.1.0 and later) relax this rule and accept a plain NA.

# ERROR!!!
mtcars %>% 
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                 TRUE ~ NA)) %>% 
  head(n = 12)
## Error in mutate_impl(.data, dots): Evaluation error: must be type character, not logical
## Call `rlang::last_error()` to see a backtrace.
# no error
mtcars %>% 
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                 TRUE ~ NA_character_)) %>% 
  head(n = 12)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb carb_new
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     four
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     four
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1      one
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1      one
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2      two
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1      one
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     four
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      two
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2      two
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     four
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     four
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3     <NA>
# numbers instead (make a binary variable -- obviously there are other ways to do this!)
mtcars %>% 
    mutate(carb_new = case_when(carb == 1 ~ 0,
                                carb == 2 ~ 1,
                                carb == 4 ~ 1,
                                 TRUE ~ NA_real_)) %>% 
  head(n = 12)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb carb_new
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4        1
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4        1
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1        0
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1        0
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2        1
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1        0
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4        1
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2        1
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2        1
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4        1
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4        1
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3       NA
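Putting these pieces together, the sketch below handles missing values first, builds character labels with case_when(), and then converts the result to a factor with a meaningful level order. The ages vector is made up for illustration, and age_cat and age_fct are illustrative names.

```r
library(dplyr)

ages <- c(26, 48, 67, NA)  # made-up values for illustration

age_cat <- case_when(
  is.na(ages) ~ NA_character_,  # keep missing values missing
  ages < 40   ~ "youngest",
  ages <= 50  ~ "middle",       # only reached when ages >= 40
  TRUE        ~ "oldest"        # everything left over
)

# Convert the labels to a factor with an explicit level order
age_fct <- factor(age_cat, levels = c("youngest", "middle", "oldest"))
age_fct
## [1] youngest middle   oldest   <NA>    
## Levels: youngest middle oldest
```

Because conditions are checked in order, the second condition does not need to repeat ages >= 40.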