Today’s session comes from R for Data Science, Chapter 15.
Today, we’re going to cover another variable class in R: factors. We’ll review how to create and modify factors.
factor
- A type of variable in R that takes on a limited number of different values, i.e., a categorical variable.
level
- All possible ‘categories’ of a factor.
For the past few weeks, we’ve been exploring the world of tidyverse. We’ll continue to use tidyverse, and specifically, some functions from forcats, which is newly built into tidyverse.
library(tidyverse)
We’ll use a subset of the General Social Survey because it has a number of categorical variables. It can be loaded from forcats.
# Load the data
forcats::gss_cat
# Take a first look
gss_cat
ggplot & piping
For a quick review, take a look at past office hours sessions on ggplot and piping.
How many observations are in the dataset?
How many variables are in the dataset?
How many of them are factor variables?
What are the levels of the factor variables?
Like many things in R, there are multiple ways to take a look at the levels of a factor. Here are a few:
levels(gss_cat$race) # Access the levels directly
## [1] "Other" "Black" "White" "Not applicable"
gss_cat %>% count(race) # Count observations in each level
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
Note that these commands give different output: levels() lists every level of the factor, while count() gives you the number of observations in each category but doesn’t include levels with zero counts in its table output.
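If you want a tally that keeps the zero-count levels, base R's table() is one option, because it reports every declared level of a factor. The sketch below uses a small made-up vector (the values are illustrative, not the gss_cat data):

```r
# Toy factor with a level ("Not applicable") that never occurs in the data
race <- factor(
  c("White", "Black", "White", "Other"),
  levels = c("Other", "Black", "White", "Not applicable")
)

levels(race)                # all four declared levels
race_counts <- table(race)  # one count per level, zeros included
race_counts
```

Here "Not applicable" appears in the table with a count of 0 instead of being dropped.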
Let’s take a look at some visualizations.
ggplot(gss_cat, aes(race)) +
  geom_bar()
Note that ggplot drops levels that don’t have any values. ggplot has an option to show all levels:
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
We’ve taken a look at the factors of a dataset, but there may be instances when you need to create the variables yourself.
You could do it this way:
x1 <- c("Dec", "Apr", "Jan", "Mar")
This creates a vector of strings (it’s not a factor).
But there are some risks: nothing stops a typo from slipping in, and the values don’t sort in a useful order.
x_typo <- c("Dec", "Apr", "Jam", "Mar")
x_typo
## [1] "Dec" "Apr" "Jam" "Mar"
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
Above, sort() simply ordered the months alphabetically, though calendar order would probably be more helpful.
Using a factor can address these issues. To create a factor, start by creating a vector of the valid levels.
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Create a factor by calling factor() and defining the levels option:
f1 <- factor(x1, levels = month_levels)
f1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(f1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
If you omit the levels option, they’ll be taken from the data in alphabetical order.
factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
Values that are not set in the levels assignment will be silently converted to NA.
f_typo <- factor(x_typo, levels = month_levels)
f_typo
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
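Because the conversion to NA is silent, it can help to check for unexpected values before building the factor. The following is a minimal base R sketch of one way to do that, reusing the month_levels and x_typo vectors from above; the name unknown is just an illustrative choice:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x_typo <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data but missing from the allowed levels
unknown <- setdiff(x_typo, month_levels)
unknown

# Positions of the offending values in the original vector
bad_positions <- which(!x_typo %in% month_levels)
bad_positions
```

If unknown is empty, every value will survive the call to factor(); otherwise the listed values are the ones that would become NA.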
factor order
You can take a look at all of the functions in the forcats package here.
When defining levels, their order matters. Taking a look at month_levels that we assigned above, we will see that the order in which we assigned the levels is retained.
It can be useful to change the order of factor levels for visualizations.
We’ll go over summarise and group_by in detail next week, but let’s use them here to quickly construct something we can use for our visualization example.
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
There’s not much of a pattern here that we can see. Let’s reorder the factor levels according to tvhours using fct_reorder from the forcats package.
relig_summary %>% # Call the dataset that you want to use
  # Redefine the levels of the `relig` factor variable
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
We can see from this example that fct_reorder reorders the levels of a factor according to the values of another variable.
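To see what fct_reorder changes without a plot, here is a small sketch on made-up data (the vectors group and hours are hypothetical stand-ins for relig and tvhours). It only rearranges the levels; the values themselves are untouched:

```r
library(forcats)

group <- factor(c("a", "b", "c"))
hours <- c(3, 1, 2)

levels(group)                          # "a" "b" "c" (alphabetical default)

reordered <- fct_reorder(group, hours) # order levels by increasing hours
levels(reordered)                      # "b" "c" "a"
as.character(reordered)                # "a" "b" "c" (data unchanged)
```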
Think before reordering: some categorical variables already have a principled order. In the following example, it wouldn’t make sense to reorder income levels by increasing age.
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
Since the levels of the rincome variable have a meaningful order, they should probably stay that way.
levels(rincome_summary$rincome)
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
Though we may want to reverse them using fct_rev(), so that people making the most money are highest on the y-axis.
ggplot(rincome_summary, aes(age, fct_rev(rincome))) + geom_point()
It could also help to move “Not applicable” to a more sensible spot, next to the other non-answer levels. For this, we’ll use fct_relevel. Instead of designating a second variable to define the order, fct_relevel takes any levels that you want to move to the front of the list.
levels(fct_relevel(rincome_summary$rincome, "Not applicable"))
## [1] "Not applicable" "No answer" "Don't know" "Refused"
## [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
## [9] "$8000 to 9999" "$7000 to 7999" "$6000 to 6999" "$5000 to 5999"
## [13] "$4000 to 4999" "$3000 to 3999" "$1000 to 2999" "Lt $1000"
# (Note that this isn't saving it in that order in the dataset, just printing it)
rincome_summary %>%
  mutate(
    rincome = fct_relevel(rincome, "Not applicable"),
    rincome = fct_rev(rincome)
  ) %>%
  ggplot(aes(age, rincome)) +
  geom_point()
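fct_relevel can also place levels somewhere other than the front through its after argument. A short sketch on a made-up factor with a few income-style levels (the factor income is hypothetical, not the real rincome column):

```r
library(forcats)

income <- factor(
  c("Lt $1000", "No answer", "$25000 or more", "Not applicable"),
  levels = c("No answer", "$25000 or more", "Lt $1000", "Not applicable")
)

# Default: move the named level to the front
to_front <- fct_relevel(income, "Not applicable")
levels(to_front)

# after = Inf: move the named level to the end instead
to_end <- fct_relevel(income, "No answer", after = Inf)
levels(to_end)
```

As with the printed example above, neither call modifies income itself; assign the result (for instance inside mutate()) to keep the new order.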
fct_reorder2() is another helper function to reorder factor levels by the y values associated with the largest x values.
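A minimal sketch of that behaviour, on made-up data with two series measured at x = 1 and x = 2 (the names series, x, and y are illustrative). With the default settings, the level whose y is largest at its largest x comes first, which is handy for matching a legend to the right-hand end of a line plot:

```r
library(forcats)

series <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(1, 2, 2, 9)

# At x = 2 (the largest x), series "a" has y = 2 and series "b" has y = 9
by_last_y <- fct_reorder2(series, x, y)
levels(by_last_y)   # "b" "a"
```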
Another example of reordering:
fct_infreq() orders levels by frequency, with the most common level first.
fct_rev() reverses the order of factor levels, so combining the two gives levels in increasing frequency.
We can use them in succession without having to create a new variable in between.
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
  geom_bar()
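The effect of chaining the two functions is easy to check on a tiny made-up factor (the vector status below is hypothetical):

```r
library(forcats)

status <- factor(c("a", "a", "a", "b", "c", "c"))

most_common_first <- fct_infreq(status)
levels(most_common_first)   # "a" "c" "b" (counts 3, 2, 1)

least_common_first <- fct_rev(fct_infreq(status))
levels(least_common_first)  # "b" "c" "a"
```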
factor levels
The most general tool to change the values of factor levels is fct_recode().
Let’s take a quick look at partyid. We see that the levels are inconsistent. We can change them to use parallel construction.
levels(gss_cat$partyid)
## [1] "No answer" "Don't know" "Other party"
## [4] "Strong republican" "Not str republican" "Ind,near rep"
## [7] "Independent" "Ind,near dem" "Not str democrat"
## [10] "Strong democrat"
fct_collapse() could be another option. It provides a simple way to collapse multiple levels into fewer.
To combine groups with fct_recode(), assign multiple old levels to the same new level, as with “Other” below.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat",
    "Other" = "No answer",
    "Other" = "Don't know",
    "Other" = "Other party"
  )) %>%
  count(partyid)
## # A tibble: 8 x 2
## partyid n
## <fct> <int>
## 1 Other 548
## 2 Republican, strong 2314
## 3 Republican, weak 3032
## 4 Independent, near rep 1791
## 5 Independent 4119
## 6 Independent, near dem 2499
## 7 Democrat, weak 3690
## 8 Democrat, strong 3490
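For comparison, here is a sketch of the fct_collapse() route mentioned earlier. Each new level is given a vector of the old levels it should absorb. The factor party and the new level names rep and dem are made up for illustration, using a few of the partyid labels:

```r
library(forcats)

party <- factor(c(
  "Strong republican", "Not str republican",
  "Strong democrat", "Not str democrat",
  "Independent"
))

# new_level = c(old levels to merge); unmentioned levels are left alone
collapsed <- fct_collapse(party,
  rep = c("Strong republican", "Not str republican"),
  dem = c("Strong democrat", "Not str democrat")
)

party_counts <- table(collapsed)
party_counts
```

This is often easier to read than repeating the same new name in fct_recode() when many old levels fold into a few groups.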
case_when() to create new levels conditionally
case_when() is an alternative to ifelse(). It can be useful when you want to create new variables that rely on existing variables (i.e., with mutate).
Let’s recode the age variable using case_when().
gss_cat %>%
  mutate(
    age_cat = case_when(
      age < 40 ~ "youngest",
      age >= 40 & age <= 50 ~ "middle", # both bounds must hold, so use & rather than |
      age > 50 ~ "oldest"
    )
  )
## # A tibble: 21,483 x 10
## year marital age race rincome partyid relig denom tvhours age_cat
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int> <chr>
## 1 2000 Never m… 26 White $8000 t… Ind,ne… Prot… Sout… 12 younge…
## 2 2000 Divorced 48 White $8000 t… Not st… Prot… Bapt… NA middle
## 3 2000 Widowed 67 White Not app… Indepe… Prot… No d… 2 oldest
## 4 2000 Never m… 39 White Not app… Ind,ne… Orth… Not … 4 younge…
## 5 2000 Divorced 25 White Not app… Not st… None Not … 1 younge…
## 6 2000 Married 25 White $20000 … Strong… Prot… Sout… NA younge…
## 7 2000 Never m… 36 White $25000 … Not st… Chri… Not … 3 younge…
## 8 2000 Divorced 44 White $7000 t… Ind,ne… Prot… Luth… NA middle
## 9 2000 Married 44 White $25000 … Not st… Prot… Other 0 middle
## 10 2000 Married 47 White $25000 … Strong… Prot… Sout… 3 middle
## # ... with 21,473 more rows
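Note that case_when() returns a character column here (the chr type in the output), not a factor. A minimal sketch of turning such labels into a factor with a deliberate level order, using a short made-up age vector instead of the full dataset:

```r
library(dplyr)

age <- c(26, 48, 67, NA, 40, 50)

age_cat <- case_when(
  age < 40 ~ "youngest",
  age >= 40 & age <= 50 ~ "middle",
  age > 50 ~ "oldest"
)

# Set the levels explicitly so they follow age order, not alphabetical order
age_fct <- factor(age_cat, levels = c("youngest", "middle", "oldest"))

levels(age_fct)
table(age_fct, useNA = "ifany")
```

The missing age matches none of the conditions, so it stays NA in both the character vector and the factor.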
In the next example, the final TRUE condition captures everything that hasn’t met one of the earlier criteria:
mtcars %>%
  mutate(carb_new = case_when(
    carb == 1 ~ "one",
    carb == 2 ~ "two",
    carb == 4 ~ "four",
    TRUE ~ "other"
  )) %>%
  head(n = 12) # the 12th observation is "other"!
## mpg cyl disp hp drat wt qsec vs am gear carb carb_new
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 four
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 four
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 one
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 one
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 two
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 one
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 four
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 two
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 two
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 four
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 four
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 other
One tricky thing with case_when() occurs when you have NA values. In the dplyr version used here, every right-hand side of case_when() must have the same type, and a plain NA is logical, so you have to use a typed version of NA whenever you want to create a missing value (dplyr 1.1.0 and later also accept a plain NA). In the above example, if we wanted to code all the “other” observations as missing, we would have to give them the value NA_character_ because we’re assigning the other values to be character labels. (We can later turn the whole variable into a factor variable.) If instead they were numbers, we could use NA_real_.
A helpful function to find which values are NA, and keep them that way, is is.na(). So if you know the variable you’re working with has missing values, the first line of your case_when() statement might be: is.na(x) ~ NA_character_, where x is that variable.
# ERROR!!!
mtcars %>%
  mutate(carb_new = case_when(
    carb == 1 ~ "one",
    carb == 2 ~ "two",
    carb == 4 ~ "four",
    TRUE ~ NA
  )) %>%
  head(n = 12)
## Error in mutate_impl(.data, dots): Evaluation error: must be type character, not logical
## Call `rlang::last_error()` to see a backtrace.
# no error
mtcars %>%
  mutate(carb_new = case_when(
    carb == 1 ~ "one",
    carb == 2 ~ "two",
    carb == 4 ~ "four",
    TRUE ~ NA_character_
  )) %>%
  head(n = 12)
## mpg cyl disp hp drat wt qsec vs am gear carb carb_new
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 four
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 four
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 one
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 one
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 two
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 one
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 four
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 two
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 two
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 four
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 four
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 <NA>
# numbers instead (make a binary variable -- obviously there are other ways to do this!)
mtcars %>%
  mutate(carb_new = case_when(
    carb == 1 ~ 0,
    carb == 2 ~ 1,
    carb == 4 ~ 1,
    TRUE ~ NA_real_
  )) %>%
  head(n = 12)
## mpg cyl disp hp drat wt qsec vs am gear carb carb_new
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 0
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 1
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 1
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 1
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 NA
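The is.na() tip above matters most when there is a catch-all TRUE at the end, because a missing input would otherwise fall through to it. A small sketch on a made-up vector (carb below is illustrative and includes an NA that mtcars does not have):

```r
library(dplyr)

carb <- c(1, 2, NA, 4, 3)

# Without an is.na() guard, the NA falls through to the TRUE branch
without_guard <- case_when(
  carb == 1 ~ "one",
  carb == 2 ~ "two",
  carb == 4 ~ "four",
  TRUE ~ "other"
)
without_guard   # "one" "two" "other" "four" "other"

# With the guard first, missing inputs stay missing
with_guard <- case_when(
  is.na(carb) ~ NA_character_,
  carb == 1 ~ "one",
  carb == 2 ~ "two",
  carb == 4 ~ "four",
  TRUE ~ "other"
)
with_guard      # "one" "two" NA "four" "other"
```

Conditions are checked in order, which is why the guard needs to come before the other branches.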