Today’s session comes from R for Data Science, Chapter 15.
Today, we’re going to cover another variable class in R: factors. We’ll review how to create and modify factors.
factor
- A type of variable in R that takes on a limited number of different values, i.e., a categorical variable.
level
- One of the possible ‘categories’ of a factor.
For the past few weeks, we’ve been exploring the world of tidyverse. We’ll continue to use tidyverse and, specifically, some functions from forcats, which is now built into tidyverse.
library(tidyverse)
We’ll use a subset of the General Social Survey because it has a number of categorical variables. It can be loaded from forcats.
# Load the data
forcats::gss_cat
# Take a first look
gss_cat
ggplot & piping
- For a quick review, take a look at past office hours sessions on ggplot and piping.
How many observations are in the dataset?
How many variables are in the dataset?
How many of them are factor variables?
What are the levels of the factor variables?
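One possible way to answer these questions is sketched below. It assumes only that forcats is installed; the helper name `is_fct` is made up for this illustration.

```r
library(forcats)  # provides gss_cat

nrow(gss_cat)  # number of observations (rows)
ncol(gss_cat)  # number of variables (columns)

# Flag the factor columns, then list and count them
is_fct <- sapply(gss_cat, is.factor)
names(gss_cat)[is_fct]
sum(is_fct)

# Levels of every factor column, returned as a named list
lapply(gss_cat[is_fct], levels)
```

Printing `gss_cat` or calling `str(gss_cat)` shows the same information in a less targeted form.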
Like many things in R, there are multiple ways to take a look at the levels of a factor. Here are a few:
levels(gss_cat$race) # Access the levels directly
## [1] "Other" "Black" "White" "Not applicable"
gss_cat %>% count(race) # Count observations in each level
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
Note that these commands give different output: levels() lists every level defined for the factor, while count gives you the number of observations in each category but doesn’t include levels with zero counts in its table output.
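If you do want the empty level in the table, two options are sketched here. The first assumes a version of dplyr recent enough for `count()` to accept the `.drop` argument; the second uses `fct_count()` from forcats, which tabulates a factor vector directly.

```r
library(dplyr)
library(forcats)

# Keep levels with zero observations in the count table
# (assumes a dplyr version where count() supports `.drop`)
race_counts <- gss_cat %>% count(race, .drop = FALSE)
race_counts

# forcats alternative: one row per level, including empty ones
fct_count(gss_cat$race)
```

Both tables should contain a row for "Not applicable" with a count of zero.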
Let’s take a look at some visualizations.
ggplot(gss_cat, aes(race)) +
geom_bar()
Note that ggplot drops levels that don’t have any values. ggplot has an option to show all levels:
ggplot(gss_cat, aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
We’ve taken a look at the factors of a dataset, but there may be instances when you need to create the variables yourself.
You could do it this way:
x1 <- c("Dec", "Apr", "Jan", "Mar")
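This creates a vector of strings (it’s not a factor).
But there are some risks: nothing stops a typo from slipping in, and strings sort alphabetically rather than in a meaningful order.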
x_typo <- c("Dec", "Apr", "Jam", "Mar")
x_typo
## [1] "Dec" "Apr" "Jam" "Mar"
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
Above, sort() simply ordered the months alphabetically, though month order would probably be more helpful.
Using a factor can address these issues. To create a factor, first create a vector of the valid levels.
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Create a factor by calling factor() and defining the levels argument:
f1 <- factor(x1, levels = month_levels)
f1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(f1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
If you omit the levels argument, the levels will be taken from the data in alphabetical order.
factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
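Alphabetical order is only the default. If you would rather have the levels follow the order in which values first appear in the data, here is a small sketch of two ways to do it; the names `f2` and `f3` are made up for this illustration.

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# Base R: pass the unique values, in order of appearance, as the levels
f2 <- factor(x1, levels = unique(x1))
levels(f2)

# forcats: reorder the levels of an existing factor by first appearance
f3 <- fct_inorder(factor(x1))
levels(f3)
```

In both cases the levels come out as Dec, Apr, Jan, Mar, matching the order of `x1`.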
Values that are not set in the levels assignment will be silently converted to NA.
f_typo <- factor(x_typo, levels = month_levels)
f_typo
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
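Because the conversion to NA is silent, it can help to check for unexpected values explicitly. The sketch below shows two options: a base R comparison with `setdiff()`, and `parse_factor()` from readr (assumed to be installed), which also returns NA for the bad value but raises a warning about it. The name `f_checked` is made up for this illustration.

```r
library(readr)

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x_typo <- c("Dec", "Apr", "Jam", "Mar")

# Base R: which values are not among the allowed levels?
bad_values <- setdiff(x_typo, month_levels)
bad_values

# readr: same NA result as factor(), but with a warning about the failure
f_checked <- parse_factor(x_typo, levels = month_levels)
f_checked
```

Here `bad_values` contains "Jam", which is the entry that becomes NA.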
Factor order
You can take a look at all of the functions in the forcats package in its documentation.
When defining levels, their order matters. Taking a look at month_levels that we assigned above, we will see that the order in which we assigned the levels is retained.
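A quick way to confirm this is to compare the levels of the factor with the vector they came from. The sketch below rebuilds `f1` so that it runs on its own.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")
f1 <- factor(x1, levels = month_levels)

# The levels keep the order we supplied, not the order of the data
levels(f1)
identical(levels(f1), month_levels)

# Each value is stored as its position within the levels
as.integer(f1)
```

The integer codes (12, 4, 1, 3) are what `sort()` uses, which is why sorting `f1` gives month order.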
It can be useful to change the order of factor levels for visualizations.
We’ll go over summarise and group_by in detail next week, but let’s use them here to quickly construct something we can use for our visualization example.
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
There’s not much of a pattern here that we can see. Let’s reorder the factor levels according to tvhours using fct_reorder from the forcats package.
relig_summary %>% # Call the dataset that you want to use
# Redefine the levels of the `relig` factor variable
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
geom_point()
We can see from this example that fct_reorder reorders the levels of a factor according to the values of another variable.
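To see what fct_reorder does with several values per level, here is a small sketch on toy data. The vectors `grp` and `val` are invented for this illustration; `.fun` and `.desc` are arguments of `fct_reorder()`, and by default each level is summarised by its median.

```r
library(forcats)

# Toy data (made-up values): two observations per level
grp <- factor(c("a", "b", "c", "a", "b", "c"))
val <- c(3, 1, 2, 5, 1, 4)

# Default: order levels by the median of val within each level
levels(fct_reorder(grp, val))

# Same summary, but in descending order
levels(fct_reorder(grp, val, .desc = TRUE))

# Use a different summary function, here the maximum
levels(fct_reorder(grp, val, .fun = max))
```

In relig_summary there is only one row per religion, so the summary function makes no difference there; it matters when reordering on unsummarised data.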
Think before reordering: some categorical variables already have a principled order. In the following example, it wouldn’t make sense to reorder income levels by increasing age.
rincome_summary <- gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
Since the levels of the rincome variable have a meaningful order, they should probably stay that way.
levels(rincome_summary$rincome)
## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
Though we may want to reverse them using fct_rev(), so that people making the most money are highest on the y-axis.
ggplot(rincome_summary, aes(age, fct_rev(rincome))) + geom_point()
It could also make sense to move the “Not applicable” level to a more sensible position. For this, we’ll use fct_relevel. Instead of designating a second variable to define the order, fct_relevel takes any levels that you want to move to the front of the list.
levels(fct_relevel(rincome_summary$rincome, "Not applicable"))
## [1] "Not applicable" "No answer" "Don't know" "Refused"
## [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
## [9] "$8000 to 9999" "$7000 to 7999" "$6000 to 6999" "$5000 to 5999"
## [13] "$4000 to 4999" "$3000 to 3999" "$1000 to 2999" "Lt $1000"
# (Note that this isn't saving it in that order in the dataset, just printing it)
rincome_summary %>%
mutate(rincome = fct_relevel(rincome, "Not applicable"),
rincome = fct_rev(rincome)) %>%
ggplot(aes(age, rincome)) +
geom_point()
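fct_relevel can also move levels somewhere other than the front by using its `after` argument. As a sketch, the code below moves the three non-answer levels of rincome to the end with `after = Inf`; the name `inc_end` is made up for this illustration.

```r
library(forcats)

# Move the non-answers to the end of the levels, keeping the order given here
inc_end <- fct_relevel(
  gss_cat$rincome,
  "No answer", "Don't know", "Refused",
  after = Inf
)

head(levels(inc_end), 1)  # the first income bracket now leads
tail(levels(inc_end), 3)  # the relocated levels
```

As before, this only changes the returned factor; use mutate() to store the new order in the dataset.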