Today’s session follows R for Data Science, Chapter 12.4.
Before a dataset is cleaned, it’s possible that multiple variables are stored in a single variable. For example, in table3
below, we see that the rate
variable is a fraction that includes both the number of cases and population for each observation.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
In most cases, we don’t want to analyze data in this format. We’d more likely either want a single value for rate or the number of cases and population count stored in separate variables. We’ll cover how to do this easily using separate()
in the tidyverse
library.
The complement of separate()
is unite()
. As its name suggests, unite()
can be used to create a single variable of elements spread across multiple columns.
Let’s load tidyverse
.
library(tidyverse)
separate()
separate()
pulls apart one column into multiple columns by splitting wherever a separator character appears.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Figure 12.4, R for Data Science
By default, separate()
will split values wherever it sees a non-alphanumeric character. However, you can also specify which character to be used to separate a column.
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
What type of variables are our new variables, cases
and population
?
How could we cast the cases
and population
as the appropriate variable types?
Hint: separate()
has options similar to those we covered in gather()
and spread()
Here’s the documentation.
We could do this by hand, but tidyr
provides us with a nice option to do this as we’re separating the variable.
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can also pass vectors of integers to sep
to identify positions at which to split a variable.
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite()
To demonstrate unite()
, let’s recombine the year
variable that we just separated. (This is also stored as table5
in tidyr
.) Figure 12.5, R for Data Science
table5 %>%
unite(new, century, year)
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
Does this give us what we want?
The default sep
option will place an _
between the values from different columns.
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
extra
and fill
arguments do in separate()
? You can experiment with the toy datasets below.tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
remove
option do?