# Separating and uniting variables in your dataset

## Motivation and goals

Todayâ€™s session follows R for Data Science, Chapter 12.4.

Before a dataset is cleaned, itâ€™s possible that multiple variables are stored in a single variable. For example, in table3 below, we see that the rate variable is a fraction that includes both the number of cases and population for each observation.

table3
## # A tibble: 6 x 3
##   country      year rate
## * <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
## 4 Brazil       2000 80488/174504898
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

In most cases, we donâ€™t want to analyze data in this format. Weâ€™d more likely either want a single value for rate or the number of cases and population count stored in separate variables. Weâ€™ll cover how to do this easily using separate() in the tidyverse library.

The complement of separate() is unite(). As its name suggests, unite() can be used to create a single variable of elements spread across multiple columns.

## Set-up

Letâ€™s load tidyverse.

library(tidyverse)

## Separating single variables into multiple variables using separate()

separate() pulls apart one column into multiple columns by splitting wherever a separator character appears.

table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>
## 1 Afghanistan  1999 745    19987071
## 2 Afghanistan  2000 2666   20595360
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

By default, separate() will split values wherever it sees a non-alphanumeric character. However, you can also specify which character to be used to separate a column.

table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")

What type of variables are our new variables, cases and population?

How could we cast the cases and population as the appropriate variable types?

Hint: separate() has options similar to those we covered in gather() and spread()

Hereâ€™s the documentation.

We could do this by hand, but tidyr provides us with a nice option to do this as weâ€™re separating the variable.

table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
## * <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can also pass vectors of integers to sep to identify positions at which to split a variable.

table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
##   country     century year  rate
## * <chr>       <chr>   <chr> <chr>
## 1 Afghanistan 19      99    745/19987071
## 2 Afghanistan 20      00    2666/20595360
## 3 Brazil      19      99    37737/172006362
## 4 Brazil      20      00    80488/174504898
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

## Uniting multiple variables into single variables using unite()

To demonstrate unite(), letâ€™s recombine the year variable that we just separated. (This is also stored as table5 in tidyr.) Figure 12.5, R for Data Science

table5 %>%
unite(new, century, year)
## # A tibble: 6 x 3
##   country     new   rate
##   <chr>       <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil      19_99 37737/172006362
## 4 Brazil      20_00 80488/174504898
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

Does this give us what we want?

The default sep option will place an _ between the values from different columns.

table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate
##   <chr>       <chr> <chr>
## 1 Afghanistan 1999  745/19987071
## 2 Afghanistan 2000  2666/20595360
## 3 Brazil      1999  37737/172006362
## 4 Brazil      2000  80488/174504898
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

1. What do the extra and fill arguments do in separate()? You can experiment with the toy datasets below.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
1. What does the remove option do?