Separating and uniting variables in your dataset

November 20, 2018

Motivation and goals

Today’s session follows R for Data Science, Chapter 12.4.

Before a dataset is cleaned, it’s possible that multiple variables are stored in a single variable. For example, in table3 below, we see that the rate variable is a fraction that includes both the number of cases and population for each observation.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

In most cases, we don’t want to analyze data in this format. We’d more likely either want a single value for rate or the number of cases and population count stored in separate variables. We’ll cover how to do this easily using separate() in the tidyverse library.

The complement of separate() is unite(). As its name suggests, unite() can be used to create a single variable of elements spread across multiple columns.

Set-up

Let’s load tidyverse.

library(tidyverse)

Separating single variables into multiple variables using separate()

separate() pulls apart one column into multiple columns by splitting wherever a separator character appears.

table3 %>% 
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Figure 12.4, R for Data Science

By default, separate() will split values wherever it sees a non-alphanumeric character. However, you can also specify which character to be used to separate a column.

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

What type of variables are our new variables, cases and population?

How could we cast the cases and population as the appropriate variable types?

Hint: separate() has options similar to those we covered in gather() and spread()

Here’s the documentation.

We could do this by hand, but tidyr provides us with a nice option to do this as we’re separating the variable.

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
## * <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can also pass vectors of integers to sep to identify positions at which to split a variable.

table3 %>%
  separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Uniting multiple variables into single variables using unite()

To demonstrate unite(), let’s recombine the year variable that we just separated. (This is also stored as table5 in tidyr.) Figure 12.5, R for Data Science

table5 %>% 
  unite(new, century, year)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

Does this give us what we want?

The default sep option will place an _ between the values from different columns.

table5 %>% 
  unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Some additional exercises

  1. What do the extra and fill arguments do in separate()? You can experiment with the toy datasets below.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
  1. What does the remove option do?