Tidying data using gather()

October 30-31, 2018

Motivations and goals

Today, we’ll be following parts from R for Data Science, Chapter 12.3.1.

Today, I’ll give you some additional tools for data cleaning. Data cleaning can often be the most time-consuming process of your data analysis. Hopefully, over the next couple of weeks, we can identify ways to make that cleaning process more consistent and efficient.

We’ll continue to use tidyverse, specifically tidyr within tidyverse. Today, we’ll review gather(). gather() helps tidy data in which a single variable is spread across multiple columns. Let’s begin by looking at some examples of when this may be the case.

Set-up

As usual, let’s load tidyverse.

library(tidyverse)

Representing data in different ways

The tables below show the same data organized in different ways. Take a moment to consider how these tables represent the data and how that relates to how they may be analyzed. In some cases, is it even possible to analyze the data when in this format?

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

What makes a dataset “tidy”?

There are generally three rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Figure 12.1, R for Data Science

Of the examples above, which table is tidy?

Gathering

Common problem with raw data: * Some of the column names are not names of variables, but values of a variable

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

To make this dataset ready for analysis, we need to gather these columns into a new pair of variables. What pair of new variables are we seeking to generate?

Key-Value Pairs

This explanation of key-value pairs has been adapted from Garrett Grolemund’s Data Science with R, Section 2.2.1.

What is a key value pair? A key value pair is a simple way to record information. A pair contains two parts: a key that explains what the information describes, and a value that contains the actual information.

Examples of key:value pairs:

Country: Afghanistan
Country: Brazil
Country: China
Year: 1999
Year: 2000
Year: 2001
Population:   19987071
Population:   20595360
Population:  172006362
Population:  174504898
Population: 1272915272
Population: 1280428583
Cases:    745
Cases:   2666
Cases:  37737
Cases:  80488
Cases: 212258
Cases: 213766

What are the parameters of the dataset that we need to have to generate these new variables? * The set of columns in question: 1999 and 2000 * The name of the new variable we want to generate based on the column names (i.e. key): year * The name of the new variable we want to generate based on the values spread over cells (i.e. values): cases

Let’s put these parameters together into gather().

table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Here are another few ways to do the same thing:

gather(table4a, key = "year", value = "cases", 2:3)
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
gather(table4a, key = "year", value = "cases", c(2, 3))
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
gather(table4a, key = year, value = cases, -1)
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Note that, under the hood, gather() uses select() to operate. Luckily, that’s a function with which we’re already quite familiar. Let’s do another example where we can see that gather() is using the helper functions of select() to make our lives easier.

# Glancing at the dataset we'll use for this example
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Implementing `gather()` using `contains()`, a helper function of `select()`
iris %>% 
  gather(key = flower_att, value = measurement, contains("Sepal"))
##     Petal.Length Petal.Width    Species   flower_att measurement
## 1            1.4         0.2     setosa Sepal.Length         5.1
## 2            1.4         0.2     setosa Sepal.Length         4.9
## 3            1.3         0.2     setosa Sepal.Length         4.7
## 4            1.5         0.2     setosa Sepal.Length         4.6
## 5            1.4         0.2     setosa Sepal.Length         5.0
## 6            1.7         0.4     setosa Sepal.Length         5.4
## 7            1.4         0.3     setosa Sepal.Length         4.6
## 8            1.5         0.2     setosa Sepal.Length         5.0
## 9            1.4         0.2     setosa Sepal.Length         4.4
## 10           1.5         0.1     setosa Sepal.Length         4.9
##  [ reached getOption("max.print") -- omitted 290 rows ]

Taking a look at the documentation

A critical skill of using R is being able to interpret and understand the R documentation for various functions. Let’s see how our understanding of gather() maps onto the gather() documentation. Here’s another version of the same documentation.

What does the convert option do?

What does the factor_key option do?

Using stackoverflow.com

stackoverflow.com is another resource, but can be tricky to use because you need to first sift through users’ posts to find a similar problem to yours and then understand the suggested solution(s).

Often this involves:

It’s helpful if you know what function you’re hoping to use because you can use the term in your search query.

  1. Specifying what you want to do (in various ways)
  1. Sifting through posts to find a discussion on your problem

Here is the first link that comes up with the above search: r - Gather multiple sets of columns - Stack Overflow

Here is another link that is further down in the list, but may prove to be more straightforward: How to specify multiple columns with gather() function to tidy data

Note that the first post of the thread describes a user’s problem. As a result, this code block is code that is not functioning as the user expected. Do not look at the first post for the solution, but rather compare their problem to yours.

  1. Once you’ve found a similar problem, look through community responses for a potential solution.

The most helpful answers are upvoted by other users on stackoverflow.com. The number of upvotes are denoted on the left. When the user who posted the problem accepts a solution as the best among those suggested, a green check appears.