gather()
Today, we’ll be following parts from R for Data Science, Chapter 12.3.1.
Today, I’ll give you some additional tools for data cleaning. Data cleaning can often be the most time-consuming process of your data analysis. Hopefully, over the next couple of weeks, we can identify ways to make that cleaning process more consistent and efficient.
We’ll continue to use tidyverse
, specifically tidyr
within tidyverse
. Today, we’ll review gather()
. gather()
helps tidy data in which a single variable is spread across multiple columns. Let’s begin by looking at some examples of when this may be the case.
As usual, let’s load tidyverse
.
library(tidyverse)
The tables below show the same data organized in different ways. Take a moment to consider how these tables represent the data and how that relates to how they may be analyzed. In some cases, is it even possible to analyze the data when in this format?
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
What makes a dataset “tidy”?
There are generally three rules:
Figure 12.1, R for Data Science
Of the examples above, which table is tidy?
Common problem with raw data: * Some of the column names are not names of variables, but values of a variable
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To make this dataset ready for analysis, we need to gather these columns into a new pair of variables. What pair of new variables are we seeking to generate?
This explanation of key-value pairs has been adapted from Garrett Grolemund’s Data Science with R, Section 2.2.1.
What is a key value pair? A key value pair is a simple way to record information. A pair contains two parts: a key that explains what the information describes, and a value that contains the actual information.
Examples of key:value pairs:
Country: Afghanistan
Country: Brazil
Country: China
Year: 1999
Year: 2000
Year: 2001
Population: 19987071
Population: 20595360
Population: 172006362
Population: 174504898
Population: 1272915272
Population: 1280428583
Cases: 745
Cases: 2666
Cases: 37737
Cases: 80488
Cases: 212258
Cases: 213766
What are the parameters of the dataset that we need to have to generate these new variables? * The set of columns in question: 1999
and 2000
* The name of the new variable we want to generate based on the column names (i.e. key): year
* The name of the new variable we want to generate based on the values spread over cells (i.e. values): cases
Let’s put these parameters together into gather()
.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Here are another few ways to do the same thing:
gather(table4a, key = "year", value = "cases", 2:3)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
gather(table4a, key = "year", value = "cases", c(2, 3))
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
gather(table4a, key = year, value = cases, -1)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Note that, under the hood, gather()
uses select()
to operate. Luckily, that’s a function with which we’re already quite familiar. Let’s do another example where we can see that gather()
is using the helper functions of select()
to make our lives easier.
# Glancing at the dataset we'll use for this example
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Implementing `gather()` using `contains()`, a helper function of `select()`
iris %>%
gather(key = flower_att, value = measurement, contains("Sepal"))
## Petal.Length Petal.Width Species flower_att measurement
## 1 1.4 0.2 setosa Sepal.Length 5.1
## 2 1.4 0.2 setosa Sepal.Length 4.9
## 3 1.3 0.2 setosa Sepal.Length 4.7
## 4 1.5 0.2 setosa Sepal.Length 4.6
## 5 1.4 0.2 setosa Sepal.Length 5.0
## 6 1.7 0.4 setosa Sepal.Length 5.4
## 7 1.4 0.3 setosa Sepal.Length 4.6
## 8 1.5 0.2 setosa Sepal.Length 5.0
## 9 1.4 0.2 setosa Sepal.Length 4.4
## 10 1.5 0.1 setosa Sepal.Length 4.9
## [ reached getOption("max.print") -- omitted 290 rows ]
A critical skill of using R is being able to interpret and understand the R documentation for various functions. Let’s see how our understanding of gather()
maps onto the gather()
documentation. Here’s another version of the same documentation.
What does the convert
option do?
What does the factor_key
option do?
stackoverflow.com is another resource, but can be tricky to use because you need to first sift through users’ posts to find a similar problem to yours and then understand the suggested solution(s).
Often this involves:
It’s helpful if you know what function you’re hoping to use because you can use the term in your search query.
Here is the first link that comes up with the above search: r - Gather multiple sets of columns - Stack Overflow
Here is another link that is further down in the list, but may prove to be more straightforward: How to specify multiple columns with gather() function to tidy data
Note that the first post of the thread describes a user’s problem. As a result, this code block is code that is not functioning as the user expected. Do not look at the first post for the solution, but rather compare their problem to yours.
The most helpful answers are upvoted by other users on stackoverflow.com. The number of upvotes are denoted on the left. When the user who posted the problem accepts a solution as the best among those suggested, a green check appears.