gather()
and spread()
Last week you saw how to use the gather()
function to make your data “tidy”. Often you’ll get a dataset that’s in “wide” format and have to turn it into “long” format to do your analysis, and other times you’ll have to go the other way around. This is particularly the case when working with longitudinal datasets.
For example, a wide dataset may look like this. We could imagine this is from a trial where participants were or were not given a drug at two timepoints (x_1
and x_2
), and their responses were also measured at two timepoints (y_1
and y_2
). The dataset contains an id
variable as well as the participants’ age at baseline age_bl
. I completely made up this data. Don’t try to read anything into it.
id | age_bl | x_1 | x_2 | y_1 | y_2 |
---|---|---|---|---|---|
1 | 30 | 1 | 1 | 34 | 95 |
2 | 43 | 0 | 1 | 93 | 28 |
3 | 39 | 0 | 0 | 48 | 20 |
4 | 52 | 1 | 0 | 38 | 40 |
However, we might want it in long format for our analyses:
id | age_bl | timepoint | x | y |
---|---|---|---|---|
1 | 30 | 1 | 1 | 34 |
1 | 30 | 2 | 1 | 95 |
2 | 43 | 1 | 0 | 93 |
2 | 43 | 2 | 1 | 28 |
3 | 39 | 1 | 0 | 48 |
3 | 39 | 2 | 0 | 20 |
4 | 52 | 1 | 1 | 38 |
4 | 52 | 2 | 0 | 40 |
Or we might be given our data in long format and want to get it into wide format. Different types of analysis require different types of data structures, and it is very helpful to be able to go back and forth.
Today we’ll add to our knowledge about the gather()
function, as well as learn spread()
and separate()
to help us go back and forth as needed.
gather()
First let’s practice the gather()
function to go from wide to long format data. First, as always, load the tidyverse
, then create part of the mini dataset from above.
library(tidyverse)
dat <- tribble(
~id, ~age_bl, ~x_1, ~x_2,
1, 30, 1, 1,
2, 43, 0, 1,
3, 39, 0, 0,
4, 52, 1, 0,
)
You may want to get rid of the x_
from the values of the x
variable. One way to do that is to take your long dataset and use %>% mutate(timepoint = factor(timepoint, labels = c(“1”, “2”)))
. There are lots of other ways too, and we’ll see another below! Now your goal is to get that data to look like the output below. Right now we have two columns for x
; in a tidy/long dataset, we’d just want one. We’ll want to name that new variable x
(it will have values of either 0 or 1), and then create a new variable timepoint
, which will take on values of either x_1
or x_2
. Look back at last week’s lesson or the function documentation if you need help.
dat %>%
gather(key = "timepoint", value = "x", x_1:x_2)
## # A tibble: 8 x 4
## id age_bl timepoint x
## <dbl> <dbl> <chr> <dbl>
## 1 1 30 x_1 1
## 2 2 43 x_1 0
## 3 3 39 x_1 0
## 4 4 52 x_1 1
## 5 1 30 x_2 1
## 6 2 43 x_2 1
## 7 3 39 x_2 0
## 8 4 52 x_2 0
spread()
Now we’ll look at spread()
, which is basically the inverse of gather()
. I found the following animation to be super helpful – you may want to watch it a few times. It was created in R using the gganimate
package. This awesome animation and more are available on GitHub via gadenbuie
.
It may also be helpful to just look at the following still image to see what happens when we go back and forth using gather()
and spread()
:
Using the fake dataset we just made long, let’s go back to wide format using spread()
.
longdat <- dat %>%
gather(key = "timepoint", value = "x", x_1:x_2)
longdat %>%
spread(key = "timepoint", value = "x")
## # A tibble: 4 x 4
## id age_bl x_1 x_2
## <dbl> <dbl> <dbl> <dbl>
## 1 1 30 1 1
## 2 2 43 0 1
## 3 3 39 0 0
## 4 4 52 1 0
We’re back where we started. Each value of the “key” variable was made its own column, with the values in that column taken from the “value” variable.
From the documentation: This dataset describes a randomized clinical trial (Goldman et al., 1996) in which both survival and longitudinal data were collected to compare the efficacy and safety of two antiretroviral drugs, namely ddI (didanosine) and ddC (zalcitabine), in treating HIV-infected patients intolerant or failing zidovudine (AZT) therapy. Now let’s try this with some more complex data. First we’ll look at data from a clinical trial for drugs to treat AIDS from the 1990s. The data is from the joineR
package, so you’ll have to install and load it before you can access the data with the data(aids)
command. Take a second to explore it.
# install.packages("joineR") # you'll probably need to run this line first
library(joineR)
data(aids)
head(aids)
## id time death CD4 obstime drug gender prevOI AZT
## 1 1 16.97 0 10.677078 0 ddC male AIDS intolerance
## 2 1 16.97 0 8.426150 6 ddC male AIDS intolerance
## 3 1 16.97 0 9.433981 12 ddC male AIDS intolerance
## 4 2 19.00 0 6.324555 0 ddI male noAIDS intolerance
## 5 2 19.00 0 8.124038 6 ddI male noAIDS intolerance
## 6 2 19.00 0 4.582576 12 ddI male noAIDS intolerance
“Survival” data means that we’re measuring time to a certain event – a binary variable (e.g., death). Ideally that time is continuous, although it’s often measured discretely (e.g., in weeks or months). “Longitudinal” data means that we’re measuring the same thing over and over again over time. What we’re measuring could be any type of variable – here it’s continuous (log) CD4 count. Patients’ CD4 count was measured at every visit, which took place at obstime
(which I think is in weeks, although it doesn’t say). They were assigned one of two drugs, and other baseline variables such as gender and reason for not being on the AZT drug are also included. The variable time
refers to time of death or censoring (you can tell which by seeing whether the value of the variable death
= 0 or 1).
Each patient has as many observations (i.e., rows) as visits. If you run table(aids$obstime)
, you’ll see that there were 467 patients at time 0, but only 34 made it to time 18, whether due to death or dropout.
We may want data in this format for certain types of analysis, or for making plots with “facets” (see previous office hours).
ggplot(aids) +
geom_histogram(aes(CD4, stat(density))) +
facet_grid(~obstime) +
ggtitle("CD4 count across study visits")
When doing a straightforward survival analysis like the one in your homework this week, you would use wide data, as you only need 1 row per observation, with the time to event and an indicator of event or censoring (and any baseline covariates). But if you have covariates that change over time, you’ll need that data in “long” format, with rows for a given person at each time the covariates change. But in other situations, we may want the data in wide format. For example, let’s say we want to fit a model predicting a person’s CD4 count at time 12 using their CD4 counts from times 0, 2, and 6 as predictors. We’ll need a column for each of those variables
Try using the spread()
function to make variables for each person’s CD4 count at each of the five timepoints (note that some patients will having missing data at some of the timepoints). Think carefully about what you want your “key” to be, and what you want your “value” to be. You may also want to include the argument sep = "_"
in the spread()
function to avoid having numbers as variable names (try it with and without to see what that argument does).
wide_data <- aids %>%
spread(key = "obstime", value = "CD4", sep = "_")
head(wide_data)
## id time death drug gender prevOI AZT obstime_0 obstime_2
## 1 1 16.97 0 ddC male AIDS intolerance 10.677078 NA
## 2 2 19.00 0 ddI male noAIDS intolerance 6.324555 NA
## 3 3 18.53 1 ddI female AIDS intolerance 3.464102 3.605551
## 4 4 12.70 0 ddC male AIDS failure 3.872983 4.582576
## 5 5 15.13 0 ddI male AIDS failure 7.280110 8.602325
## 6 6 1.90 1 ddC female AIDS failure 4.582576 NA
## obstime_6 obstime_12 obstime_18
## 1 8.426150 9.433981 NA
## 2 8.124038 4.582576 5
## 3 6.164414 NA NA
## 4 2.645751 1.732051 NA
## 5 8.602325 6.708204 NA
## 6 NA NA NA
Now if we want to run a regression using previous CD4 counts as predictors, we can.
lm(obstime_12 ~ obstime_0 + obstime_2 + obstime_6, data = wide_data)
##
## Call:
## lm(formula = obstime_12 ~ obstime_0 + obstime_2 + obstime_6,
## data = wide_data)
##
## Coefficients:
## (Intercept) obstime_0 obstime_2 obstime_6
## -0.30295 0.06415 0.33850 0.47824
As we would expect, CD4 count at previous times is positively correlated with CD4 count at time 12, but to less of an extent as we look further back in time.
Now use the wide_data
dataset and try to get back to the original dataset using gather()
. Some of the arguments will be the same as those you used in spread()
. What else do you need to include when using gather()
?
long_data <- wide_data %>%
gather(key = "obstime", value = "CD4", obstime_0:obstime_18)
head(long_data)
## id time death drug gender prevOI AZT obstime CD4
## 1 1 16.97 0 ddC male AIDS intolerance obstime_0 10.677078
## 2 2 19.00 0 ddI male noAIDS intolerance obstime_0 6.324555
## 3 3 18.53 1 ddI female AIDS intolerance obstime_0 3.464102
## 4 4 12.70 0 ddC male AIDS failure obstime_0 3.872983
## 5 5 15.13 0 ddI male AIDS failure obstime_0 7.280110
## 6 6 1.90 1 ddC female AIDS failure obstime_0 4.582576
Now try to make the data look exactly the same. That is, instead of the values obstime_0
, obstime_2
, etc., we want 0
, 2
, etc. (Also, can you get it back in the order it was in originally – that is, ordered by id
, not by obstime
?)
long_data %>%
mutate(obstime = factor(obstime, labels = c(0, 2, 6, 12, 18)),
obstime = as.numeric(as.character(obstime))) %>%
arrange(id) %>%
head()
## id time death drug gender prevOI AZT obstime CD4
## 1 1 16.97 0 ddC male AIDS intolerance 0 10.677078
## 2 1 16.97 0 ddC male AIDS intolerance 12 NA
## 3 1 16.97 0 ddC male AIDS intolerance 18 8.426150
## 4 1 16.97 0 ddC male AIDS intolerance 2 9.433981
## 5 1 16.97 0 ddC male AIDS intolerance 6 NA
## 6 2 19.00 0 ddI male noAIDS intolerance 0 6.324555
My solution is a little convoluted because I wanted the obstime
variable to be an actual number, not a factor. There are slightly easier ways to do this, but let’s look at a new function that will help us with rearranging data: separate()
. It takes a column that contains two values and separates them into two columns. Last week you saw this example of a non-tidy dataset:
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
We could use separate()
like this:
table3 %>%
separate(col = rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
The function took the single column rate
and split it at the /
into two new columns, cases
and population
.
We could do the same thing with the new long version of the aids
dataset and the ugly obstime
variable we created:
# what we used to have (reordered by id to make it clearer what we're doing)
long_data %>% arrange(id) %>% head()
## id time death drug gender prevOI AZT obstime CD4
## 1 1 16.97 0 ddC male AIDS intolerance obstime_0 10.677078
## 2 1 16.97 0 ddC male AIDS intolerance obstime_2 NA
## 3 1 16.97 0 ddC male AIDS intolerance obstime_6 8.426150
## 4 1 16.97 0 ddC male AIDS intolerance obstime_12 9.433981
## 5 1 16.97 0 ddC male AIDS intolerance obstime_18 NA
## 6 2 19.00 0 ddI male noAIDS intolerance obstime_0 6.324555
# what we have after separating
long_data %>%
arrange(id) %>%
separate(col = obstime, into = c("remove", "new_obstime"), sep = "_") %>%
head()
## id time death drug gender prevOI AZT remove new_obstime
## 1 1 16.97 0 ddC male AIDS intolerance obstime 0
## 2 1 16.97 0 ddC male AIDS intolerance obstime 2
## 3 1 16.97 0 ddC male AIDS intolerance obstime 6
## 4 1 16.97 0 ddC male AIDS intolerance obstime 12
## 5 1 16.97 0 ddC male AIDS intolerance obstime 18
## 6 2 19.00 0 ddI male noAIDS intolerance obstime 0
## CD4
## 1 10.677078
## 2 NA
## 3 8.426150
## 4 9.433981
## 5 NA
## 6 6.324555
I called the new variables remove
and new_obstime
because I’d get rid of the first one (how would you do that?) because it doesn’t give us any information, and because I wanted to make it clear that the new variable could have a new name (in real life I’d just call it obstime
again).
This new function, separate()
, is going to help us with our final goal. We want to make the original fake dataset long. But this differs from our other examples because we have two variables we need to deal with, x
and y
. (In the AIDS trial, for example, the patients got the same drug at all time points and so the only variable that differed across time was the CD4 count.)
Let’s make the data:
dat <- tribble(
~id, ~age_bl, ~x_1, ~x_2, ~y_1, ~y_2,
1, 30, 1, 1, 34, 95,
2, 43, 0, 1, 93, 28,
3, 39, 0, 0, 48, 20,
4, 52, 1, 0, 38, 40
)
What will happen if you just use gather()
as we’ve learned so far? (Try to guess before you run it yourself or look at the output.)
dat %>%
gather(key = "variable", value = "value", x_1:y_2) %>%
arrange(id)
## # A tibble: 16 x 4
## id age_bl variable value
## <dbl> <dbl> <chr> <dbl>
## 1 1 30 x_1 1
## 2 1 30 x_2 1
## 3 1 30 y_1 34
## 4 1 30 y_2 95
## 5 2 43 x_1 0
## 6 2 43 x_2 1
## 7 2 43 y_1 93
## 8 2 43 y_2 28
## 9 3 39 x_1 0
## 10 3 39 x_2 0
## 11 3 39 y_1 48
## 12 3 39 y_2 20
## 13 4 52 x_1 1
## 14 4 52 x_2 0
## 15 4 52 y_1 38
## 16 4 52 y_2 40
So that’s not going to work. Our goal, as seen in the beginning, is to get an x
column and a y
column. But here we just have one column with values of both the variables mixed up. We’re going to solve this by using separate()
to split up the variable name (x
or y
) and the timepoint (1
or 2
). (I keep using arrange()
just to make what we’re doing clearer, but it’s definitely not necessary.)
dat %>%
gather(key = "variable", value = "value", x_1:y_2) %>%
separate(col = variable, into = c("variable", "timepoint")) %>%
arrange(id)
## # A tibble: 16 x 5
## id age_bl variable timepoint value
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 30 x 1 1
## 2 1 30 x 2 1
## 3 1 30 y 1 34
## 4 1 30 y 2 95
## 5 2 43 x 1 0
## 6 2 43 x 2 1
## 7 2 43 y 1 93
## 8 2 43 y 2 28
## 9 3 39 x 1 0
## 10 3 39 x 2 0
## 11 3 39 y 1 48
## 12 3 39 y 2 20
## 13 4 52 x 1 1
## 14 4 52 x 2 0
## 15 4 52 y 1 38
## 16 4 52 y 2 40
OK, now we have the timepoint
variable we want. But the dataset is still “too long”. We don’t want to go back to the initial wide dataset, but we need to make it a little “wider”. We’ll use spread()
.
newdat <- dat %>%
gather(key = "variable", value = "value", x_1:y_2) %>%
separate(col = variable, into = c("variable", "timepoint")) %>%
spread(key = "variable", value = "value") %>%
arrange(id)
head(newdat)
## # A tibble: 6 x 5
## id age_bl timepoint x y
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 30 1 1 34
## 2 1 30 2 1 95
## 3 2 43 1 0 93
## 4 2 43 2 1 28
## 5 3 39 1 0 48
## 6 3 39 2 0 20
This isn’t necessarily a good way to display the data; I’m just showing you what’s possible! Perfect! Now we can do things like graph our data to compare across the two time points:
ggplot(newdat) +
geom_boxplot(aes(factor(x), y)) +
facet_grid(~timepoint)
ggplot(newdat, aes(timepoint, y)) +
geom_point(aes(col = factor(x))) +
geom_line(aes(group = id))
If you are confused by how these functions work, don’t worry, so is everyone else. That’s why the animation came to be, for one. I also had to look on StackOverflow several times while writing this up! Every time I use these functions, I have to remind myself how they work. BUT this is one of the most useful workflows you’ll come across, in my opinion. Almost every time I graph something I use gather()
because I want to facet it or color it in some way. So just practice on your own data until you get it straight, and until then, there are plenty of R users around the world with whom you can commiserate!
Me literally every time is use gather() and spread() #tidyverse #rstats pic.twitter.com/EwxR0o3VSO
— Grad School Imposter (@darinself) February 9, 2018
tidyr is fantastic, but I'd be lying if I didn't have to do ?gather and ?spread at least once a week #rstats
— Brock Tibert (@BrockTibert) June 20, 2018
Me guessing if gather() or spread() is the right choice. Literally every time I need it. #rstats #tidyverse pic.twitter.com/xWQmNhrv14
— Cédric Scherer 🧐 (@CedScherer) October 8, 2018
honest #rstats: I try seemingly random combinations of terms until gather() or spread() behave the way I want.
— Peter Smits (@PeterDSmits) June 20, 2018
I don't wanna brag or nuthin but I just successfully used gather() from tidyr without having to google anything for the first time ever. #rstats
— Andrew Kniss 🌱 (@WyoWeeds) March 9, 2018