Long to wide and back again with gather() and spread()

November 6-7, 2018

Motivations and goals

Last week you saw how to use the gather() function to make your data “tidy”. Often you’ll get a dataset that’s in “wide” format and have to turn it into “long” format to do your analysis, and other times you’ll have to go the other way around. This is particularly the case when working with longitudinal datasets.

For example, a wide dataset may look like this. We could imagine this is from a trial where participants were or were not given a drug at two timepoints (x_1 and x_2), and their responses were also measured at two timepoints (y_1 and y_2). The dataset contains an id variable as well as the participants’ age at baseline age_bl. I completely made up this data. Don’t try to read anything into it.

id age_bl x_1 x_2 y_1 y_2
1 30 1 1 34 95
2 43 0 1 93 28
3 39 0 0 48 20
4 52 1 0 38 40

However, we might want it in long format for our analyses:

id age_bl timepoint x y
1 30 1 1 34
1 30 2 1 95
2 43 1 0 93
2 43 2 1 28
3 39 1 0 48
3 39 2 0 20
4 52 1 1 38
4 52 2 0 40

Or we might be given our data in long format and want to get it into wide format. Different types of analysis require different types of data structures, and it is very helpful to be able to go back and forth.

Today we’ll add to our knowledge about the gather() function, as well as learn spread() and separate() to help us go back and forth as needed.

Review: gather()

First let’s practice the gather() function to go from wide to long format data. First, as always, load the tidyverse, then create part of the mini dataset from above.

library(tidyverse)

dat <- tribble(
  ~id, ~age_bl, ~x_1, ~x_2, 
    1,      30,    1,    1, 
    2,      43,    0,    1, 
    3,      39,    0,    0, 
    4,      52,    1,    0, 
  )

You may want to get rid of the x_ from the values of the x variable. One way to do that is to take your long dataset and use %>% mutate(timepoint = factor(timepoint, labels = c(“1”, “2”))). There are lots of other ways too, and we’ll see another below! Now your goal is to get that data to look like the output below. Right now we have two columns for x; in a tidy/long dataset, we’d just want one. We’ll want to name that new variable x (it will have values of either 0 or 1), and then create a new variable timepoint, which will take on values of either x_1 or x_2. Look back at last week’s lesson or the function documentation if you need help.

dat %>% 
  gather(key = "timepoint", value = "x", x_1:x_2)
## # A tibble: 8 x 4
##      id age_bl timepoint     x
##   <dbl>  <dbl> <chr>     <dbl>
## 1     1     30 x_1           1
## 2     2     43 x_1           0
## 3     3     39 x_1           0
## 4     4     52 x_1           1
## 5     1     30 x_2           1
## 6     2     43 x_2           1
## 7     3     39 x_2           0
## 8     4     52 x_2           0

Introducing spread()

Now we’ll look at spread(), which is basically the inverse of gather(). I found the following animation to be super helpful – you may want to watch it a few times. It was created in R using the gganimate package. This awesome animation and more are available on GitHub via gadenbuie.

It may also be helpful to just look at the following still image to see what happens when we go back and forth using gather() and spread():

Using the fake dataset we just made long, let’s go back to wide format using spread().

longdat <- dat %>% 
  gather(key = "timepoint", value = "x", x_1:x_2)

longdat %>% 
  spread(key = "timepoint", value = "x")
## # A tibble: 4 x 4
##      id age_bl   x_1   x_2
##   <dbl>  <dbl> <dbl> <dbl>
## 1     1     30     1     1
## 2     2     43     0     1
## 3     3     39     0     0
## 4     4     52     1     0

We’re back where we started. Each value of the “key” variable was made its own column, with the values in that column taken from the “value” variable.

Data

From the documentation: This dataset describes a randomized clinical trial (Goldman et al., 1996) in which both survival and longitudinal data were collected to compare the efficacy and safety of two antiretroviral drugs, namely ddI (didanosine) and ddC (zalcitabine), in treating HIV-infected patients intolerant or failing zidovudine (AZT) therapy. Now let’s try this with some more complex data. First we’ll look at data from a clinical trial for drugs to treat AIDS from the 1990s. The data is from the joineR package, so you’ll have to install and load it before you can access the data with the data(aids) command. Take a second to explore it.

# install.packages("joineR")  # you'll probably need to run this line first
library(joineR)
data(aids)
head(aids)
##   id  time death       CD4 obstime drug gender prevOI         AZT
## 1  1 16.97     0 10.677078       0  ddC   male   AIDS intolerance
## 2  1 16.97     0  8.426150       6  ddC   male   AIDS intolerance
## 3  1 16.97     0  9.433981      12  ddC   male   AIDS intolerance
## 4  2 19.00     0  6.324555       0  ddI   male noAIDS intolerance
## 5  2 19.00     0  8.124038       6  ddI   male noAIDS intolerance
## 6  2 19.00     0  4.582576      12  ddI   male noAIDS intolerance

“Survival” data means that we’re measuring time to a certain event – a binary variable (e.g., death). Ideally that time is continuous, although it’s often measured discretely (e.g., in weeks or months). “Longitudinal” data means that we’re measuring the same thing over and over again over time. What we’re measuring could be any type of variable – here it’s continuous (log) CD4 count. Patients’ CD4 count was measured at every visit, which took place at obstime (which I think is in weeks, although it doesn’t say). They were assigned one of two drugs, and other baseline variables such as gender and reason for not being on the AZT drug are also included. The variable time refers to time of death or censoring (you can tell which by seeing whether the value of the variable death = 0 or 1).

Each patient has as many observations (i.e., rows) as visits. If you run table(aids$obstime), you’ll see that there were 467 patients at time 0, but only 34 made it to time 18, whether due to death or dropout.

We may want data in this format for certain types of analysis, or for making plots with “facets” (see previous office hours).

ggplot(aids) + 
  geom_histogram(aes(CD4, stat(density))) +
  facet_grid(~obstime) +
  ggtitle("CD4 count across study visits")

When doing a straightforward survival analysis like the one in your homework this week, you would use wide data, as you only need 1 row per observation, with the time to event and an indicator of event or censoring (and any baseline covariates). But if you have covariates that change over time, you’ll need that data in “long” format, with rows for a given person at each time the covariates change. But in other situations, we may want the data in wide format. For example, let’s say we want to fit a model predicting a person’s CD4 count at time 12 using their CD4 counts from times 0, 2, and 6 as predictors. We’ll need a column for each of those variables

Try using the spread() function to make variables for each person’s CD4 count at each of the five timepoints (note that some patients will having missing data at some of the timepoints). Think carefully about what you want your “key” to be, and what you want your “value” to be. You may also want to include the argument sep = "_" in the spread() function to avoid having numbers as variable names (try it with and without to see what that argument does).

wide_data <- aids %>% 
  spread(key = "obstime", value = "CD4", sep = "_")
head(wide_data)
##   id  time death drug gender prevOI         AZT obstime_0 obstime_2
## 1  1 16.97     0  ddC   male   AIDS intolerance 10.677078        NA
## 2  2 19.00     0  ddI   male noAIDS intolerance  6.324555        NA
## 3  3 18.53     1  ddI female   AIDS intolerance  3.464102  3.605551
## 4  4 12.70     0  ddC   male   AIDS     failure  3.872983  4.582576
## 5  5 15.13     0  ddI   male   AIDS     failure  7.280110  8.602325
## 6  6  1.90     1  ddC female   AIDS     failure  4.582576        NA
##   obstime_6 obstime_12 obstime_18
## 1  8.426150   9.433981         NA
## 2  8.124038   4.582576          5
## 3  6.164414         NA         NA
## 4  2.645751   1.732051         NA
## 5  8.602325   6.708204         NA
## 6        NA         NA         NA

Now if we want to run a regression using previous CD4 counts as predictors, we can.

lm(obstime_12 ~ obstime_0 + obstime_2 + obstime_6, data = wide_data)
## 
## Call:
## lm(formula = obstime_12 ~ obstime_0 + obstime_2 + obstime_6, 
##     data = wide_data)
## 
## Coefficients:
## (Intercept)    obstime_0    obstime_2    obstime_6  
##    -0.30295      0.06415      0.33850      0.47824

As we would expect, CD4 count at previous times is positively correlated with CD4 count at time 12, but to less of an extent as we look further back in time.

Now use the wide_data dataset and try to get back to the original dataset using gather(). Some of the arguments will be the same as those you used in spread(). What else do you need to include when using gather()?

long_data <- wide_data %>% 
  gather(key = "obstime", value = "CD4", obstime_0:obstime_18)
head(long_data)
##   id  time death drug gender prevOI         AZT   obstime       CD4
## 1  1 16.97     0  ddC   male   AIDS intolerance obstime_0 10.677078
## 2  2 19.00     0  ddI   male noAIDS intolerance obstime_0  6.324555
## 3  3 18.53     1  ddI female   AIDS intolerance obstime_0  3.464102
## 4  4 12.70     0  ddC   male   AIDS     failure obstime_0  3.872983
## 5  5 15.13     0  ddI   male   AIDS     failure obstime_0  7.280110
## 6  6  1.90     1  ddC female   AIDS     failure obstime_0  4.582576

Now try to make the data look exactly the same. That is, instead of the values obstime_0, obstime_2, etc., we want 0, 2, etc. (Also, can you get it back in the order it was in originally – that is, ordered by id, not by obstime?)

long_data %>%
  mutate(obstime = factor(obstime, labels = c(0, 2, 6, 12, 18)),
         obstime = as.numeric(as.character(obstime))) %>%
  arrange(id) %>%
  head()
##   id  time death drug gender prevOI         AZT obstime       CD4
## 1  1 16.97     0  ddC   male   AIDS intolerance       0 10.677078
## 2  1 16.97     0  ddC   male   AIDS intolerance      12        NA
## 3  1 16.97     0  ddC   male   AIDS intolerance      18  8.426150
## 4  1 16.97     0  ddC   male   AIDS intolerance       2  9.433981
## 5  1 16.97     0  ddC   male   AIDS intolerance       6        NA
## 6  2 19.00     0  ddI   male noAIDS intolerance       0  6.324555

My solution is a little convoluted because I wanted the obstime variable to be an actual number, not a factor. There are slightly easier ways to do this, but let’s look at a new function that will help us with rearranging data: separate(). It takes a column that contains two values and separates them into two columns. Last week you saw this example of a non-tidy dataset:

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

We could use separate() like this:

table3 %>% 
  separate(col = rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

The function took the single column rate and split it at the / into two new columns, cases and population.

We could do the same thing with the new long version of the aids dataset and the ugly obstime variable we created:

# what we used to have (reordered by id to make it clearer what we're doing)
long_data %>% arrange(id) %>% head()
##   id  time death drug gender prevOI         AZT    obstime       CD4
## 1  1 16.97     0  ddC   male   AIDS intolerance  obstime_0 10.677078
## 2  1 16.97     0  ddC   male   AIDS intolerance  obstime_2        NA
## 3  1 16.97     0  ddC   male   AIDS intolerance  obstime_6  8.426150
## 4  1 16.97     0  ddC   male   AIDS intolerance obstime_12  9.433981
## 5  1 16.97     0  ddC   male   AIDS intolerance obstime_18        NA
## 6  2 19.00     0  ddI   male noAIDS intolerance  obstime_0  6.324555
# what we have after separating
long_data %>% 
  arrange(id) %>%
  separate(col = obstime, into = c("remove", "new_obstime"), sep = "_") %>%
  head()
##   id  time death drug gender prevOI         AZT  remove new_obstime
## 1  1 16.97     0  ddC   male   AIDS intolerance obstime           0
## 2  1 16.97     0  ddC   male   AIDS intolerance obstime           2
## 3  1 16.97     0  ddC   male   AIDS intolerance obstime           6
## 4  1 16.97     0  ddC   male   AIDS intolerance obstime          12
## 5  1 16.97     0  ddC   male   AIDS intolerance obstime          18
## 6  2 19.00     0  ddI   male noAIDS intolerance obstime           0
##         CD4
## 1 10.677078
## 2        NA
## 3  8.426150
## 4  9.433981
## 5        NA
## 6  6.324555

I called the new variables remove and new_obstime because I’d get rid of the first one (how would you do that?) because it doesn’t give us any information, and because I wanted to make it clear that the new variable could have a new name (in real life I’d just call it obstime again).

More than one time-varying variable

This new function, separate(), is going to help us with our final goal. We want to make the original fake dataset long. But this differs from our other examples because we have two variables we need to deal with, x and y. (In the AIDS trial, for example, the patients got the same drug at all time points and so the only variable that differed across time was the CD4 count.)

Let’s make the data:

dat <- tribble(
  ~id, ~age_bl, ~x_1, ~x_2, ~y_1, ~y_2,
    1,      30,    1,    1,   34,   95,
    2,      43,    0,    1,   93,   28,
    3,      39,    0,    0,   48,   20,
    4,      52,    1,    0,   38,   40
  )

What will happen if you just use gather() as we’ve learned so far? (Try to guess before you run it yourself or look at the output.)

dat %>% 
  gather(key = "variable", value = "value", x_1:y_2) %>%
  arrange(id)
## # A tibble: 16 x 4
##       id age_bl variable value
##    <dbl>  <dbl> <chr>    <dbl>
##  1     1     30 x_1          1
##  2     1     30 x_2          1
##  3     1     30 y_1         34
##  4     1     30 y_2         95
##  5     2     43 x_1          0
##  6     2     43 x_2          1
##  7     2     43 y_1         93
##  8     2     43 y_2         28
##  9     3     39 x_1          0
## 10     3     39 x_2          0
## 11     3     39 y_1         48
## 12     3     39 y_2         20
## 13     4     52 x_1          1
## 14     4     52 x_2          0
## 15     4     52 y_1         38
## 16     4     52 y_2         40

So that’s not going to work. Our goal, as seen in the beginning, is to get an x column and a y column. But here we just have one column with values of both the variables mixed up. We’re going to solve this by using separate() to split up the variable name (x or y) and the timepoint (1 or 2). (I keep using arrange() just to make what we’re doing clearer, but it’s definitely not necessary.)

dat %>% 
  gather(key = "variable", value = "value", x_1:y_2) %>%
  separate(col = variable, into = c("variable", "timepoint")) %>% 
  arrange(id)
## # A tibble: 16 x 5
##       id age_bl variable timepoint value
##    <dbl>  <dbl> <chr>    <chr>     <dbl>
##  1     1     30 x        1             1
##  2     1     30 x        2             1
##  3     1     30 y        1            34
##  4     1     30 y        2            95
##  5     2     43 x        1             0
##  6     2     43 x        2             1
##  7     2     43 y        1            93
##  8     2     43 y        2            28
##  9     3     39 x        1             0
## 10     3     39 x        2             0
## 11     3     39 y        1            48
## 12     3     39 y        2            20
## 13     4     52 x        1             1
## 14     4     52 x        2             0
## 15     4     52 y        1            38
## 16     4     52 y        2            40

OK, now we have the timepoint variable we want. But the dataset is still “too long”. We don’t want to go back to the initial wide dataset, but we need to make it a little “wider”. We’ll use spread().

newdat <- dat %>% 
  gather(key = "variable", value = "value", x_1:y_2) %>%
  separate(col = variable, into = c("variable", "timepoint")) %>% 
  spread(key = "variable", value = "value") %>%
  arrange(id)
head(newdat)
## # A tibble: 6 x 5
##      id age_bl timepoint     x     y
##   <dbl>  <dbl> <chr>     <dbl> <dbl>
## 1     1     30 1             1    34
## 2     1     30 2             1    95
## 3     2     43 1             0    93
## 4     2     43 2             1    28
## 5     3     39 1             0    48
## 6     3     39 2             0    20

This isn’t necessarily a good way to display the data; I’m just showing you what’s possible! Perfect! Now we can do things like graph our data to compare across the two time points:

ggplot(newdat) +
  geom_boxplot(aes(factor(x), y)) +
  facet_grid(~timepoint)

ggplot(newdat, aes(timepoint, y)) +
  geom_point(aes(col = factor(x))) +
  geom_line(aes(group = id))

Wrap-up

If you are confused by how these functions work, don’t worry, so is everyone else. That’s why the animation came to be, for one. I also had to look on StackOverflow several times while writing this up! Every time I use these functions, I have to remind myself how they work. BUT this is one of the most useful workflows you’ll come across, in my opinion. Almost every time I graph something I use gather() because I want to facet it or color it in some way. So just practice on your own data until you get it straight, and until then, there are plenty of R users around the world with whom you can commiserate!

Me literally every time is use gather() and spread() #tidyverse #rstats pic.twitter.com/EwxR0o3VSO

— Grad School Imposter (@darinself) February 9, 2018

tidyr is fantastic, but I'd be lying if I didn't have to do ?gather and ?spread at least once a week #rstats

— Brock Tibert (@BrockTibert) June 20, 2018

Me guessing if gather() or spread() is the right choice. Literally every time I need it. #rstats #tidyverse pic.twitter.com/xWQmNhrv14

— Cédric Scherer 🧐 (@CedScherer) October 8, 2018

honest #rstats: I try seemingly random combinations of terms until gather() or spread() behave the way I want.

— Peter Smits (@PeterDSmits) June 20, 2018

I don't wanna brag or nuthin but I just successfully used gather() from tidyr without having to google anything for the first time ever. #rstats

— Andrew Kniss 🌱 (@WyoWeeds) March 9, 2018