Long to wide and back again with gather() and spread()

November 6-7, 2018

Motivations and goals

Last week you saw how to use the gather() function to make your data “tidy”. Often you’ll get a dataset that’s in “wide” format and have to turn it into “long” format to do your analysis, and other times you’ll have to go the other way around. This is particularly the case when working with longitudinal datasets.

For example, a wide dataset may look like this. We could imagine this is from a trial where participants were or were not given a drug at two timepoints (x_1 and x_2), and their responses were also measured at two timepoints (y_1 and y_2). The dataset contains an id variable as well as the participants’ age at baseline (age_bl). I completely made up this data. Don’t try to read anything into it.

id age_bl x_1 x_2 y_1 y_2
1 30 1 1 34 95
2 43 0 1 93 28
3 39 0 0 48 20
4 52 1 0 38 40

However, we might want it in long format for our analyses:

id age_bl timepoint x y
1 30 1 1 34
1 30 2 1 95
2 43 1 0 93
2 43 2 1 28
3 39 1 0 48
3 39 2 0 20
4 52 1 1 38
4 52 2 0 40

Or we might be given our data in long format and want to get it into wide format. Different types of analysis require different types of data structures, and it is very helpful to be able to go back and forth.

Today we’ll add to our knowledge about the gather() function, as well as learn spread() and separate() to help us go back and forth as needed.

Review: gather()

Let’s start by practicing the gather() function to go from wide to long format data. As always, load the tidyverse first, then create part of the mini dataset from above.

library(tidyverse)

dat <- tribble(
  ~id, ~age_bl, ~x_1, ~x_2,
    1,      30,    1,    1,
    2,      43,    0,    1,
    3,      39,    0,    0,
    4,      52,    1,    0
)

Now your goal is to get that data to look like the output below. Right now we have two columns for x; in a tidy/long dataset, we’d just want one. We’ll name that new variable x (it will have values of either 0 or 1), and create a new variable timepoint, which will take on values of either x_1 or x_2. Look back at last week’s lesson or the function documentation if you need help.

You may want to get rid of the x_ from the values of the timepoint variable. One way to do that is to take your long dataset and use %>% mutate(timepoint = factor(timepoint, labels = c("1", "2"))). There are lots of other ways too, and we’ll see another below!

dat %>% 
  gather(key = "timepoint", value = "x", x_1:x_2)
## # A tibble: 8 x 4
##      id age_bl timepoint     x
##   <dbl>  <dbl> <chr>     <dbl>
## 1     1     30 x_1           1
## 2     2     43 x_1           0
## 3     3     39 x_1           0
## 4     4     52 x_1           1
## 5     1     30 x_2           1
## 6     2     43 x_2           1
## 7     3     39 x_2           0
## 8     4     52 x_2           0
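
The tip about dropping the x_ prefix can be sketched as follows. This is just one way to do it; the name dat_long is made up for this example, and the dataset is recreated so the code runs on its own.

library(tidyverse)

# Recreate the mini dataset so this example is self-contained
dat <- tribble(
  ~id, ~age_bl, ~x_1, ~x_2,
    1,      30,    1,    1,
    2,      43,    0,    1,
    3,      39,    0,    0,
    4,      52,    1,    0
)

# dat_long is a hypothetical name, not something defined elsewhere in the lesson
dat_long <- dat %>%
  gather(key = "timepoint", value = "x", x_1:x_2) %>%
  # factor() sorts the levels alphabetically (x_1, then x_2),
  # so the labels "1" and "2" are assigned in that order
  mutate(timepoint = factor(timepoint, labels = c("1", "2")))

dat_long

Note that timepoint is now a factor rather than a character variable, which may or may not be what you want for your analysis.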

Introducing spread()

Now we’ll look at spread(), which is basically the inverse of gather(). I found the following animation to be super helpful – you may want to watch it a few times. It was created in R using the gganimate package. This awesome animation and more are available on GitHub via gadenbuie.
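
In code, the round trip might look like the sketch below: gather() the x columns into long format, then spread() them back out. The names dat_long and dat_wide are made up for this example, and the dataset is recreated so the code runs on its own.

library(tidyverse)

# Recreate the mini dataset so this example is self-contained
dat <- tribble(
  ~id, ~age_bl, ~x_1, ~x_2,
    1,      30,    1,    1,
    2,      43,    0,    1,
    3,      39,    0,    0,
    4,      52,    1,    0
)

# Wide to long: one row per person per timepoint
dat_long <- dat %>%
  gather(key = "timepoint", value = "x", x_1:x_2)

# Long to wide: the values of timepoint become column names,
# and the values of x fill those columns
dat_wide <- dat_long %>%
  spread(key = "timepoint", value = "x")

dat_wide

Here the key argument names the column whose values become the new column names, and value names the column that supplies the cell contents, which mirrors the arguments used in gather().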