Merging data with _join() functions

November 27-28, 2018

Motivations and goals

You’ll likely have to merge multiple datasets at some point in your work. For example, I’ve worked with intergenerational datasets – one dataset for parents, another for their children – that are linked via the parents’ ids. When I’m looking at, for example, a parental exposure and a child outcome, those datasets need to be merged.

Other times you may find yourself with several datasets with different variables about the same people. For example, I’m working on a project that uses registry data from Sweden. They have about a million different registries – births, patient records, education results, census data, etc. – that can all be merged together to create a rich set of information about everyone in my cohort.

Other times you may just end up with two datasets because of how you coded something, and you need to link them back together! Different situations call for different types of “joins.” In the tidyverse, these functions are all suffixed with _join(). We’ll cover a few of them today.

Data

Again we’ll be using data from the fivethirtyeight package, in this case a compilation of datasets about reporting of Hurricane Maria in Puerto Rico last year (articles here and here). You can find brief descriptions of the various datasets here. Let’s load them. (We won’t actually be using all of them, but you can play around with the others if you wish!)

rm(list = ls())
library(tidyverse)
library(fivethirtyeight)
data("google_trends")
data("mediacloud_hurricanes")
data("mediacloud_online_news")
data("mediacloud_states")
data("mediacloud_trump")
data("tv_hurricanes")
data("tv_hurricanes_by_network")
data("tv_states")

And take a quick look at them (don’t worry about how this code works; it’s not important!).

walk(ls(), ~ {
  print(.)
  glimpse(get(.))
  cat("\n")
})
## [1] "google_trends"
## Observations: 37
## Variables: 5
## $ date                <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-...
## $ hurricane_harvey_us <int> 0, 0, 0, 1, 9, 29, 44, 18, 17, 16, 13, 10,...
## $ hurricane_irma_us   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 14, 13...
## $ hurricane_maria_us  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hurricane_jose_us   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## 
## [1] "mediacloud_hurricanes"
## Observations: 38
## Variables: 5
## $ date   <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2017-0...
## $ harvey <int> 0, 0, 4, 6, 309, 1348, 995, 960, 1956, 2283, 1603, 1503...
## $ irma   <int> 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 61, 66, 36, 52, 221, 7...
## $ maria  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ jose   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31, ...
## 
## [1] "mediacloud_online_news"
## Observations: 49
## Variables: 2
## $ name <chr> "al.com", "bleacherreport.com", "Business Insider", "Buzz...
## $ url  <chr> "http://www.al.com", "http://bleacherreport.com", "http:/...
## 
## [1] "mediacloud_states"
## Observations: 51
## Variables: 4
## $ date        <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2...
## $ texas       <int> 295, 998, 696, 1130, 1851, 2693, 1953, 2962, 5072,...
## $ puerto_rico <int> 32, 24, 32, 36, 135, 39, 15, 11, 42, 14, 42, 40, 1...
## $ florida     <int> 239, 527, 577, 596, 933, 533, 199, 211, 569, 508, ...
## 
## [1] "mediacloud_trump"
## Observations: 51
## Variables: 7
## $ date                  <date> 2017-08-20, 2017-08-21, 2017-08-22, 201...
## $ puerto_rico           <int> 0, 1, 0, 1, 8, 2, 0, 0, 6, 1, 0, 2, 0, 0...
## $ puerto_rico_and_trump <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ florida               <int> 18, 41, 30, 39, 75, 28, 13, 13, 40, 38, ...
## $ florida_and_trump     <int> 2, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 1, 0...
## $ texas                 <int> 12, 79, 39, 83, 143, 218, 212, 223, 289,...
## $ texas_and_trump       <int> 0, 0, 0, 6, 3, 11, 3, 38, 41, 133, 22, 1...
## 
## [1] "tv_hurricanes"
## Observations: 37
## Variables: 5
## $ date   <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2017-0...
## $ harvey <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.1259, 0.7544, 0.8878,...
## $ irma   <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
## $ maria  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ jose   <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
## 
## [1] "tv_hurricanes_by_network"
## Observations: 84
## Variables: 6
## $ date     <date> 2017-09-17, 2017-09-18, 2017-09-19, 2017-09-20, 2017...
## $ query    <fct> Hurricane Maria, Hurricane Maria, Hurricane Maria, Hu...
## $ bbc_news <dbl> 0.0000, 0.2204, 0.8901, 0.9880, 0.4756, 0.2761, 0.399...
## $ cnn      <dbl> 0.0290, 0.1866, 0.5818, 0.7985, 0.6454, 0.2778, 0.144...
## $ fox_news <dbl> 0.0079, 0.1877, 0.1779, 0.5387, 0.3177, 0.1211, 0.059...
## $ msnbc    <dbl> 0.0208, 0.0473, 0.2621, 0.4835, 0.2406, 0.1998, 0.067...
## 
## [1] "tv_states"
## Observations: 52
## Variables: 4
## $ date        <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2...
## $ florida     <dbl> 0.0923, 0.0441, 0.0359, 0.0431, 0.0492, 0.0442, 0....
## $ texas       <dbl> 0.0286, 0.0904, 0.0334, 0.0839, 0.3978, 1.4639, 2....
## $ puerto_rico <dbl> 0.0000, 0.0099, 0.0000, 0.0103, 0.0184, 0.0000, 0....

Let’s imagine we want to merge mediacloud_hurricanes (the number of sentences per day that mention Hurricanes Harvey, Irma, Jose, and Maria in online news) and tv_hurricanes (the percentage of sentences per day in TV news that mention the hurricanes). Maybe we want to see whether hurricane news got earlier traction online vs. on TV.

Let’s first think about what we want. Our ideal dataset might be long, with one column for each of the four hurricanes and an indicator for which type of news the data is from (each row is a unique date/news-type combination); it might be longer, in which each hurricane also gets its own row; or it might be wide, with eight columns, one for each hurricane/news-type combination (each row is a unique date).

Join types

full_join()

Let’s explore full_join(). This will merge two datasets and keep all of the observations from each one, even if there’s no match. For example, we can see that the mediacloud dataset extends one day longer than the TV dataset:

range(mediacloud_hurricanes$date)
## [1] "2017-08-20" "2017-09-26"
range(tv_hurricanes$date)
## [1] "2017-08-20" "2017-09-25"

That means that when we full_join() them, we’ll get the entire range of dates:

full <- full_join(mediacloud_hurricanes, tv_hurricanes)
range(full$date)
## [1] "2017-08-20" "2017-09-26"
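What happens to rows that have no match? full_join() keeps them and fills the columns from the other dataset with NA. Here’s a minimal sketch with made-up tibbles (x, y, and their contents are hypothetical, just for illustration):

```r
library(dplyr)

x <- tibble(day = c(1, 2), online = c(10, 20))
y <- tibble(day = c(2, 3), tv = c(0.5, 0.7))

full_join(x, y, by = "day")
# day 1 appears only in x, so its tv value is NA;
# day 3 appears only in y, so its online value is NA
```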

left_join() and right_join()

This is in contrast to the one-sided joins, which keep all of the observations from only one of the two datasets (plus whatever matches them from the other). For example, since the TV dataset only goes through 2017-09-25, when we do right_join(mediacloud_hurricanes, tv_hurricanes), we’ll get only those same dates:

right <- right_join(mediacloud_hurricanes, tv_hurricanes)
range(right$date)
## [1] "2017-08-20" "2017-09-25"

Joining from the left, however, means that all of the mediacloud dataset’s dates will be kept – which here means the entire range of dates:

left <- left_join(mediacloud_hurricanes, tv_hurricanes)
range(left$date)
## [1] "2017-08-20" "2017-09-26"

inner_join()

Finally, inner_join() is like a combination of both of the one-sided joins: it only returns rows where there’s a matching value in both datasets.

inner <- inner_join(mediacloud_hurricanes, tv_hurricanes)
range(inner$date)
## [1] "2017-08-20" "2017-08-21"

Whoa! To understand what happened here, we’ll have to look more deeply into how datasets are being joined in the next section.

First, however, check out these animations (from the same source as those for gather() and spread() in week 10), which help distinguish between these types of joins.

The by = argument

We haven’t yet looked at what’s actually happening when we join, however. Let’s start with the first five days from each dataset:

mediacloud_hurricanes %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 5
##   date       harvey  irma maria  jose
##   <date>      <int> <int> <int> <int>
## 1 2017-08-20      0     0     0     0
## 2 2017-08-21      0     0     0     0
## 3 2017-08-22      4     1     0     0
## 4 2017-08-23      6     1     0     0
## 5 2017-08-24    309     0     0     0
tv_hurricanes %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 5
##   date       harvey  irma maria  jose
##   <date>      <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20  0         0     0     0
## 2 2017-08-21  0         0     0     0
## 3 2017-08-22  0         0     0     0
## 4 2017-08-23  0         0     0     0
## 5 2017-08-24  0.126     0     0     0

And from the dataset we made using full_join():

full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 8 x 5
##   date        harvey  irma maria  jose
##   <date>       <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20   0         0     0     0
## 2 2017-08-21   0         0     0     0
## 3 2017-08-22   4         1     0     0
## 4 2017-08-22   0         0     0     0
## 5 2017-08-23   6         1     0     0
## 6 2017-08-23   0         0     0     0
## 7 2017-08-24 309         0     0     0
## 8 2017-08-24   0.126     0     0     0

What’s going on here?

When we use one of the _join() functions without further arguments, it will attempt to merge the two datasets on all of the column names that they share – in our case, all of them. In fact, it prints a message telling you which columns it used:

full <- full_join(mediacloud_hurricanes, tv_hurricanes)
## Joining, by = c("date", "harvey", "irma", "maria", "jose")

Rows are combined only when they match on every one of those shared columns; when any value differs, each row is kept separately. So the rows from 2017-08-20 and 2017-08-21 are fully combined, since they share all the same values in each dataset: the date and zeros for all the hurricanes.

For each of the other dates, however, the values don’t match across the board. On 2017-08-22, the TV dataset still has zeros everywhere, whereas both Hurricanes Harvey and Irma have been mentioned on online news: they get values of 4 and 1 respectively in the mediacloud dataset.
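To see this mechanism in isolation, here’s a toy sketch (made-up data): two tibbles that share both of their column names, joined without a by = argument, so both columns are treated as keys:

```r
library(dplyr)

a <- tibble(date = c("08-20", "08-22"), n = c(0, 4))
b <- tibble(date = c("08-20", "08-22"), n = c(0, 0))

full_join(a, b)  # Joining, by = c("date", "n")
# the "08-20" rows agree on both columns, so they merge into one row;
# the "08-22" rows disagree on n, so each is kept: 3 rows in total
```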

That’s why when we performed an inner join, only the two completely matching observations remained:

inner_join(mediacloud_hurricanes, tv_hurricanes)
## Joining, by = c("date", "harvey", "irma", "maria", "jose")
## # A tibble: 2 x 5
##   date       harvey  irma maria  jose
##   <date>      <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20      0     0     0     0
## 2 2017-08-21      0     0     0     0

Generally we want to join by a “key” – a value that is unique to an observation and is shared across datasets. Here the key is the date, so we should use the argument by = "date":

full <- full_join(mediacloud_hurricanes, tv_hurricanes, by = "date")
full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 9
##   date       harvey.x irma.x maria.x jose.x harvey.y irma.y maria.y jose.y
##   <date>        <int>  <int>   <int>  <int>    <dbl>  <dbl>   <dbl>  <dbl>
## 1 2017-08-20        0      0       0      0    0          0       0      0
## 2 2017-08-21        0      0       0      0    0          0       0      0
## 3 2017-08-22        4      1       0      0    0          0       0      0
## 4 2017-08-23        6      1       0      0    0          0       0      0
## 5 2017-08-24      309      0       0      0    0.126      0       0      0

Now we have a wide dataset: each date has a row, and each hurricane has two columns, one for online news and one for TV news. But since they had the same names originally, they’ve now been distinguished with .x and .y, which isn’t super helpful.

Instead, let’s use the rename() function we saw back in week 3 to change these names before we merge. In particular, we’ll use the rename_at() variant to add a suffix to all the variable names except for date. (There are other ways to do this! If you have extra time, explore some of them.)

mediacloud_hurricanes <- rename_at(mediacloud_hurricanes, vars(-date), funs(paste0(., "_online")))

tv_hurricanes <- rename_at(tv_hurricanes, vars(-date), funs(paste0(., "_tv")))

full <- full_join(mediacloud_hurricanes, tv_hurricanes, by = "date")

full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 9
##   date       harvey_online irma_online maria_online jose_online harvey_tv
##   <date>             <int>       <int>        <int>       <int>     <dbl>
## 1 2017-08-20             0           0            0           0     0    
## 2 2017-08-21             0           0            0           0     0    
## 3 2017-08-22             4           1            0           0     0    
## 4 2017-08-23             6           1            0           0     0    
## 5 2017-08-24           309           0            0           0     0.126
## # ... with 3 more variables: irma_tv <dbl>, maria_tv <dbl>, jose_tv <dbl>
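(An aside, one of those other ways: the _join() functions take a suffix = argument that replaces the default “.x”/“.y” endings at join time, so the renaming step could be skipped entirely. A sketch, assuming the datasets as originally loaded, before the rename_at() calls above:)

```r
# assumes mediacloud_hurricanes and tv_hurricanes as originally loaded,
# i.e., before the rename_at() step, when they still shared column names
full_join(mediacloud_hurricanes, tv_hurricanes,
          by = "date", suffix = c("_online", "_tv"))
```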

That looks like the wide dataset we envisioned! Now can you use the wide dataset to make a fully long dataset (with a row for each value of the number/percentage of sentences, and an indicator for which hurricane they refer to)?

longer <- full %>%
  gather(key = "key", value = "sentences", -date) %>%
  separate(key, into = c("hurricane", "media"))

longer
## # A tibble: 304 x 4
##    date       hurricane media  sentences
##    <date>     <chr>     <chr>      <dbl>
##  1 2017-08-20 harvey    online         0
##  2 2017-08-21 harvey    online         0
##  3 2017-08-22 harvey    online         4
##  4 2017-08-23 harvey    online         6
##  5 2017-08-24 harvey    online       309
##  6 2017-08-25 harvey    online      1348
##  7 2017-08-26 harvey    online       995
##  8 2017-08-27 harvey    online       960
##  9 2017-08-28 harvey    online      1956
## 10 2017-08-29 harvey    online      2283
## # ... with 294 more rows

What about a version of the dataset that contains columns for each hurricane, and observations for each unique date/media source combination? (This is almost like the first dataset we made with full_join() except that it has two copies of each of the fully zero rows, and an indicator for which type of media.)

long <- full %>%
  gather(key = "key", value = "sentences", -date) %>%
  separate(key, into = c("hurricane", "media")) %>%
  spread(key = "hurricane", value = "sentences")

long
## # A tibble: 76 x 6
##    date       media   harvey  irma  jose maria
##    <date>     <chr>    <dbl> <dbl> <dbl> <dbl>
##  1 2017-08-20 online   0         0     0     0
##  2 2017-08-20 tv       0         0     0     0
##  3 2017-08-21 online   0         0     0     0
##  4 2017-08-21 tv       0         0     0     0
##  5 2017-08-22 online   4         1     0     0
##  6 2017-08-22 tv       0         0     0     0
##  7 2017-08-23 online   6         1     0     0
##  8 2017-08-23 tv       0         0     0     0
##  9 2017-08-24 online 309         0     0     0
## 10 2017-08-24 tv       0.126     0     0     0
## # ... with 66 more rows
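A sketch of another route to this shape, again assuming the datasets as originally loaded (before the rename_at() step): stack them with bind_rows(), which can build the media indicator for you from the names you give it via .id =:

```r
# assumes the datasets before the rename_at() step above,
# when both still had columns date, harvey, irma, maria, jose
long2 <- bind_rows(
  online = mediacloud_hurricanes,
  tv = tv_hurricanes,
  .id = "media"
)
```

Note that this gives 75 rows rather than 76: the TV dataset simply has no row for 2017-09-26, whereas the full_join() route created one filled with NAs.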

Plotting the data

Now can you replicate these graphs using the longer dataset? (They don’t have to look as nice as the published ones!)

Here’s my version (hint: I used + ggthemes::theme_fivethirtyeight() to try to make it look more like a fivethirtyeight graph… it didn’t really help, in this case!). Try it on your own first, but you can look at the code if you need help.

longer %>%
  filter(hurricane != "jose") %>%
  ggplot() +
  geom_line(aes(x = date, y = sentences, col = hurricane)) +
  facet_grid(media~., scales = "free") +
  ggthemes::theme_fivethirtyeight()

Challenge

The second fivethirtyeight article adds information about the date of landfall to similar graphs. I’m providing you with this mini dataset of landfall dates.

dates <- tribble(
  ~date, ~landfall,
  "2017-08-21", "Harvey",
  "2017-09-10", "Irma",
  "2017-09-20", "Maria"
) %>%
  mutate(date = as.Date(date))

Let’s say we want to put labels on our graph from above. But we want our labels specifically to be positioned at the highest point on the graph on the landfall days. (See graph at the bottom if you’re not sure what I mean.) That means we need to add to the dataset above a value of sentences (i.e., the height on the y-axis) to go along with each date of landfall. Since we need a different one for each media type, we’ll need 6 rows altogether.

Can you use a join function with the data above to make this dataset? (See output below to check your answer as you go along.)

Hint 1: I used the condition row_number() == 1 at one point in my code.

Hint 2: I used a join function, then group_by(), then arrange(), then filter().

Note: This is not the only way to do this!

labeldat <- left_join(dates, longer, by = "date") %>%
  group_by(landfall, media) %>%
  arrange(-sentences) %>%
  filter(row_number() == 1)

labeldat
## # A tibble: 6 x 5
## # Groups:   landfall, media [6]
##   date       landfall hurricane media  sentences
##   <date>     <chr>    <chr>     <chr>      <dbl>
## 1 2017-09-10 Irma     irma      online  2544    
## 2 2017-09-20 Maria    maria     online   545    
## 3 2017-09-20 Maria    maria     tv         0.509
## 4 2017-09-10 Irma     irma      tv         0.499
## 5 2017-08-21 Harvey   harvey    online     0    
## 6 2017-08-21 Harvey   harvey    tv         0
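As the note says, this isn’t the only way. One alternative sketch replaces the arrange()/filter() pair with slice(), which picks rows by position within each group (which.max() returns the position of the first maximum, so we still get exactly one row per group even with ties):

```r
labeldat2 <- left_join(dates, longer, by = "date") %>%
  group_by(landfall, media) %>%
  slice(which.max(sentences))
```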

Did you know you can use two datasets for graphing in ggplot2? We can use the new labeldat dataset to add labels to the graph. We just need to specify the new dataset in the geom_label() call.

ggplot(data = filter(longer, hurricane != "jose")) +
  geom_line(aes(x = date, y = sentences, col = hurricane)) +
  geom_label(data = labeldat, aes(x = date, y = sentences, label = landfall)) +
  facet_grid(media~., scales = "free") +
  ggthemes::theme_fivethirtyeight()

I’d definitely fix this up before publishing it, but pretty cool!