_join() functions
You’ll likely have to merge multiple datasets at some point in your work. For example, I’ve worked with intergenerational datasets – one dataset for parents, another for their children – that are linked via the parents’ IDs. When I’m looking at, say, a parental exposure and a child outcome, those datasets need to be merged.
Other times you may find yourself with several datasets with different variables about the same people. For example, I’m working on a project that uses registry data from Sweden. They have about a million different registries – births, patient records, education results, census data, etc. – that can all be merged together to create a rich set of information about everyone in my cohort.
Other times you just end up with two datasets because of how you coded something, and you need to link them back together! Different situations call for different types of “joins.” In the tidyverse, these functions are all suffixed with _join(). We’ll cover a few of them today.
Again we’ll be using data from the fivethirtyeight package, in this case a compilation of datasets about reporting of Hurricane Maria in Puerto Rico last year (articles here and here). You can find brief descriptions of the various datasets here. Let’s load them. (We won’t actually be using all of them, but you can play around with the others if you wish!)
rm(list = ls())
library(tidyverse)
library(fivethirtyeight)
data("google_trends")
data("mediacloud_hurricanes")
data("mediacloud_online_news")
data("mediacloud_states")
data("mediacloud_trump")
data("tv_hurricanes")
data("tv_hurricanes_by_network")
data("tv_states")
And take a quick look at them (don’t worry about how this code works; it’s not important!).
walk(ls(), ~ {
  print(.)
  glimpse(get(.))
  cat("\n")
})
## [1] "google_trends"
## Observations: 37
## Variables: 5
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-...
## $ hurricane_harvey_us <int> 0, 0, 0, 1, 9, 29, 44, 18, 17, 16, 13, 10,...
## $ hurricane_irma_us <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 14, 13...
## $ hurricane_maria_us <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hurricane_jose_us <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
##
## [1] "mediacloud_hurricanes"
## Observations: 38
## Variables: 5
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2017-0...
## $ harvey <int> 0, 0, 4, 6, 309, 1348, 995, 960, 1956, 2283, 1603, 1503...
## $ irma <int> 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 61, 66, 36, 52, 221, 7...
## $ maria <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ jose <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31, ...
##
## [1] "mediacloud_online_news"
## Observations: 49
## Variables: 2
## $ name <chr> "al.com", "bleacherreport.com", "Business Insider", "Buzz...
## $ url <chr> "http://www.al.com", "http://bleacherreport.com", "http:/...
##
## [1] "mediacloud_states"
## Observations: 51
## Variables: 4
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2...
## $ texas <int> 295, 998, 696, 1130, 1851, 2693, 1953, 2962, 5072,...
## $ puerto_rico <int> 32, 24, 32, 36, 135, 39, 15, 11, 42, 14, 42, 40, 1...
## $ florida <int> 239, 527, 577, 596, 933, 533, 199, 211, 569, 508, ...
##
## [1] "mediacloud_trump"
## Observations: 51
## Variables: 7
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 201...
## $ puerto_rico <int> 0, 1, 0, 1, 8, 2, 0, 0, 6, 1, 0, 2, 0, 0...
## $ puerto_rico_and_trump <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ florida <int> 18, 41, 30, 39, 75, 28, 13, 13, 40, 38, ...
## $ florida_and_trump <int> 2, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 1, 0...
## $ texas <int> 12, 79, 39, 83, 143, 218, 212, 223, 289,...
## $ texas_and_trump <int> 0, 0, 0, 6, 3, 11, 3, 38, 41, 133, 22, 1...
##
## [1] "tv_hurricanes"
## Observations: 37
## Variables: 5
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2017-0...
## $ harvey <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.1259, 0.7544, 0.8878,...
## $ irma <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
## $ maria <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ jose <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,...
##
## [1] "tv_hurricanes_by_network"
## Observations: 84
## Variables: 6
## $ date <date> 2017-09-17, 2017-09-18, 2017-09-19, 2017-09-20, 2017...
## $ query <fct> Hurricane Maria, Hurricane Maria, Hurricane Maria, Hu...
## $ bbc_news <dbl> 0.0000, 0.2204, 0.8901, 0.9880, 0.4756, 0.2761, 0.399...
## $ cnn <dbl> 0.0290, 0.1866, 0.5818, 0.7985, 0.6454, 0.2778, 0.144...
## $ fox_news <dbl> 0.0079, 0.1877, 0.1779, 0.5387, 0.3177, 0.1211, 0.059...
## $ msnbc <dbl> 0.0208, 0.0473, 0.2621, 0.4835, 0.2406, 0.1998, 0.067...
##
## [1] "tv_states"
## Observations: 52
## Variables: 4
## $ date <date> 2017-08-20, 2017-08-21, 2017-08-22, 2017-08-23, 2...
## $ florida <dbl> 0.0923, 0.0441, 0.0359, 0.0431, 0.0492, 0.0442, 0....
## $ texas <dbl> 0.0286, 0.0904, 0.0334, 0.0839, 0.3978, 1.4639, 2....
## $ puerto_rico <dbl> 0.0000, 0.0099, 0.0000, 0.0103, 0.0184, 0.0000, 0....
Let’s imagine we want to merge mediacloud_hurricanes (the number of sentences per day that mention Hurricanes Harvey, Irma, Jose, and Maria in online news) and tv_hurricanes (the percentage of sentences per day in TV news that mention the hurricanes). Maybe we want to see whether hurricane news got earlier traction online vs. on TV.
Let’s first think about what we want. Our ideal dataset might be long, with a column for each of the four hurricanes and an indicator for which type of news the data is from (each row is a unique date/news-type combination); it might be longer, in which each hurricane also gets its own row; or it might be wide, with eight columns, one for each hurricane/news-type combination (each row is a unique date).
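To make those shapes concrete, here’s a rough comment sketch of each one, with made-up values (the column names below are just what we’ll aim for, not anything that exists yet):
# Sketches of the three candidate shapes (values made up):
#
# long -- one row per date/news type, one column per hurricane:
#   date        media   harvey  irma  maria  jose
#   2017-09-20  online       3     1    545     0
#   2017-09-20  tv         0.1     0    0.5     0
#
# longer -- one row per date/news type/hurricane:
#   date        media   hurricane  sentences
#   2017-09-20  online  harvey             3
#   2017-09-20  tv      harvey           0.1
#
# wide -- one row per date, one column per hurricane/news type pair:
#   date        harvey_online ... jose_online  harvey_tv ... jose_tv
#   2017-09-20              3               0        0.1        0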
full_join()
Let’s explore full_join(). This will merge two datasets and keep all of the observations from each one, even if there’s no match. For example, we can see that the mediacloud dataset extends one day longer than the TV dataset:
range(mediacloud_hurricanes$date)
## [1] "2017-08-20" "2017-09-26"
range(tv_hurricanes$date)
## [1] "2017-08-20" "2017-09-25"
That means that when we full_join() them, we’ll get the entire range of dates:
full <- full_join(mediacloud_hurricanes, tv_hurricanes)
range(full$date)
## [1] "2017-08-20" "2017-09-26"
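To see that “keep everything” behavior on something smaller, here’s a minimal sketch with two made-up tibbles (online and tv are hypothetical names, not the fivethirtyeight data):
# Two toy datasets that overlap on only one date
online <- tibble(date = as.Date(c("2017-09-25", "2017-09-26")),
                 n_online = c(10, 12))
tv <- tibble(date = as.Date(c("2017-09-24", "2017-09-25")),
             n_tv = c(0.5, 0.7))
# full_join() keeps every row from both sides: 2017-09-25 matches and
# collapses into a single row, while the unmatched 2017-09-24 and
# 2017-09-26 each come through with an NA in the column they're missing
full_join(online, tv, by = "date")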
left_join() and right_join()
This is in contrast to the one-sided joins, which keep all of the observations from one dataset and only the matching ones from the other. For example, since the TV dataset only goes through 2017-09-25, when we do right_join(mediacloud_hurricanes, tv_hurricanes), we’ll get only those same dates:
right <- right_join(mediacloud_hurricanes, tv_hurricanes)
range(right$date)
## [1] "2017-08-20" "2017-09-25"
Joining from the left, however, means that all of the mediacloud dataset’s dates will be kept – and since it spans the full range, that means all of them:
left <- left_join(mediacloud_hurricanes, tv_hurricanes)
range(left$date)
## [1] "2017-08-20" "2017-09-26"
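The one-sided joins behave the same way on the toy tibbles from above:
# left_join() keeps every row of `online`: tv's unmatched 2017-09-24 is
# dropped, and 2017-09-26 gets an NA for n_tv
left_join(online, tv, by = "date")
# right_join() keeps every row of `tv`: online's unmatched 2017-09-26 is
# dropped, and 2017-09-24 gets an NA for n_online
right_join(online, tv, by = "date")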
inner_join()
Finally, inner_join() is like a combination of both of the one-sided joins: it only returns observations with a matching value in both datasets.
inner <- inner_join(mediacloud_hurricanes, tv_hurricanes)
range(inner$date)
## [1] "2017-08-20" "2017-08-21"
Whoa! To understand what happened here, we’ll have to look more deeply into how datasets are being joined in the next section.
First, however, check out these animations (from the same source as those for gather() and spread() in week 10), which help distinguish between these types of joins.
by = argument
We haven’t yet looked at what’s happening when we join, however. Let’s start by looking at the first five days from each dataset:
mediacloud_hurricanes %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 5
## date harvey irma maria jose
## <date> <int> <int> <int> <int>
## 1 2017-08-20 0 0 0 0
## 2 2017-08-21 0 0 0 0
## 3 2017-08-22 4 1 0 0
## 4 2017-08-23 6 1 0 0
## 5 2017-08-24 309 0 0 0
tv_hurricanes %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 5
## date harvey irma maria jose
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20 0 0 0 0
## 2 2017-08-21 0 0 0 0
## 3 2017-08-22 0 0 0 0
## 4 2017-08-23 0 0 0 0
## 5 2017-08-24 0.126 0 0 0
And from the dataset we made using full_join():
full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 8 x 5
## date harvey irma maria jose
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20 0 0 0 0
## 2 2017-08-21 0 0 0 0
## 3 2017-08-22 4 1 0 0
## 4 2017-08-22 0 0 0 0
## 5 2017-08-23 6 1 0 0
## 6 2017-08-23 0 0 0 0
## 7 2017-08-24 309 0 0 0
## 8 2017-08-24 0.126 0 0 0
What’s going on here?
When we use one of the _join() functions, it will attempt to merge the two datasets on every column name that they share – in our case, all of them. In fact, it will print a message telling you that it is doing so:
full <- full_join(mediacloud_hurricanes, tv_hurricanes)
## Joining, by = c("date", "harvey", "irma", "maria", "jose")
When the values don’t match across all of the joining columns, the rows can’t be combined, so each one gets its own row in the result. That’s why the rows from 2017-08-20 and 2017-08-21 are fully combined: they share all the same values in each dataset – the date and zeros for all the hurricanes.
For each of the other dates, however, the values don’t match across the board. On 2017-08-22, the TV dataset still has zeros everywhere, whereas Hurricanes Harvey and Irma have both been mentioned in online news: they get values of 4 and 1, respectively, in the mediacloud dataset.
That’s why when we performed an inner join, only the two completely matching observations remained:
inner_join(mediacloud_hurricanes, tv_hurricanes)
## Joining, by = c("date", "harvey", "irma", "maria", "jose")
## # A tibble: 2 x 5
## date harvey irma maria jose
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20 0 0 0 0
## 2 2017-08-21 0 0 0 0
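Here’s the same mechanic in miniature, with two made-up tibbles that share both of their column names:
# Two toy datasets whose column names fully overlap
a <- tibble(day = 1:2, harvey = c(0, 5))
b <- tibble(day = 1:2, harvey = c(0, 0.2))
# With no `by` argument, the join uses c("day", "harvey"). Day 1 matches
# on both columns and collapses into one row; day 2's harvey values
# differ (5 vs. 0.2), so each version keeps its own row -- three rows total
full_join(a, b)
# Only the fully matching day 1 row survives an inner join
inner_join(a, b)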
Generally we want to join by a “key” – a value that is unique to an observation and is shared across datasets. Here the key is the date, so we should use the argument by = "date":
full <- full_join(mediacloud_hurricanes, tv_hurricanes, by = "date")
full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 9
## date harvey.x irma.x maria.x jose.x harvey.y irma.y maria.y jose.y
## <date> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20 0 0 0 0 0 0 0 0
## 2 2017-08-21 0 0 0 0 0 0 0 0
## 3 2017-08-22 4 1 0 0 0 0 0 0
## 4 2017-08-23 6 1 0 0 0 0 0 0
## 5 2017-08-24 309 0 0 0 0.126 0 0 0
Now we have a wide dataset: each date has a row, and each hurricane has two columns, one for online news and one for TV news. But since the columns had the same names originally, they’ve now been distinguished with the suffixes .x and .y, which isn’t super helpful.
Instead, let’s use the rename() function we saw back in week 3 to change these names before we merge. In particular, we’ll use the rename_at() variant to add a suffix to all the variable names except for date. (There are other ways to do this! If you have extra time, explore some of them.)
mediacloud_hurricanes <- rename_at(mediacloud_hurricanes, vars(-date), funs(paste0(., "_online")))
tv_hurricanes <- rename_at(tv_hurricanes, vars(-date), funs(paste0(., "_tv")))
full <- full_join(mediacloud_hurricanes, tv_hurricanes, by = "date")
full %>%
  arrange(date) %>%
  filter(date < as.Date("2017-08-25"))
## # A tibble: 5 x 9
## date harvey_online irma_online maria_online jose_online harvey_tv
## <date> <int> <int> <int> <int> <dbl>
## 1 2017-08-20 0 0 0 0 0
## 2 2017-08-21 0 0 0 0 0
## 3 2017-08-22 4 1 0 0 0
## 4 2017-08-23 6 1 0 0 0
## 5 2017-08-24 309 0 0 0 0.126
## # ... with 3 more variables: irma_tv <dbl>, maria_tv <dbl>, jose_tv <dbl>
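As an aside, here’s one of those “other ways”: dplyr’s mutating join functions (like full_join()) take a suffix argument that controls what gets appended to duplicated column names in place of the default .x/.y. A quick sketch (pulling fresh copies from the package, since we just overwrote our local column names):
# Same wide result, letting the join add the suffixes itself
full_join(fivethirtyeight::mediacloud_hurricanes,
          fivethirtyeight::tv_hurricanes,
          by = "date",
          suffix = c("_online", "_tv"))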
Either way, that looks like the wide dataset we envisioned! Now can you use the wide dataset to make a fully long dataset (with a row for each value of the number/percentage of sentences, and an indicator for which hurricane they refer to)?
longer <- full %>%
  gather(key = "key", value = "sentences", -date) %>%
  separate(key, into = c("hurricane", "media"))
longer
## # A tibble: 304 x 4
## date hurricane media sentences
## <date> <chr> <chr> <dbl>
## 1 2017-08-20 harvey online 0
## 2 2017-08-21 harvey online 0
## 3 2017-08-22 harvey online 4
## 4 2017-08-23 harvey online 6
## 5 2017-08-24 harvey online 309
## 6 2017-08-25 harvey online 1348
## 7 2017-08-26 harvey online 995
## 8 2017-08-27 harvey online 960
## 9 2017-08-28 harvey online 1956
## 10 2017-08-29 harvey online 2283
## # ... with 294 more rows
What about a version of the dataset that contains columns for each hurricane, and observations for each unique date/media source combination? (This is almost like the first dataset we made with full_join(), except that it has two copies of each of the fully zero rows, and an indicator for which type of media.)
long <- full %>%
  gather(key = "key", value = "sentences", -date) %>%
  separate(key, into = c("hurricane", "media")) %>%
  spread(key = "hurricane", value = "sentences")
long
## # A tibble: 76 x 6
## date media harvey irma jose maria
## <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2017-08-20 online 0 0 0 0
## 2 2017-08-20 tv 0 0 0 0
## 3 2017-08-21 online 0 0 0 0
## 4 2017-08-21 tv 0 0 0 0
## 5 2017-08-22 online 4 1 0 0
## 6 2017-08-22 tv 0 0 0 0
## 7 2017-08-23 online 6 1 0 0
## 8 2017-08-23 tv 0 0 0 0
## 9 2017-08-24 online 309 0 0 0
## 10 2017-08-24 tv 0.126 0 0 0
## # ... with 66 more rows
Now can you replicate these graphs using the longer dataset? (They don’t have to look as nice as the published ones!)
Here’s my version (hint: I used + ggthemes::theme_fivethirtyeight() to try to make it look more like a fivethirtyeight graph… it didn’t really help, in this case!). Try it on your own first, but you can look at the code if you need help.
longer %>%
  filter(hurricane != "jose") %>%
  ggplot() +
  geom_line(aes(x = date, y = sentences, col = hurricane)) +
  facet_grid(media ~ ., scales = "free") +
  ggthemes::theme_fivethirtyeight()
The second fivethirtyeight article adds information about the date of landfall to similar graphs. I’m providing you with this mini dataset of landfall dates.
dates <- tribble(
  ~date, ~landfall,
  "2017-08-21", "Harvey",
  "2017-09-10", "Irma",
  "2017-09-20", "Maria"
) %>%
  mutate(date = as.Date(date))
Let’s say we want to put labels on our graph from above. But we want our labels specifically to be positioned at the highest point on the graph on the landfall days. (See the graph at the bottom if you’re not sure what I mean.) That means we need to add to the dataset above a value of sentences (i.e., the height on the y-axis) to go along with each date of landfall. Since we need a different one for each media type, we’ll need 6 rows altogether.
Can you use a join function with the data above to make this dataset? (See output below to check your answer as you go along.)
Hint 1: I used the condition row_number() == 1 at one point in my code.
Hint 2: I used a join function, then group_by(), then arrange(), then filter().
Note: This is not the only way to do this!
labeldat <- left_join(dates, longer, by = "date") %>%
  group_by(landfall, media) %>%
  arrange(-sentences) %>%
  filter(row_number() == 1)
labeldat
## # A tibble: 6 x 5
## # Groups: landfall, media [6]
## date landfall hurricane media sentences
## <date> <chr> <chr> <chr> <dbl>
## 1 2017-09-10 Irma irma online 2544
## 2 2017-09-20 Maria maria online 545
## 3 2017-09-20 Maria maria tv 0.509
## 4 2017-09-10 Irma irma tv 0.499
## 5 2017-08-21 Harvey harvey online 0
## 6 2017-08-21 Harvey harvey tv 0
Did you know you can use two datasets for graphing in ggplot2? We can use the new labeldat dataset to add labels to the graph. We just need to specify the new dataset in the geom_label() call.
ggplot(data = filter(longer, hurricane != "jose")) +
  geom_line(aes(x = date, y = sentences, col = hurricane)) +
  geom_label(data = labeldat, aes(x = date, y = sentences, label = landfall)) +
  facet_grid(media ~ ., scales = "free") +
  ggthemes::theme_fivethirtyeight()
I’d definitely fix this up before publishing it, but pretty cool!