Office Hours 7

October 16/17, 2018

Today’s goals


So far in R Office Hours (whether in person or by following along on your own at home) we have learned a number of important functions to manipulate, organize, and visualize data.

What’s new

Nothing is new today! That is, we’re not going to explicitly teach you any new functions. Instead, we’re going to practice putting it all together. As always, however, you may find it helpful to look for a function that meets your specific needs in the R documentation, by Googling, or by asking your friendly R Office Hours TAs! These challenges should help you see the value of what you’ve learned so far, and give you some examples of things you may want to do with your own data!


I love the fivethirtyeight package because it lets you run your own analyses on the data behind FiveThirtyEight’s articles and compare your conclusions to theirs. Today we’ll attempt to recreate some of the numbers that were used to make the tables and figures in their articles.

Begin by loading the required packages. If you’ve never used the fivethirtyeight package, you’ll have to run install.packages("fivethirtyeight").
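A minimal setup might look like the following (this assumes you also want the tidyverse, which provides the dplyr, ggplot2, and forcats functions used in the solutions below):

```r
# Run once if you've never installed these packages
# (commented out so it doesn't reinstall every time):
# install.packages("fivethirtyeight")
# install.packages("tidyverse")

library(fivethirtyeight)  # the datasets behind FiveThirtyEight articles
library(tidyverse)        # loads dplyr, ggplot2, forcats, etc.
```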


You can make any of the available datasets show up in your R environment with the command data(dataset), where dataset is the name of the dataset. You may want to look at the data using glimpse(dataset) or summary(dataset) or another function you like. You can find all the names of the datasets and links to the articles they were used for here.
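For instance, with the bechdel dataset used in the first challenge (note that glimpse() comes from the tidyverse, so load it first):

```r
data(bechdel)     # load the dataset into your environment
glimpse(bechdel)  # one line per column: name, type, first few values
summary(bechdel)  # basic summary statistics for each column
```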


For each of these challenges, I’ll show or link to the picture containing the table or figure you should recreate the data for. Feel free to attempt them in any order you want!

But remember that there are usually many ways to do the same thing in R, so there’s no single correct answer… but some may be shorter or longer, faster or slower, easier or harder to read than others. You can also see the output, but the code (i.e., the solution to the challenge) is hidden. You can click the Show Code button to check your answers… but no cheating!


Here we’ll use a dataset on films to assess whether they pass the Bechdel Test. You can read the article here, where you can also find this bar plot showing the stats we want to recreate.

You don’t have to recreate the figure, just the data that goes into it, as in the output below. The dataset is called bechdel. (Note that I got 31.1 instead of 31.7 on the first line, so there might be an error in the figure.)

bechdel %>%
  filter(year >= 1990 & clean_test != "dubious") %>%
  group_by(clean_test) %>%
  summarise(med_bud = round(median(budget_2013) / 1000000, 1)) %>%
  ungroup() %>%
  mutate(clean_test = factor(clean_test,
    levels = c("nowomen", "notalk", "men", "ok"),
    labels = c(
      "Fewer than two women",
      "Women don't talk to each other",
      "Women only talk about men",
      "Passes Bechdel Test"
    ),
    ordered = TRUE
  )) %>%
  arrange(desc(clean_test))
## # A tibble: 4 x 2
##   clean_test                     med_bud
##   <ord>                            <dbl>
## 1 Passes Bechdel Test               31.1
## 2 Women only talk about men         39.7
## 3 Women don't talk to each other    56.6
## 4 Fewer than two women              43.4


For this challenge we’ll be using data from this article, which reported the results of a survey about how people get information about the weather. Your goal is to re-create this table:

It should look something like the output below. The dataset is called weather_check.

weather_check %>%
  filter(! %>%
  group_by(weather_source) %>%
  summarise(n = n()) %>%
  mutate(percentage = round(100 * (n / sum(n)), 1)) %>%
  select(-n) %>%
  arrange(desc(percentage))
## # A tibble: 8 x 2
##   weather_source                                        percentage
##   <chr>                                                      <dbl>
## 1 The default weather app on your phone                       23.2
## 2 Local TV News                                               20.6
## 3 A specific website or app (please provide the answer)       19.1
## 4 The Weather Channel                                         15.2
## 5 Internet search                                             14.2
## 6 Newspaper                                                    3.5
## 7 Radio weather                                                3.4
## 8 Newsletter                                                   0.9


Finally, we will use data on guests on The Daily Show (with Jon Stewart), as seen in this article. We’ll try to prepare the data for this graph:

If you want, you can recreate the figure itself using the geom_line() function in ggplot2. Your data and graph should look something like this (since we can’t see the actual numbers in the graph, this is just my best guess as to what’s correct). The dataset is called daily_show_guests.

to_plot <- daily_show_guests %>%
  mutate(occupation = fct_recode(group,
    "Acting, Comedy, & Music" = "Musician",
    "Acting, Comedy, & Music" = "Acting",
    "Acting, Comedy, & Music" = "Comedy",
    "Government and Politics" = "Politician",
    "Government and Politics" = "Political Aide",
    "Government and Politics" = "Government",
    "Media" = "media"
  )) %>%
  group_by(year) %>%
  mutate(n_year = n()) %>%
  group_by(year, occupation) %>%
  summarise(
    n_occ_year = n(),
    n_year = first(n_year),
    pct_year = n_occ_year / n_year
  ) %>%
  filter(occupation %in% c(
    "Acting, Comedy, & Music",
    "Government and Politics",
    "Media"
  )) %>%
  select(-n_occ_year, -n_year)
## # A tibble: 51 x 3
## # Groups:   year [17]
##     year occupation              pct_year
##    <int> <fct>                      <dbl>
##  1  1999 Acting, Comedy, & Music   0.904 
##  2  1999 Government and Politics   0.0120
##  3  1999 Media                     0.0663
##  4  2000 Acting, Comedy, & Music   0.740 
##  5  2000 Government and Politics   0.0828
##  6  2000 Media                     0.124 
##  7  2001 Acting, Comedy, & Music   0.726 
##  8  2001 Government and Politics   0.0382
##  9  2001 Media                     0.197 
## 10  2002 Acting, Comedy, & Music   0.623 
## # ... with 41 more rows
ggplot(to_plot) +
  geom_line(aes(year, pct_year, col = occupation))