Motivations and goals

This week’s lesson focuses on the summarise() and group_by() functions. summarise() allows you to collapse a variable into a single number, such as a mean or a count. group_by() allows you to analyze your dataset by different subgroups. These functions are often used together, which is why we’re covering them together this week.

By the end of this lesson, you should feel comfortable calculating summary statistics for your dataset, both overall and grouped by different variables.

Set-up

Start by installing the tidyverse package if you haven’t done so yet. You can install the package by typing install.packages("tidyverse") into the console of RStudio and then hitting enter.

Then load the tidyverse as follows:

library(tidyverse)

Next, we need to load in our data. We’re going to use the same dataset we used in Week 2. This dataset comes from a study of women using emergency medical services (EMS) for third-trimester pregnancy-related complications in India in 2014. (I found these data using the Google Dataset Search tool.)

dat <- read_csv("https://louisahsmith.github.io/R-office-hours/data/OH-02-data.csv")

Cleaning numeric variables

To illustrate the usefulness of summarise(), we’re going to focus on two numeric variables in the dataset: distance_to_hospital and ob_antenatal_visits. Before we can start analyzing these variables, we need to do some simple data cleaning. The code below cleans these two variables and creates histograms of each of them.

# Turn distance_to_hospital into a numeric variable:
dat$distance_to_hospital <- as.numeric(dat$distance_to_hospital)

# Make a histogram of distance_to_hospital
ggplot(data = dat, aes(distance_to_hospital)) + geom_histogram()

# If a woman attended 10+ antenatal visits, replace this with "10"
# (Don't do this in real life. You will learn methods for dealing with truncated data like this in PHS2000!)
dat$ob_antenatal_visits[dat$ob_antenatal_visits=="10+"] <- 10

# Turn ob_antenatal_visits into a numeric variable
dat$ob_antenatal_visits <- as.numeric(dat$ob_antenatal_visits)

# Make a histogram of ob_antenatal_visits
ggplot(data = dat, aes(ob_antenatal_visits)) + geom_histogram()
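
To see why the "10+" entries have to be recoded before the conversion, here is a minimal sketch on a made-up character vector (the values are invented for illustration and are not taken from the dataset). as.numeric() turns anything it cannot read as a number into NA and issues a warning, so converting first would silently discard those women.

```r
# Made-up values standing in for ob_antenatal_visits (not real data)
visits <- c("3", "10+", "7", NA)

# Converting directly: "10+" cannot be parsed, so it becomes NA
# (suppressWarnings() hides the coercion warning R would normally print)
direct <- suppressWarnings(as.numeric(visits))

# Recoding "10+" first keeps that observation
recoded <- visits
recoded[recoded == "10+"] <- "10"
recoded <- as.numeric(recoded)

direct   # 3 NA 7 NA
recoded  # 3 10 7 NA
```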

Calculating means using group_by and summarise

Now that we have cleaned these two variables of interest, we’re ready to use summarise()!

Where else could you look (apart from in this R tutorial) for instructions on how to use a new function? One of the most useful skills I’ve learned in grad school is how to find answers to my coding questions using a combination of Google and the ‘Help’ section of R. Try typing summarise into the Help tab in RStudio. It will give you a list of functions, including “dplyr::summarise,” which is the one we’re interested in. If you click on this, it will show the documentation for the summarise() function. If you scroll down to the examples, you’ll see different ways to use this function.

Let’s first calculate the mean of each of the variables we’re interested in. Start by just using mean(variable_name) inside summarise():

dat %>% summarise(mean_ANCvisits= mean(ob_antenatal_visits), mean_distance = mean(distance_to_hospital))

Why didn’t this work?

Try again, giving R different instructions about how to handle missing values.

dat %>% summarise(mean_ANCvisits= mean(ob_antenatal_visits, na.rm=T), mean_distance = mean(distance_to_hospital, na.rm=T))
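
The difference between the two calls is easiest to see on a tiny example. The sketch below uses an invented data frame (toy and its values are hypothetical) with a single missing value: by default mean() returns NA as soon as any value is missing, while na.rm = TRUE drops the missing values before averaging.

```r
library(dplyr)

# Toy data with one missing value (invented for illustration)
toy <- data.frame(visits = c(2, 4, NA, 6))

toy_means <- toy %>%
  summarise(mean_default = mean(visits),                # NA: one value is missing
            mean_na_rm   = mean(visits, na.rm = TRUE))  # 4: the mean of 2, 4 and 6

toy_means
```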

What is the mean number of antenatal care (ANC) visits each woman attended? What is the mean distance travelled to the hospital?

This is interesting, but it would be more interesting if we could look at these results by sub-groups in the dataset. Let’s look at the averages by incident_area. To do this, we need to tell R to group the dataset by incident_area before it calculates the means.

dat %>%
  group_by(incident_area) %>%
  summarise(mean_ANCvisits= mean(ob_antenatal_visits, na.rm=T), mean_distance = mean(distance_to_hospital, na.rm=T))

What if we group according to whether women had prior c-sections?

dat %>%
  group_by(Prior_csection) %>%
  summarise(mean_ANCvisits= mean(ob_antenatal_visits, na.rm=T), mean_distance = mean(distance_to_hospital, na.rm=T))

What if we wanted to filter the dataset to include only women in rural areas, and then summarize according to whether they had prior c-sections?
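
One possible way to approach this is sketched below on invented data. The category labels ("Rural", "Urban", "Yes", "No") are assumptions made for the example, so check how incident_area and Prior_csection are actually coded in dat before adapting it. The key idea is the order of the steps: filter() first, then group_by(), then summarise().

```r
library(dplyr)

# Toy data; the category labels below are assumed, not taken from the dataset
toy <- data.frame(
  incident_area       = c("Rural", "Rural", "Rural", "Urban", "Urban"),
  Prior_csection      = c("Yes", "No", "No", "Yes", "No"),
  ob_antenatal_visits = c(2, 4, NA, 5, 3)
)

rural_summary <- toy %>%
  filter(incident_area == "Rural") %>%  # keep rural rows only
  group_by(Prior_csection) %>%          # then split by prior c-section
  summarise(mean_ANCvisits = mean(ob_antenatal_visits, na.rm = TRUE))

rural_summary
```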

Counting observations in different sub-groups

Do you remember how we did this last week when we were looking at factor variables?

Let’s say we’re interested not only in the mean of ANC visits for each type of incident_area but also in the number of women in each incident_area in our sample. We can calculate both things at once using group_by() and summarise() along with n(), which counts the number of observations.

dat %>%
  group_by(incident_area) %>%
  summarise(n = n(),
            mean_ANCvisits= mean(ob_antenatal_visits, na.rm=T), 
            mean_distance = mean(distance_to_hospital, na.rm=T))

An aside: I used to count the observations in different sub-groups by creating a variable that was equal to “1” for every observation in the dataset, and then using group_by() and summarise() to add up all of the 1s in each subgroup. This is not the most efficient approach. But, if you had to do it this way, how would you write the code?
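
If you want to check your answer, here is one possible version on invented data (the column name one and the toy values are hypothetical). It adds the helper column with mutate(), sums it within each group, and compares the result with n().

```r
library(dplyr)

# Toy data (invented for illustration)
toy <- data.frame(incident_area = c("Rural", "Rural", "Urban", "Rural", "Urban"))

counts <- toy %>%
  mutate(one = 1) %>%             # helper column equal to 1 in every row
  group_by(incident_area) %>%
  summarise(n_by_sum = sum(one),  # add up the 1s in each subgroup
            n_by_n   = n())       # the built-in counter gives the same answer

counts
```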

One last thing: arranging rows and selecting rows based on their position

In some situations, we may be interested in re-ordering our dataset by a particular variable or in selecting observations based on their position in the dataset. For example, if we want to re-arrange our dataset based on distance_to_hospital, we can write: arrange(dat, distance_to_hospital) or arrange(dat, desc(distance_to_hospital)). How are these different?
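
A small sketch on invented distances (the toy data frame is hypothetical) makes the comparison concrete. Note where the missing value ends up in each case: arrange() puts NA at the end whether or not desc() is used.

```r
library(dplyr)

# Toy data with one missing distance (invented for illustration)
toy <- data.frame(id = 1:4, distance_to_hospital = c(12, 3, NA, 40))

ascending  <- arrange(toy, distance_to_hospital)        # smallest first
descending <- arrange(toy, desc(distance_to_hospital))  # largest first

ascending$distance_to_hospital   # 3 12 40 NA
descending$distance_to_hospital  # 40 12 3 NA
```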

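We can then identify the women who traveled the shortest or longest distances by selecting the first or last observation. first() and last() each take a single column, so we write first(dat$var_name) or last(dat$var_name), or simply first(var_name) or last(var_name) inside summarise(), where var_name is the variable we’re interested in.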
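
As a minimal sketch on invented distances (toy is hypothetical, with no missing values), dplyr's first() and last() return the first and last element of a column once the rows have been sorted:

```r
library(dplyr)

# Toy data (invented for illustration)
toy <- data.frame(distance_to_hospital = c(12, 3, 40))

# Sort from shortest to longest distance
sorted <- arrange(toy, distance_to_hospital)

shortest <- first(sorted$distance_to_hospital)  # 3
longest  <- last(sorted$distance_to_hospital)   # 40
```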

How is this relevant for group_by() and summarise()?

Let’s say we want to look at shortest and longest distance traveled, grouping women by incident_area:

dat %>%
  arrange(distance_to_hospital) %>%
  group_by(incident_area) %>%
  summarise(shortest_dist= first(distance_to_hospital), 
            longest_dist = last(distance_to_hospital),
            mean_dist = mean(distance_to_hospital, na.rm=T))

Why doesn’t this work? arrange() always places missing values at the end, so in any incident_area where some distances are missing, the last row has a missing distance and last() returns NA instead of the longest distance traveled. Let’s filter the dataset to remove cases where distance is missing and then re-run the code:

dat %>%
  filter(!is.na(distance_to_hospital)) %>%
  arrange(distance_to_hospital) %>%
  group_by(incident_area) %>%
  summarise(shortest_dist= first(distance_to_hospital), 
            longest_dist = last(distance_to_hospital),
            mean_dist = mean(distance_to_hospital, na.rm=T))
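
For comparison, here is a different route to the same kind of summary, sketched on invented data. It is not the approach used above: instead of sorting and taking the first and last rows, it calls min() and max() with na.rm = TRUE, so no arrange() or filter() step is needed. One caveat: if a group contained only missing distances, min() and max() would return Inf and -Inf with a warning.

```r
library(dplyr)

# Toy data with one missing distance (invented for illustration)
toy <- data.frame(
  incident_area        = c("Rural", "Rural", "Rural", "Urban", "Urban"),
  distance_to_hospital = c(12, NA, 40, 5, 9)
)

dist_summary <- toy %>%
  group_by(incident_area) %>%
  summarise(shortest_dist = min(distance_to_hospital, na.rm = TRUE),
            longest_dist  = max(distance_to_hospital, na.rm = TRUE),
            mean_dist     = mean(distance_to_hospital, na.rm = TRUE))

dist_summary
```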

References

This lesson closely follows R for Data Science, Chapter 5.6.

The data used in this lesson came from: Strehlow MC, Newberry JA, Bills CB, Min H, Evensen AE, Leeman L, Pirrotta EA, Rao GVR, Mahadevan SV (2016) Characteristics and outcomes of women using emergency medical services for third-trimester pregnancy-related problems in India: a prospective observational study. BMJ Open 6(7): e011459. https://doi.org/10.1136/bmjopen-2016-011459

They are available in the Dryad digital data repository here: Strehlow MC, Newberry JA, Bills CB, Min H, Evensen AE, Leeman L, Pirrotta EA, Rao GVR, Mahadevan SV (2016) Data from: Characteristics and outcomes of women utilizing emergency medical services for third-trimester pregnancy-related complaints in India: a prospective observational study. Dryad Digital Repository. https://doi.org/10.5061/dryad.g08gb

Office Hours 6