We rarely receive data in analysis-ready form, with all of the variables we need, the right subset of observations, and the right structure. The dplyr package in R, part of the tidyverse family introduced last week, provides a helpful set of data transformation functions that can get us there.
Today, we’re going to start by learning how to use the function filter() and a suite of “logical operators” (‘and’, ‘or’, ‘not’, etc.) to select the rows we want to use from a dataset. We’ll also practice using ggplot to build on last week’s lesson.
We’ll spend the first 15-20 minutes on this lesson. Then, we’ll shift to discussing coding questions from this week’s lab and problem set.
Before we begin, we need to load the packages we’ll be using for this analysis.
If you haven’t installed the tidyverse package yet, you’ll need to do that before you can load it. You can install the package by typing install.packages("tidyverse") into the console of RStudio and then hitting enter.
library(tidyverse)
Next, we need to load in our data. We’re going to use a dataset from a study of women using emergency medical services (EMS) for third-trimester pregnancy-related complications in India in 2014. I found these data using the new Google Dataset Search tool.
# if you downloaded it
dat <- read_csv("OH-02-data.csv")
# if not
dat <- read_csv("https://louisahsmith.github.io/R-office-hours/data/OH-02-data.csv")
Let’s start by just looking at what is in this dataset. The function names() shows us the names of the variables in the dataset.
names(dat)
We’re interested in how far women traveled to the health facility with the EMS. Let’s make a histogram of the distance traveled.
Try making a histogram using the code from last week: ggplot(data = dat, aes(distance_to_hospital)) + geom_histogram(). Why doesn’t this work?
You probably got an error message saying Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
Since this variable is currently stored as a character variable, we need to convert it to a numeric variable before we can make the histogram.
## Change distance variable to numeric
dat$distance_to_hospital <- as.numeric(dat$distance_to_hospital)
## Make histogram
ggplot(data = dat, aes(distance_to_hospital)) +
  geom_histogram() +
  ggtitle("Distance traveled to the hospital") +
  xlab("Distance (km)")
How would you describe this distribution?
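One caution about the conversion step above: as.numeric() turns any entry it cannot read as a number into NA (with a warning), so it can be worth counting how many values were lost. Here is a minimal sketch using a made-up vector; the values are illustrative and are not taken from the dataset.

```r
# Hypothetical character vector standing in for a column that was read as text
x <- c("12", "7.5", "unknown", NA)
class(x)                              # "character"

# Entries that are not numbers become NA; R normally warns about this
x_num <- suppressWarnings(as.numeric(x))
x_num                                 # 12.0 7.5 NA NA

# How many values turned into NA because of the conversion?
sum(is.na(x_num)) - sum(is.na(x))     # 1
```

Running the same count on dat$distance_to_hospital before and after the conversion would show whether any non-numeric entries were silently dropped.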
We can also calculate some descriptive statistics. Note that we tell R to ignore the missing values by typing na.rm=T.
## Mean distance
mean(dat$distance_to_hospital, na.rm=T)
## Sample variance
var(dat$distance_to_hospital, na.rm=T)
## Range of distances
range(dat$distance_to_hospital, na.rm=T)
Let’s say we want to know the distribution of distances traveled by different subsets of women in the dataset. The function filter() allows us to keep only the observations that meet conditions we specify. The first argument of filter() is the name of the dataset. The next arguments are the conditions that the observations in our subset must satisfy.
For example, if we’re only interested in observations from Andhra Pradesh, we can filter the dataset as follows:
filter(dat, state == "Andhra Pradesh")
If we want to save this subset as its own data frame, we can use the code:
dat_ap <- filter(dat, state == "Andhra Pradesh")
Now, we can re-make the histogram or calculate summary statistics using only data from Andhra Pradesh.
ggplot(data = dat_ap, aes(distance_to_hospital)) +
  geom_histogram() +
  ggtitle("Distance traveled to the hospital") +
  xlab("Distance (km)")
mean(dat_ap$distance_to_hospital, na.rm=T)
## [1] 18.01044
nrow(dat_ap) # tells us how many rows the new dataset has
## [1] 480
We can identify the observations we’re interested in using comparison operators: >, >=, <, <=, !=, and ==. Note that we use == and not = to denote “equal.” A single equal sign (=) can be used instead of <- to assign values to things in R.
If we only want to analyze data for women who traveled more than 20 km, we can use the code filter(dat, distance_to_hospital > 20).
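To see how the comparison operators behave at the boundary, here is a small sketch on a made-up data frame. The column names mirror the lesson, but the object toy and its values are invented for illustration.

```r
library(dplyr)

# Made-up data: column names follow the lesson, values do not
toy <- tibble(
  state = c("Andhra Pradesh", "Gujarat", "Assam", "Gujarat"),
  distance_to_hospital = c(5, 25, 40, 20)
)

# Strictly more than 20 km: the row with exactly 20 km is excluded
more_than_20 <- filter(toy, distance_to_hospital > 20)
nrow(more_than_20)   # 2

# At least 20 km: the row with exactly 20 km is kept
at_least_20 <- filter(toy, distance_to_hospital >= 20)
nrow(at_least_20)    # 3

# Every state except Assam
not_assam <- filter(toy, state != "Assam")
nrow(not_assam)      # 3
```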
We can also combine multiple conditions using Boolean operators: & (“and”), | (“or”), and ! (“not”).
Let’s say we’re interested in women in either Andhra Pradesh or Gujarat who have had previous C-sections. We can identify this group using the code filter(dat, (state=="Andhra Pradesh" | state=="Gujarat") & Prior_csection=="Yes").
We can also write this as: filter(dat, (state %in% c("Andhra Pradesh", "Gujarat")) & Prior_csection=="Yes"). The part that says state %in% c("Andhra Pradesh", "Gujarat") tells R to keep rows whose state is one of the values in the list, here Andhra Pradesh or Gujarat.
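The sketch below checks, on a made-up data frame, that the | version and the %in% version select the same rows, and shows why the parentheses around the | condition matter. The object toy and its values are invented for illustration.

```r
library(dplyr)

# Made-up data: column names follow the lesson, values do not
toy <- tibble(
  state = c("Andhra Pradesh", "Gujarat", "Assam", "Gujarat", "Andhra Pradesh"),
  Prior_csection = c("Yes", "No", "Yes", "Yes", "No")
)

with_or <- filter(toy, (state == "Andhra Pradesh" | state == "Gujarat") & Prior_csection == "Yes")
with_in <- filter(toy, state %in% c("Andhra Pradesh", "Gujarat") & Prior_csection == "Yes")

# Both versions keep the same two rows
nrow(with_or)   # 2
nrow(with_in)   # 2

# Without parentheses, R evaluates & before |, so this keeps ALL Andhra Pradesh
# rows plus the Gujarat rows with a prior C-section
no_parens <- filter(toy, state == "Andhra Pradesh" | state == "Gujarat" & Prior_csection == "Yes")
nrow(no_parens) # 3
```

With %in%, the list of states can grow without adding more | conditions, which makes the parentheses mistake harder to make.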
One way to check if you are subsetting the data correctly is to use the table() function. For example, table(dat$state) will tell you the number of observations from each state in your dataset. What would the code be if we were interested in subsetting the data to women with either anemia or hypertension who are not in Assam?
In the code above, we ignored missing values when we calculated the mean by using na.rm=T (which you can read as “remove NAs is true”). If we had tried to calculate the mean without specifying this, R would have told us that the mean itself was NA. This is because, in R, almost any operation that involves a missing value returns a missing value.
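A short sketch of how missing values propagate, using a made-up vector. It also spells out TRUE in full: T is only a shorthand for TRUE, and unlike TRUE it can be overwritten by accident.

```r
# Made-up vector with one missing value
x <- c(10, NA, 30)

mean(x)                 # NA: the missing value makes the mean unknown
mean(x, na.rm = TRUE)   # 20: NA is dropped before averaging

# Comparisons and arithmetic with NA are also NA
NA > 5                  # NA
NA + 1                  # NA
NA == NA                # NA: two unknown values may or may not be equal
```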
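Let’s say we want to include missing values in our subset. By default, filter() keeps only the rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped. If we’re interested in women who have not had a previous C-section or whose C-section history is missing, we can subset our data as filter(dat, Prior_csection=="No" | is.na(Prior_csection)).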
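Note that we don’t write Prior_csection==NA. Comparing anything to NA returns NA rather than TRUE or FALSE, so that condition would never select a row. The function is.na() is the way to ask whether a value is missing.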
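The following sketch shows the difference on a made-up data frame with one missing value. The object toy is invented for illustration; only the column name comes from the lesson.

```r
library(dplyr)

# Made-up data with one missing value
toy <- tibble(Prior_csection = c("Yes", "No", NA, "No"))

# The row with NA is silently dropped
only_no <- filter(toy, Prior_csection == "No")
nrow(only_no)        # 2

# Ask for the missing rows explicitly to keep them
no_or_missing <- filter(toy, Prior_csection == "No" | is.na(Prior_csection))
nrow(no_or_missing)  # 3

# Comparing to NA with == never selects anything
with_equals <- filter(toy, Prior_csection == NA)
nrow(with_equals)    # 0

# is.na() finds the missing row
with_is_na <- filter(toy, is.na(Prior_csection))
nrow(with_is_na)     # 1
```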
It’s important to be careful when using the comparison operator == with decimal numbers. The computer stores numbers with finite precision, so some comparisons that are mathematically true come out FALSE in R!
sqrt(2) ^ 2 == 2
## [1] FALSE
1 / 49 * 49 == 1
## [1] FALSE
To get around this, we can use the function near().
near(sqrt(2) ^ 2, 2)
## [1] TRUE
near(1 / 49 * 49, 1)
## [1] TRUE
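near() treats two numbers as equal when they differ by less than a small tolerance, which can be changed through its tol argument. The sketch below is illustrative; the tolerance of 0.01 is an arbitrary choice for the example, not a recommendation.

```r
library(dplyr)

# The rounding error is tiny but not zero, which is why == fails
sqrt(2)^2 - 2

near(sqrt(2)^2, 2)               # TRUE: difference is below the default tolerance
near(1, 1.001)                   # FALSE: 0.001 is far above the default tolerance
near(1, 1.001, tol = 0.01)       # TRUE: a looser tolerance accepts the difference

# Base R alternative that does not need dplyr
isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE
```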
Here are some exercises to practice using filter() and the Boolean operators. For more exercises like this, refer to Chapter 5 of “R for Data Science.”
How many women delivered in Gujarat, Assam, or Karnataka and were in the 18-35 age range?
Use filter to create separate datasets for women who used EMS services in urban, rural, and tribal areas. Calculate the mean distance traveled in each of these groups.
A convenient way to compare distributions of a variable for different subsets in your data is to use ggplot with facet_wrap. Make a histogram of the distance traveled to the facility (using the full dataset) and then add facet_wrap(~incidence_area) to see how this works.
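If you want to see the general pattern before trying it on the EMS data, here is a sketch using mpg, a dataset that ships with ggplot2. The variables hwy and drv belong to mpg and have nothing to do with this lesson’s data; bins = 20 is an arbitrary choice.

```r
library(ggplot2)

# One histogram panel per value of drv (the type of drive train)
p <- ggplot(data = mpg, aes(hwy)) +
  geom_histogram(bins = 20) +
  facet_wrap(~drv) +
  ggtitle("Highway mileage by drive type") +
  xlab("Highway miles per gallon")

p
```

Each panel shares the same axes by default, which makes the shapes of the distributions easy to compare across groups.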
This lesson closely follows R for Data Science Chapter 5.2.
The data used in this lesson came from: Strehlow MC, Newberry JA, Bills CB, Min H, Evensen AE, Leeman L, Pirrotta EA, Rao GVR, Mahadevan SV (2016) Characteristics and outcomes of women using emergency medical services for third-trimester pregnancy-related problems in India: a prospective observational study. BMJ Open 6(7): e011459. https://doi.org/10.1136/bmjopen-2016-011459
They are available in the Dryad digital data repository here: Strehlow MC, Newberry JA, Bills CB, Min H, Evensen AE, Leeman L, Pirrotta EA, Rao GVR, Mahadevan SV (2016) Data from: Characteristics and outcomes of women utilizing emergency medical services for third-trimester pregnancy-related complaints in India: a prospective observational study. Dryad Digital Repository. https://doi.org/10.5061/dryad.g08gb