Selecting columns with select()

September 19, 2018

Motivation and goals

Much of what we’re covering in today’s lesson can be found in R for Data Science, Chapter 5.4.

Today’s office hours will focus on selecting columns (i.e., variables, predictors, features) of a dataset using select(). Imagine that you get a dataset with hundreds of variables, but you’re only interested in an analysis that involves ten of them. select() allows you to focus on only those ten, or a subset of variables.

It’s also possible that you want to select variables based on certain conditions. Helper functions used within select(), like starts_with(), ends_with(), contains(), matches(), and num_range(), can help us do this.

Set-up

Generally, good practice is to set the root directory (of a set of directories that you’re working with) as your working directory.

If you’ve been to previous weeks, you’ll probably note that each of us likes to set up our workspace slightly differently. I tend to prefer setting my working directory (i.e. the file that you’re working in - usually where you’ve saved your .Rmd or .R file that you’re working on) from the command line. That way, I generally know where I am among my directories and know what path to call should I need to load other files.

getwd() # find out where you are
setwd("/Users/isabellefeldhaus/Dropbox/Harvard/fall2018/phs2000-r/R-office-hours/OH-03") # set the directory in which you want to work

When you set your working directory, this means that any file or dataset in that directory that you want to use can be loaded using the name of that file rather than specifying the full path each time.

If you haven’t already, install and/or load the tidyverse package:

install.packages("tidyverse") # run this line only if you've never installed or loaded `tidyverse` on your machine
library(tidyverse) # this loads the package you've already installed

For today’s lesson, we’ll use the very widely used nycflights13 dataset. This dataset contains all 336,776 flights that departed from New York City in 2013 and comes from the US Bureau of Transportation Statistics. It’s used in many examples for a range of programming languages and even in classes here at Harvard. It could be nice to be a little familiar with it, so we’ll take a look at it here.

Because it’s so widely used, R offers the dataset as a nice package:

install.packages("nycflights13") 
library(nycflights13)

Look at the data

For this package, we can quickly see what datasets are available to us by typing nycflights13::+ tab in the console. We should see a menu of options pop up. Use the arrow keys to select among the options and press tab again to fill in your command. This automatic fill-in feature is something that makes RStudio so great! (I also use it when I don’t remember what options are available for a particular command.)

Back to the data — I have a habit of taking a first look at my data by using either View() or head(), depending on the size of my dataset:

View(flights) # I use `View()` when I know that my dataset is relatively small and RStudio most likely won't have a problem loading the full table.

head(flights) # Otherwise, I'll just look at the first few observations of the dataset in my console. 

Looking at the data in this way gives a rough idea of what we’re working with in terms of variables and values for particular variables. It gives a high level view of whether the dataset makes sense.

How many variables are in the flights dataset?

There are a number of ways to approach this question:

ncol(flights)
## [1] 19
dim(flights)
## [1] 336776     19
names(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Selecting columns with select()

An Introduction to Statistical Data Sciences via R Chapter 5.9.1 shows these commands with piping, which will be covered in next week’s office hours.

Select columns by name

select(flights, flight, origin, dest, dep_time, arr_time)
## # A tibble: 336,776 x 5
##    flight origin dest  dep_time arr_time
##     <int> <chr>  <chr>    <int>    <int>
##  1   1545 EWR    IAH        517      830
##  2   1714 LGA    IAH        533      850
##  3   1141 JFK    MIA        542      923
##  4    725 JFK    BQN        544     1004
##  5    461 LGA    ATL        554      812
##  6   1696 EWR    ORD        554      740
##  7    507 EWR    FLL        555      913
##  8   5708 LGA    IAD        557      709
##  9     79 JFK    MCO        557      838
## 10    301 LGA    ORD        558      753
## # ... with 336,766 more rows

If you’re curious about what a tibble is, you can read more here. For our purposes, it’s a data frame.

Notice that the resulting data frame has re-ordered the columns of the original dataset to match the order you’ve specified in the command.

Select all columns between two specified columns

select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

Select all columns except specified columns

select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515        2.      830            819       11.
##  2      533            529        4.      850            830       20.
##  3      542            540        2.      923            850       33.
##  4      544            545       -1.     1004           1022      -18.
##  5      554            600       -6.      812            837      -25.
##  6      554            558       -4.      740            728       12.
##  7      555            600       -5.      913            854       19.
##  8      557            600       -3.      709            723      -14.
##  9      557            600       -3.      838            846       -8.
## 10      558            600       -2.      753            745        8.
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note that this command is inclusive, i.e., year and day are not included in the resulting data frame.

Select all columns that start with “dep” using starts_with()

select(flights, starts_with("dep"))
## # A tibble: 336,776 x 2
##    dep_time dep_delay
##       <int>     <dbl>
##  1      517        2.
##  2      533        4.
##  3      542        2.
##  4      544       -1.
##  5      554       -6.
##  6      554       -4.
##  7      555       -5.
##  8      557       -3.
##  9      557       -3.
## 10      558       -2.
## # ... with 336,766 more rows

This gives us all of the columns whose names begin with “dep”. Notice that another column, sched_dep_time is not included in the output.

Select all columns that end with “delay” using ends_with()

select(flights, ends_with("delay"))
## # A tibble: 336,776 x 2
##    dep_delay arr_delay
##        <dbl>     <dbl>
##  1        2.       11.
##  2        4.       20.
##  3        2.       33.
##  4       -1.      -18.
##  5       -6.      -25.
##  6       -4.       12.
##  7       -5.       19.
##  8       -3.      -14.
##  9       -3.       -8.
## 10       -2.        8.
## # ... with 336,766 more rows

Select all columns containing “arr” using contains()

Say that we want to create a subset of all of the columns related to a flight’s arrival.

select(flights, contains("arr"))
## # A tibble: 336,776 x 4
##    arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl> <chr>  
##  1      830            819       11. UA     
##  2      850            830       20. UA     
##  3      923            850       33. AA     
##  4     1004           1022      -18. B6     
##  5      812            837      -25. DL     
##  6      740            728       12. UA     
##  7      913            854       19. B6     
##  8      709            723      -14. EV     
##  9      838            846       -8. B6     
## 10      753            745        8. AA     
## # ... with 336,766 more rows

This gives us all of the columns that contain the string “arr”, regardless of where it is in this string is in the column name. Notice that, different from the starts_with() example above, this selection includes the sched_arr_time variable — as well as another surprise variable.

Select columns that exactly match a specified string using matches()

Using matches() results in a similar selection of columns as contains():

select(flights, matches("arr"))
## # A tibble: 336,776 x 4
##    arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl> <chr>  
##  1      830            819       11. UA     
##  2      850            830       20. UA     
##  3      923            850       33. AA     
##  4     1004           1022      -18. B6     
##  5      812            837      -25. DL     
##  6      740            728       12. UA     
##  7      913            854       19. B6     
##  8      709            723      -14. EV     
##  9      838            846       -8. B6     
## 10      753            745        8. AA     
## # ... with 336,766 more rows

Read more about regular expressions and using matches() here. There’s also a general cheat sheet for regular expressions in R.

However, matches() has the added capability of selecting variables based on matching a regular expression.

What’s the difference between these 3 commands?

select(flights, matches(".y."))
select(flights, matches("y."))
select(flights, matches(".y"))
# Selecting variables that contains: any character + "y" + any character 
select(flights, matches(".y."))
## # A tibble: 336,776 x 0
# Selecting variables that begin with "y"
select(flights, matches("y."))
## # A tibble: 336,776 x 1
##     year
##    <int>
##  1  2013
##  2  2013
##  3  2013
##  4  2013
##  5  2013
##  6  2013
##  7  2013
##  8  2013
##  9  2013
## 10  2013
## # ... with 336,766 more rows
# Selecting variables that end with "y"
select(flights, matches(".y"))
## # A tibble: 336,776 x 3
##      day dep_delay arr_delay
##    <int>     <dbl>     <dbl>
##  1     1        2.       11.
##  2     1        4.       20.
##  3     1        2.       33.
##  4     1       -1.      -18.
##  5     1       -6.      -25.
##  6     1       -4.       12.
##  7     1       -5.       19.
##  8     1       -3.      -14.
##  9     1       -3.       -8.
## 10     1       -2.        8.
## # ... with 336,766 more rows

Select columns by numbered index using num_range()

num_range() is used when the suffixes of a set of column names are numbers, i.e. x1, x2, x3, etc. The command only takes a prefix and a number or range of numbers, so the number needs to be last in the variable name. (If the numbers are in the middle, like unit_1_score, unit_2_score, unit_3_score, consider using matches() and a regular expression to define the string pattern.)

Another resource on select() helper functions from R-bloggers.

To see how num_range() works, we need to rename the columns of our dataset quickly.

flights_x <- flights
colnames(flights_x) <- sprintf("x%d", 1:19)
names(flights_x)
##  [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10" "x11"
## [12] "x12" "x13" "x14" "x15" "x16" "x17" "x18" "x19"

Alright, now we can see that each of our variables starts with the same prefix and ends with a number. In this case, the other helper functions that we’ve just covered don’t help us much. Rather, we’d want to use num_range().

select(flights_x, num_range("x", 6:10))
## # A tibble: 336,776 x 5
##       x6    x7    x8    x9 x10  
##    <dbl> <int> <int> <dbl> <chr>
##  1    2.   830   819   11. UA   
##  2    4.   850   830   20. UA   
##  3    2.   923   850   33. AA   
##  4   -1.  1004  1022  -18. B6   
##  5   -6.   812   837  -25. DL   
##  6   -4.   740   728   12. UA   
##  7   -5.   913   854   19. B6   
##  8   -3.   709   723  -14. EV   
##  9   -3.   838   846   -8. B6   
## 10   -2.   753   745    8. AA   
## # ... with 336,766 more rows

Renaming variables with rename()

When cleaning a dataset, you may want to rename variables. select() can be used for this. However, calling select() also drops variables not mentioned in your command line. rename() is a variant of select() that keeps all of the variables and renames the ones you want.

flights_x <- rename(flights_x, year = x1, month = x2)
names(flights_x)
##  [1] "year"  "month" "x3"    "x4"    "x5"    "x6"    "x7"    "x8"   
##  [9] "x9"    "x10"   "x11"   "x12"   "x13"   "x14"   "x15"   "x16"  
## [17] "x17"   "x18"   "x19"

Compare rename() from tidyverse or dplyr to what you would do using base R:

names(flights_x)[17:19] <- c("hour", "minute", "time_hour")
names(flights_x)
##  [1] "year"      "month"     "x3"        "x4"        "x5"       
##  [6] "x6"        "x7"        "x8"        "x9"        "x10"      
## [11] "x11"       "x12"       "x13"       "x14"       "x15"      
## [16] "x16"       "hour"      "minute"    "time_hour"

Some additional thoughts

1. What happens if you include the name of a variable multiple times in a select() call?

2. What’s the result of running the following code?

select(flights, contains("TIME"))

3. How can you change the default of the above code?

For reference, you can take a look at some documentation for select() helper functions.