select()
Much of what we’re covering in today’s lesson can be found in R for Data Science, Chapter 5.4.
Today’s office hours will focus on selecting columns (i.e., variables, predictors, features) of a dataset using select()
. Imagine that you get a dataset with hundreds of variables, but you’re only interested in an analysis that involves ten of them. select()
allows you to focus on only those ten, or a subset of variables.
It’s also possible that you want to select variables based on certain conditions. Helper functions used within select()
, like starts_with()
, ends_with()
, contains()
, matches()
, and num_range()
, can help us do this.
Generally, good practice is to set the root directory (of a set of directories that you’re working with) as your working directory.
If you’ve been to previous weeks, you’ll probably note that each of us likes to set up our workspace slightly differently. I tend to prefer setting my working directory (i.e. the file that you’re working in - usually where you’ve saved your .Rmd or .R file that you’re working on) from the command line. That way, I generally know where I am among my directories and know what path to call should I need to load other files.
getwd() # find out where you are
setwd("/Users/isabellefeldhaus/Dropbox/Harvard/fall2018/phs2000-r/R-office-hours/OH-03") # set the directory in which you want to work
When you set your working directory, this means that any file or dataset in that directory that you want to use can be loaded using the name of that file rather than specifying the full path each time.
If you haven’t already, install and/or load the tidyverse
package:
install.packages("tidyverse") # run this line only if you've never installed or loaded `tidyverse` on your machine
library(tidyverse) # this loads the package you've already installed
For today’s lesson, we’ll use the very widely used nycflights13
dataset. This dataset contains all 336,776 flights that departed from New York City in 2013 and comes from the US Bureau of Transportation Statistics. It’s used in many examples for a range of programming languages and even in classes here at Harvard. It could be nice to be a little familiar with it, so we’ll take a look at it here.
Because it’s so widely used, R offers the dataset as a nice package:
install.packages("nycflights13")
library(nycflights13)
For this package, we can quickly see what datasets are available to us by typing nycflights13::
+ tab in the console. We should see a menu of options pop up. Use the arrow keys to select among the options and press tab again to fill in your command. This automatic fill-in feature is something that makes RStudio so great! (I also use it when I don’t remember what options are available for a particular command.)
Back to the data — I have a habit of taking a first look at my data by using either View()
or head()
, depending on the size of my dataset:
View(flights) # I use `View()` when I know that my dataset is relatively small and RStudio most likely won't have a problem loading the full table.
head(flights) # Otherwise, I'll just look at the first few observations of the dataset in my console.
Looking at the data in this way gives a rough idea of what we’re working with in terms of variables and values for particular variables. It gives a high level view of whether the dataset makes sense.
How many variables are in the flights
dataset?
There are a number of ways to approach this question:
ncol(flights)
## [1] 19
dim(flights)
## [1] 336776 19
names(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
select()
An Introduction to Statistical Data Sciences via R Chapter 5.9.1 shows these commands with piping, which will be covered in next week’s office hours.
select(flights, flight, origin, dest, dep_time, arr_time)
## # A tibble: 336,776 x 5
## flight origin dest dep_time arr_time
## <int> <chr> <chr> <int> <int>
## 1 1545 EWR IAH 517 830
## 2 1714 LGA IAH 533 850
## 3 1141 JFK MIA 542 923
## 4 725 JFK BQN 544 1004
## 5 461 LGA ATL 554 812
## 6 1696 EWR ORD 554 740
## 7 507 EWR FLL 555 913
## 8 5708 LGA IAD 557 709
## 9 79 JFK MCO 557 838
## 10 301 LGA ORD 558 753
## # ... with 336,766 more rows
If you’re curious about what a tibble is, you can read more here. For our purposes, it’s a data frame.
Notice that the resulting data frame has re-ordered the columns of the original dataset to match the order you’ve specified in the command.
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2. 830 819 11.
## 2 533 529 4. 850 830 20.
## 3 542 540 2. 923 850 33.
## 4 544 545 -1. 1004 1022 -18.
## 5 554 600 -6. 812 837 -25.
## 6 554 558 -4. 740 728 12.
## 7 555 600 -5. 913 854 19.
## 8 557 600 -3. 709 723 -14.
## 9 557 600 -3. 838 846 -8.
## 10 558 600 -2. 753 745 8.
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note that this command is inclusive, i.e., year
and day
are not included in the resulting data frame.
starts_with()
select(flights, starts_with("dep"))
## # A tibble: 336,776 x 2
## dep_time dep_delay
## <int> <dbl>
## 1 517 2.
## 2 533 4.
## 3 542 2.
## 4 544 -1.
## 5 554 -6.
## 6 554 -4.
## 7 555 -5.
## 8 557 -3.
## 9 557 -3.
## 10 558 -2.
## # ... with 336,766 more rows
This gives us all of the columns whose names begin with “dep”. Notice that another column, sched_dep_time
is not included in the output.
ends_with()
select(flights, ends_with("delay"))
## # A tibble: 336,776 x 2
## dep_delay arr_delay
## <dbl> <dbl>
## 1 2. 11.
## 2 4. 20.
## 3 2. 33.
## 4 -1. -18.
## 5 -6. -25.
## 6 -4. 12.
## 7 -5. 19.
## 8 -3. -14.
## 9 -3. -8.
## 10 -2. 8.
## # ... with 336,766 more rows
contains()
Say that we want to create a subset of all of the columns related to a flight’s arrival.
select(flights, contains("arr"))
## # A tibble: 336,776 x 4
## arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <chr>
## 1 830 819 11. UA
## 2 850 830 20. UA
## 3 923 850 33. AA
## 4 1004 1022 -18. B6
## 5 812 837 -25. DL
## 6 740 728 12. UA
## 7 913 854 19. B6
## 8 709 723 -14. EV
## 9 838 846 -8. B6
## 10 753 745 8. AA
## # ... with 336,766 more rows
This gives us all of the columns that contain the string “arr”, regardless of where it is in this string is in the column name. Notice that, different from the starts_with()
example above, this selection includes the sched_arr_time
variable — as well as another surprise variable.
matches()
Using matches()
results in a similar selection of columns as contains()
:
select(flights, matches("arr"))
## # A tibble: 336,776 x 4
## arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <chr>
## 1 830 819 11. UA
## 2 850 830 20. UA
## 3 923 850 33. AA
## 4 1004 1022 -18. B6
## 5 812 837 -25. DL
## 6 740 728 12. UA
## 7 913 854 19. B6
## 8 709 723 -14. EV
## 9 838 846 -8. B6
## 10 753 745 8. AA
## # ... with 336,766 more rows
Read more about regular expressions and using matches()
here. There’s also a general cheat sheet for regular expressions in R.
However, matches()
has the added capability of selecting variables based on matching a regular expression.
What’s the difference between these 3 commands?
select(flights, matches(".y."))
select(flights, matches("y."))
select(flights, matches(".y"))
# Selecting variables that contains: any character + "y" + any character
select(flights, matches(".y."))
## # A tibble: 336,776 x 0
# Selecting variables that begin with "y"
select(flights, matches("y."))
## # A tibble: 336,776 x 1
## year
## <int>
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
## # ... with 336,766 more rows
# Selecting variables that end with "y"
select(flights, matches(".y"))
## # A tibble: 336,776 x 3
## day dep_delay arr_delay
## <int> <dbl> <dbl>
## 1 1 2. 11.
## 2 1 4. 20.
## 3 1 2. 33.
## 4 1 -1. -18.
## 5 1 -6. -25.
## 6 1 -4. 12.
## 7 1 -5. 19.
## 8 1 -3. -14.
## 9 1 -3. -8.
## 10 1 -2. 8.
## # ... with 336,766 more rows
num_range()
num_range()
is used when the suffixes of a set of column names are numbers, i.e. x1
, x2
, x3
, etc. The command only takes a prefix and a number or range of numbers, so the number needs to be last in the variable name. (If the numbers are in the middle, like unit_1_score
, unit_2_score
, unit_3_score
, consider using matches()
and a regular expression to define the string pattern.)
Another resource on select()
helper functions from R-bloggers.
To see how num_range()
works, we need to rename the columns of our dataset quickly.
flights_x <- flights
colnames(flights_x) <- sprintf("x%d", 1:19)
names(flights_x)
## [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10" "x11"
## [12] "x12" "x13" "x14" "x15" "x16" "x17" "x18" "x19"
Alright, now we can see that each of our variables starts with the same prefix and ends with a number. In this case, the other helper functions that we’ve just covered don’t help us much. Rather, we’d want to use num_range()
.
select(flights_x, num_range("x", 6:10))
## # A tibble: 336,776 x 5
## x6 x7 x8 x9 x10
## <dbl> <int> <int> <dbl> <chr>
## 1 2. 830 819 11. UA
## 2 4. 850 830 20. UA
## 3 2. 923 850 33. AA
## 4 -1. 1004 1022 -18. B6
## 5 -6. 812 837 -25. DL
## 6 -4. 740 728 12. UA
## 7 -5. 913 854 19. B6
## 8 -3. 709 723 -14. EV
## 9 -3. 838 846 -8. B6
## 10 -2. 753 745 8. AA
## # ... with 336,766 more rows
rename()
When cleaning a dataset, you may want to rename variables. select()
can be used for this. However, calling select()
also drops variables not mentioned in your command line. rename()
is a variant of select()
that keeps all of the variables and renames the ones you want.
flights_x <- rename(flights_x, year = x1, month = x2)
names(flights_x)
## [1] "year" "month" "x3" "x4" "x5" "x6" "x7" "x8"
## [9] "x9" "x10" "x11" "x12" "x13" "x14" "x15" "x16"
## [17] "x17" "x18" "x19"
rename()
from tidyverse
or dplyr
to what you would do using base R:names(flights_x)[17:19] <- c("hour", "minute", "time_hour")
names(flights_x)
## [1] "year" "month" "x3" "x4" "x5"
## [6] "x6" "x7" "x8" "x9" "x10"
## [11] "x11" "x12" "x13" "x14" "x15"
## [16] "x16" "hour" "minute" "time_hour"
1. What happens if you include the name of a variable multiple times in a select()
call?
2. What’s the result of running the following code?
select(flights, contains("TIME"))
3. How can you change the default of the above code?
For reference, you can take a look at some documentation for select()
helper functions.