Using lubridate for dates and times

December 4/5, 2018

Motivation and goals

Today’s session follows R for Data Science, Chapter 16.

Dates and times can cause confusion when we’re working with them in datasets. Not every year has 365 days: some are leap years. With daylight saving time, some days have 23 hours while others have 25. And every now and then a minute has 61 seconds because a leap second is added. In raw datasets, dates and times can also arrive in different forms: as strings, as separate date-time components, or as an existing date/time object.

Today, we’ll use tools from the lubridate package that can help us to deal with issues presented by dates and times in our dataset.

Set up

As usual, let’s load tidyverse. We’ll be using the lubridate package, so go ahead and load that as well. We’ll also use the nycflights13 dataset as an example. (If you haven’t installed that yet, type install.packages("nycflights13") before you load it.)

library(tidyverse)
library(lubridate)
library(nycflights13) 

Classes for dates and times

R doesn’t have a native class for storing times on their own. It only has classes for dates and date-times. If you need a time-of-day class, look into the hms package.
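As a rough sketch of what a time-of-day value looks like, the snippet below builds one with hms::hms(). It assumes the hms package is installed (it is usually pulled in alongside the tidyverse), and the variable name lunch is made up for illustration. The hms:: prefix matters here because lubridate also exports a function called hms().

```r
# Assumption: the hms package is installed; we call it with hms:: so that
# it is not confused with lubridate's own hms() function.
lunch <- hms::hms(seconds = 56, minutes = 34, hours = 12)

lunch              # prints as a clock time
class(lunch)       # "hms" "difftime"
as.numeric(lunch)  # number of seconds since midnight
```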

To get the current date, use today().

today()
## [1] "2018-12-04"

To get the current date-time, use now().

now()
## [1] "2018-12-04 00:10:12 EST"

For more on parsing strings into dates and times, see R for Data Science, Chapter 11.3.4.

Converting date/time data from strings

lubridate helper functions automatically work out the format of dates once you’ve specified the ordering of the components. For example:

ymd("2017-01-31")
## [1] "2017-01-31"
ymd(20170131)
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
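One thing worth checking for yourself: these helpers work on whole vectors, and a string that can’t be a real date comes back as NA. Here is a small sketch (the vector name raw is made up, and quiet = TRUE is only there to silence the parsing warning):

```r
library(lubridate)

# One valid date and one impossible date (February has no 30th)
raw <- c("2017-01-31", "2017-02-30")

parsed <- ymd(raw, quiet = TRUE)
parsed
is.na(parsed)  # FALSE for the valid date, TRUE for the impossible one
```

Counting the NAs after parsing is a quick way to spot rows that need a closer look.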

To create a date-time, add an underscore and one or more of h (hour), m (minute), and s (second) to the name of your parsing function:

ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"

Adding a time zone also creates a date-time object:

ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
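If you want to confirm which kind of object you ended up with, you can check its class. A minimal sketch (date_only and with_zone are illustrative names):

```r
library(lubridate)

date_only <- ymd(20170131)              # no time zone: a Date
with_zone <- ymd(20170131, tz = "UTC")  # time zone supplied: a date-time

class(date_only)  # "Date"
class(with_zone)  # "POSIXct" "POSIXt"
```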

Getting date-time from multiple columns

Let’s take a look at the flights data as an example:

flights %>% 
  select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ... with 336,766 more rows

Using mutate() and make_date() or make_datetime(), we can create date/time objects:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ... with 336,766 more rows
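make_date() works the same way when you only need the date. Here is a minimal sketch on a tiny hand-made tibble (parts is a hypothetical stand-in for the year, month, and day columns of flights):

```r
library(dplyr)
library(lubridate)

# Hypothetical miniature version of the flights date columns
parts <- tibble(
  year  = c(2013L, 2013L),
  month = c(1L, 12L),
  day   = c(1L, 31L)
)

parts <- parts %>%
  mutate(date = make_date(year, month, day))

parts
class(parts$date)  # "Date"
```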

Switching between date-time and date objects

Use as_datetime() and as_date():

as_datetime(today())
## [1] "2018-12-04 UTC"
as_date(now())
## [1] "2018-12-04"
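Because today() and now() change every time you run them, it can help to try the same conversions on a fixed value. A small sketch (late_night is an illustrative name):

```r
library(lubridate)

late_night <- ymd_hms("2018-12-04 23:59:59")

# Converting to a date drops the time of day
as_date(late_night)

# Converting a date to a date-time fills in midnight (UTC)
as_datetime(ymd("2018-12-04"))
```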

Exercise: Use the appropriate function to parse each of the following dates.

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014

Getting individual components of a date or time

You can pull out individual parts of the date and time with year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second() functions.

datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
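The time components work the same way. The sketch below also shows why wday() returned 6 for a Friday: by default lubridate counts from Sunday = 1. The week_start argument changes that (here we assume you would rather count from Monday):

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

hour(datetime)    # 12
minute(datetime)  # 34
second(datetime)  # 56

wday(datetime)                  # 6: Friday, counting from Sunday = 1
wday(datetime, week_start = 1)  # 5: Friday, counting from Monday = 1
```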

Setting label=TRUE for month() and wday() returns the abbreviated name of the month or day of the week. Try also setting abbr=FALSE.

month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Rounding

You may want to round your times to a nearby unit of time. To do this, you can use floor_date(), round_date(), or ceiling_date(). Each function takes a vector of dates and the name of the unit to round to. For example, to plot the number of flights per week:

# This code just cleans the data -- don't focus on this
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

# This code rounds each departure time down to the start of its week
flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()
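To see how the three rounding functions differ, it can help to apply them to a single date-time. A minimal sketch (x is an illustrative name):

```r
library(lubridate)

x <- ymd_hms("2016-07-08 12:34:56")

floor_date(x, "hour")    # rounds down: 2016-07-08 12:00:00
round_date(x, "hour")    # rounds to the nearest: 2016-07-08 13:00:00
ceiling_date(x, "hour")  # rounds up: 2016-07-08 13:00:00

floor_date(x, "day")     # 2016-07-08 00:00:00
ceiling_date(x, "day")   # 2016-07-09 00:00:00
```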

Setting components

You can change individual components of a date one at a time by putting an accessor function such as year(), month(), or day() on the left-hand side of an assignment. Alternatively, update() returns a new date-time with several components set at once. For example:

(datetime <- ymd_hms("2016-07-08 12:34:56"))
## [1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"
hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"
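Note that update() does not change the object you give it. It returns a modified copy, so you need to assign the result if you want to keep it. A small sketch (original and changed are illustrative names):

```r
library(lubridate)

original <- ymd_hms("2016-07-08 12:34:56")
changed  <- update(original, year = 2020, month = 2)

original  # still "2016-07-08 12:34:56 UTC"
changed   # "2020-02-08 12:34:56 UTC"
```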

If you set a value too big, it will roll over.

ymd("2015-02-01") %>% 
  update(hour = 400)
## [1] "2015-02-17 16:00:00 UTC"
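The same thing happens with days. As a sketch, asking for the 30th day of February 2015 (a month with only 28 days) should land two days into March:

```r
library(lubridate)

feb_first <- ymd("2015-02-01")

# February 2015 has 28 days, so day 30 rolls over into March
rolled <- update(feb_first, mday = 30)
rolled  # "2015-03-02"
```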

Time spans

In your analysis, you may need to do arithmetic with dates. For example, you may want to calculate someone’s age from their birthday and today’s date, or you may want to know how many seconds there are between right now and the end of your last final exam. You can calculate both of those timespans using lubridate. We’ll talk about three types of timespans in this lesson: durations, periods, and intervals. Durations measure a length of time in seconds, periods measure a length of time in units like hours or weeks, and intervals represent a starting point and an ending point.

Durations

In base R, if you subtract two dates, you get a difftime object.

# How old is my longest-surviving plant?
fred_age <- today() - ymd(20160812)
fred_age
## Time difference of 844 days

Difftime objects can be hard to work with because they can record time spans in seconds, minutes, hours, days, or weeks. lubridate provides an alternative that always uses seconds: the duration. You can construct durations with a family of constructor functions, and you can also add or multiply them.

as.duration(fred_age)
## [1] "72921600s (~2.31 years)"
dminutes(10)
## [1] "600s (~10 minutes)"
dhours(12)
## [1] "43200s (~12 hours)"
ddays(0:5)
## [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
2*dyears(1)
## [1] "63072000s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38847600s (~1.23 years)"
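Since a duration is just a number of seconds, dividing one duration by another gives a plain number. That is a handy way to express a time span in whatever unit you like. A minimal sketch using fixed dates so the result is reproducible (plant_age is an illustrative name):

```r
library(lubridate)

# Same plant as above, but with a fixed "today" of 2018-12-04
plant_age <- as.duration(ymd("2018-12-04") - ymd("2016-08-12"))

plant_age / ddays(1)      # 844 days
plant_age / dweeks(1)     # about 120.6 weeks
dhours(1.5) / dminutes(1) # 90 minutes
```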

Periods

There’s one main issue with durations that you should pay attention to: human times (the ones we use in our day-to-day life) don’t always have constant durations. As mentioned above, some days have 23 hours, some years have 366 days, and so on. To deal with this, lubridate has ‘periods,’ which are units of ‘human time’ such as a day or a month.

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm + ddays(1) # adding one duration day (exactly 86,400 seconds) to 1 pm on March 12 doesn't give quite what we'd expect
## [1] "2016-03-13 14:00:00 EDT"
one_pm + days(1) # because of daylight saving time, we should add a period day instead
## [1] "2016-03-13 13:00:00 EDT"
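Leap years show the same difference between durations and periods. In the sketch below (new_year is an illustrative name), adding a period of one year lands on the same calendar date, while adding 365 duration days falls one day short because 2016 has 366 days:

```r
library(lubridate)

new_year <- ymd("2016-01-01")  # 2016 is a leap year

# Period: "one calendar year later"
new_year + years(1)             # "2017-01-01"

# Duration: exactly 365 * 86400 seconds later
as_date(new_year + ddays(365))  # "2016-12-31"
```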

We can use periods to fix an oddity in the flight dataset: some flights appear to arrive at their destination before they departed from New York City. This is because they are overnight flights. We can address the issue by adding one day to all of the overnight flights.

flights_dt %>% 
  filter(arr_time < dep_time) 
## # A tibble: 10,633 x 9
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00
##  2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00
##  3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00
##  4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00
##  5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00
##  6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00
##  7 EWR    MCO          41        43 2013-01-01 21:21:00 2013-01-01 20:40:00
##  8 JFK    LAX          -7       -24 2013-01-01 21:28:00 2013-01-01 21:35:00
##  9 EWR    FLL          49        28 2013-01-01 21:34:00 2013-01-01 20:45:00
## 10 EWR    FLL          -9       -14 2013-01-01 21:36:00 2013-01-01 21:45:00
## # ... with 10,623 more rows, and 3 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>
flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

flights_dt %>% 
  filter(overnight, arr_time < dep_time) 
## # A tibble: 0 x 10
## # ... with 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
## #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   overnight <lgl>
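If the days(overnight * 1) trick looks mysterious, here is the same idea on a two-row toy table (toy and fixed are hypothetical names, not part of the flights data). Multiplying the logical column by 1 turns TRUE into 1 and FALSE into 0, so only the overnight rows get a day added:

```r
library(dplyr)
library(lubridate)

# Hypothetical mini dataset: the first flight "arrives" before it departs
toy <- tibble(
  dep_time = ymd_hms(c("2013-01-01 21:00:00", "2013-01-01 09:00:00")),
  arr_time = ymd_hms(c("2013-01-01 00:30:00", "2013-01-01 11:00:00"))
)

fixed <- toy %>%
  mutate(
    overnight = arr_time < dep_time,            # TRUE only for the first row
    arr_time  = arr_time + days(overnight * 1)  # adds 1 day or 0 days
  )

fixed
```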

Intervals

In some situations, knowing only a length of time (e.g. one year) doesn’t give us enough information to be precise. For example, what should years(1) / days(1) return? It should be 365 if the year is 2015 but 366 if the year is 2016, so without a starting point lubridate can only give an estimate.

We can use intervals to be more precise. An interval is a duration with a starting point, which makes it possible to work out exactly how long it is. For example:

next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365
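Because the result above depends on the day you run it, here is a sketch with fixed dates that shows the leap-year difference directly (year_2015 and year_2016 are illustrative names):

```r
library(lubridate)

year_2015 <- ymd("2015-01-01") %--% ymd("2016-01-01")
year_2016 <- ymd("2016-01-01") %--% ymd("2017-01-01")

year_2015 / ddays(1)  # 365
year_2016 / ddays(1)  # 366 (leap year)
```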

And that’s everything! In general, take extra care when using time variables in an analysis, as there can be details that surprise you. lubridate helps address many of these, but always check what is actually happening in your analysis!