De-bugging R code

December 11-12, 2018

Motivation and goals

This semester, we’ve covered a lot of material about how to clean, wrangle, and visualize data in R. These are important foundational skills for whatever coding you might need to do in the future. However, they won’t be enough. Inevitably, you will run up against challenges in R: the code you wrote might not run, or you might not be able to figure out the right steps to get your data into the format you want for a particular analysis. This is where de-bugging and searching skills come in. One of the most useful things you can learn about coding is how to problem-solve when your code isn’t doing what you want it to. In this week’s session we’ll give you some tips for these situations. We’ll start by talking about how to figure out where you went wrong if your code doesn’t run. Then, we’ll talk about how to search for answers when you can’t solve a coding problem on your own. Finally, we’ll discuss some common problems to be aware of in R.

Throughout this lesson, we’ll use examples from the fivethirtyeight package, as we did in Office Hours 7.

Let’s start by loading the tidyverse and the fivethirtyeight package:

library(fivethirtyeight)
library(tidyverse)

Figuring out where you went wrong

A first step in de-bugging R is to try to figure out where you went wrong. We’re going to discuss a few different ways of doing this.

Locate the bug within your code by running small sections of code one-at-a-time.

Start from the top of your code, and run small segments one-at-a-time. Alternatively, if you have no idea where the error is located in a long piece of code, run the first half of your code to see if it is there, and then gradually narrow down into the segment where the error is located.

If your code is formatted as a for-loop, set i to a particular value, and then run each line of the for-loop for that value of i. This will allow you to see which line of the loop caused a problem.
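Here is a minimal sketch of the for-loop approach. The list values and the calculation inside the loop are made up for illustration; they are not part of the fivethirtyeight data.

```r
# Hypothetical example: a loop that fails partway through
values <- list(10, 20, "thirty", 40)
results <- numeric(length(values))

# Running the whole loop stops with an error:
# for (i in seq_along(values)) {
#   results[i] <- values[[i]] / 2
# }

# Instead, set i by hand and run the body of the loop one line at a time
i <- 3
class(values[[i]])   # "character": this element is text, not a number
# values[[i]] / 2    # running this line on its own reproduces the error

# try() lets the loop keep going, so you can record every iteration that fails
failed <- integer(0)
for (i in seq_along(values)) {
  out <- try(values[[i]] / 2, silent = TRUE)
  if (inherits(out, "try-error")) {
    failed <- c(failed, i)
  } else {
    results[i] <- out
  }
}
failed               # 3
```

After a loop stops with an error, i still holds the value from the iteration that failed, so printing i is often the quickest way to see where things broke.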

Look at the data you’re working with (or, at least some of it)

We’re going to work with the dataset comic_characters from the fivethirtyeight package as an example.

If you want to look at the data, you have several options:

dat <- comic_characters

# Look at the whole dataset
View(dat)

# Look at a few rows of the dataset
head(dat)
## # A tibble: 6 x 16
##   publisher page_id name  urlslug id    align eye   hair  sex   gsm   alive
##   <chr>       <int> <chr> <chr>   <chr> <ord> <chr> <chr> <chr> <chr> <chr>
## 1 Marvel       1678 Spid… "\\/Sp… Secr… Good… Haze… Brow… Male… <NA>  Livi…
## 2 Marvel       7139 Capt… "\\/Ca… Publ… Good… Blue… Whit… Male… <NA>  Livi…
## 3 Marvel      64786 "Wol… "\\/Wo… Publ… <NA>  Blue… Blac… Male… <NA>  Livi…
## 4 Marvel       1868 "Iro… "\\/Ir… Publ… Good… Blue… Blac… Male… <NA>  Livi…
## 5 Marvel       2460 Thor… "\\/Th… No D… Good… Blue… Blon… Male… <NA>  Livi…
## 6 Marvel       2458 Benj… "\\/Be… Publ… Good… Blue… No H… Male… <NA>  Livi…
## # ... with 5 more variables: appearances <int>, first_appearance <chr>,
## #   month <chr>, year <int>, date <date>
# Look at the variable names
names(dat)
##  [1] "publisher"        "page_id"          "name"            
##  [4] "urlslug"          "id"               "align"           
##  [7] "eye"              "hair"             "sex"             
## [10] "gsm"              "alive"            "appearances"     
## [13] "first_appearance" "month"            "year"            
## [16] "date"
# Look at a few variables of the dataset
View(data.frame(dat$publisher, dat$year, dat$name))

Clear everything and run your code again

Something might have gone wrong if you ran code out of order or ran the same lines multiple times. In these cases, it can help to clear everything in your environment (by clicking on the broom icon in the Environment pane in RStudio) and then run your code again from the top.

You can also clear the environment by running rm(list=ls()).

Restart R

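You may be inadvertently using some variable that you saved globally, or a package or option that was set earlier in your session. Clearing the environment removes saved objects, but it does not unload packages or reset options. Try re-starting R (in RStudio, Session > Restart R) to see if your code will run without an error.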

If you have complex functions, try using options(warn = 1)

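By default (warn = 0), R holds on to warnings and prints them only after the top-level call has finished, which makes it hard to tell which step produced them. Running options(warn = 1) makes R print each warning as soon as it occurs. Setting options(warn = 2) goes a step further and turns warnings into errors, so the code stops at the line that caused the warning.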
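As a rough illustration, the function below (slow_convert is a made-up name, not part of any package) prints a message before each step. With options(warn = 1), the warning should appear between the two messages rather than at the very end.

```r
# Hypothetical function that triggers a warning partway through
slow_convert <- function(x) {
  message("Step 1: converting to numeric")
  out <- as.numeric(x)        # warns if a value cannot be converted
  message("Step 2: taking the mean")
  mean(out, na.rm = TRUE)
}

old <- options(warn = 1)      # print each warning as soon as it occurs
result <- slow_convert(c("1", "2", "three"))
options(old)                  # restore the previous setting

result                        # 1.5
```

Saving the old setting and restoring it afterwards keeps the change from leaking into the rest of your session.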

Make a mini example

If you’re not sure how R is getting to the (incorrect) calculations it’s giving you, try running the code on a small subset of your dataset so that you can calculate things by hand and compare to the R results.
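A small sketch of this idea, using a made-up data frame (sales, store, and amount are hypothetical names) in place of a real dataset:

```r
# Hypothetical data standing in for a much larger dataset
sales <- data.frame(
  store  = c("north", "north", "south", "south", "south", "north"),
  amount = c(10, 20, 5, 15, 25, 30)
)

# Keep only a few rows so the answer can be worked out by hand
mini <- sales[1:4, ]

# By hand: north = (10 + 20) / 2 = 15, south = (5 + 15) / 2 = 10
by_store <- tapply(mini$amount, mini$store, mean)
by_store
```

If the R results match your hand calculation on the mini example, the logic is probably fine and the problem is more likely to be in the data itself.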

Searching for answers

If you can’t figure out the problem by looking through your data and code, there are a number of useful resources you can turn to. Some good places to start are:

  1. Google your error message.

  2. Google what you’re trying to do (e.g. ‘lag a variable twice within groups in my dataset in R’). Be specific! You would be surprised by how often you find that someone else has had the exact same issue (or a very similar one) and written about it on Stack Overflow or another help website.

  3. Post your question on Stack Exchange or another help website. (Make sure to look around for a bit first to see if someone has already answered your question, or be prepared for some users to be annoyed with you.)
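If you do post a question, it usually helps to include a tiny dataset that reproduces the problem. One possible way to do that in base R is sketched below; the data frame small is invented for illustration.

```r
# Hypothetical data frame that reproduces the problem
small <- data.frame(id = 1:3, score = c(4.5, NA, 7))

# dput() prints R code that recreates the object,
# so other people can paste it straight into their own session
dput(small)

# It is also worth sharing your R version and loaded packages
sessionInfo()
```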

Common problems

Something went wrong with missing values.

You might be having issues with missing values in your code. One thing I commonly do to check for missing values is to count how many there are in each row and each column of my dataset. You can do this as follows:

# To count up the missing values in each row:
dat$missings_byrow <- apply(dat, 1, function(x) sum(is.na(x)))
table(dat$missings_byrow)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11 
##   81 5401 7180 5316 3260 1323  436  157   51   50   14    3
# To count up the missing values in each column:
missings_bycol <- apply(dat, 2, function(x) sum(is.na(x)))
missings_bycol
##        publisher          page_id             name          urlslug 
##                0                0                0                0 
##               id            align              eye             hair 
##             5783             6186            13395             6538 
##              sex              gsm            alive      appearances 
##              979            23118                6             1451 
## first_appearance            month             year             date 
##               69              815              884              886 
##   missings_byrow 
##                0

It may also be helpful to look at the rows where a particular value is missing in order to understand what is going on. To do this, you can filter and look at the dataset:

dat_check <- filter(dat, is.na(align))
View(dat_check)
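It also helps to know how missing values behave in calculations and comparisons, since this is where many surprises come from. The vector x below is a made-up example:

```r
# Hypothetical vector with one missing value
x <- c(3, NA, 8)

mean(x)                 # NA: a calculation that involves NA returns NA
mean(x, na.rm = TRUE)   # 5.5: drop the missing value first

x == NA                 # NA NA NA: == cannot be used to find missing values
is.na(x)                # FALSE TRUE FALSE: use is.na() instead

x[x > 5]                # NA 8: the NA comparison carries into the subset
x[which(x > 5)]         # 8: which() keeps only the positions that are TRUE
```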

The function you are using is part of multiple packages in R.

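In some situations, a particular function name may be used by multiple packages in R. If you have loaded both of these packages, R uses the version from the package that was loaded most recently, which may not be the one you intended. (When you load the tidyverse, for example, dplyr’s filter() masks the filter() function from the stats package.) In these situations, you can type the name of the package you’re interested in using followed by :: (e.g. dplyr::filter()) before the function name to ensure R uses the function you want it to.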
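The sketch below shows a few base R tools for checking which version of a function you are getting. It uses only the stats package so that it runs without the tidyverse; the moving-average calculation is just a stand-in example.

```r
# Which package does a plain call to filter() come from right now?
# The answer depends on which packages you have loaded.
environmentName(environment(filter))

# List function names that are defined in more than one attached package
conflicts()

# Using the package prefix removes any ambiguity
moving_avg <- stats::filter(1:5, rep(1/3, 3))   # 3-point moving average
as.numeric(moving_avg)
```

If the first line reports a package you did not expect, that is a good sign that a masked function is behind your error.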

Other issues:

Exercises

We’re going to start with the dataset classic_rock_song_list. This dataset contains the data used in this article. We’re going to try to recreate a table from the article, showing the 25 most commonly played songs across 25 classic rock stations:

The correct code to make this table is included below, but don’t look ahead…the point of this lesson is to learn how to troubleshoot, not to look at the answers!

Hint: start by loading the dataset into R:

dat <- classic_rock_song_list

# Load the data
dat <- classic_rock_song_list

# Remove the observations that are missing release years, and organize in order of playcount
fortable <- dat %>%
  filter(!is.na(release_year)) %>%
  arrange(desc(playcount_has_year)) %>%
  select(song, artist, release_year)

# Select the first 15 rows
fortable <- fortable[1:15,]

# Look at the results
View(fortable)

Next, let’s look at the dataset US_births_1994_2003, which was used in this article. We’re going to try to recreate this figure, showing that people are less likely to have babies on Friday the 13th:

Hint: start by loading the dataset into R:

dat <- US_births_1994_2003

I haven’t included the answer here (because I haven’t fully figured it out myself!), but this is a good example to practice on.

References

This lesson draws on the sources below. Some of these include much fancier techniques for de-bugging code than the ones we discussed today.

https://seananderson.ca/2013/08/23/debugging-r/

http://adv-r.had.co.nz/Exceptions-Debugging.html

https://github.com/berkeley-scf/tutorial-R-debugging/blob/master/R-debugging.Rmd