Today’s lesson follows part of Chapter 21 in R for Data Science. Today we’ll look at iteration. Often when you’re programming, you have to do something over and over. Maybe you’re running a simulation and you need to calculate a statistic over and over again in different datasets. Maybe you’re doing an analysis and you need to redo it over and over again with different outcomes.
You may have heard that you should avoid for loops in R because they’re slow, but that’s simply not the case anymore. Other options, such as the apply or map families of functions, may make your code easier to read and write, but for loops are often simpler to understand at first. We’ve already looked at one tool for handling repetition: writing your own functions. Today we’ll look at for loops and other types of loops.
This has no scientific meaning; it’s just an example! We’ll use the gapminder dataset, which we’ve looked at before. Let’s say we just want to compute some means. The dataset includes life expectancy (lifeExp), population size (pop), and per capita GDP (gdpPercap), so we’ll just look at the averages across the dataset for each of the variables.
library(tidyverse)
library(gapminder)
data(gapminder)
# variables we're interested in
variables <- c("lifeExp", "pop", "gdpPercap")
We have 3 things (here, variables) to loop over. So we could create a for loop structure like this:
# for each of the 3 values (we'll call the one we want i)
for (i in 1:3) {
  # do this thing (usually with one of those values)
  # save that thing somewhere
} # start over as long as there are more values to deal with
But that requires us to know a priori that there are exactly three things we want to do. What if you decide to add or take away a variable? A safer way is to take the value 3 directly from the length of the set of variables. You could write this as 1:length(variables), but an even safer way is the seq_along() function, which does the same thing but behaves better if there’s nothing in variables (in that case 1:length(variables) counts down from 1 to 0 instead of skipping the loop):
for (i in seq_along(variables)) {
  print(i) # to see that we're really getting 1:3
}
## [1] 1
## [1] 2
## [1] 3
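To see why seq_along() is the safer choice, here is a small sketch of what happens when there is nothing to loop over. The names no_variables, count_colon, and count_seq are made up for this illustration.

```r
# A hypothetical empty set of variables (for illustration only)
no_variables <- character(0)

# 1:length() counts DOWN from 1 to 0, so the loop body runs twice
count_colon <- 0
for (i in 1:length(no_variables)) {
  count_colon <- count_colon + 1
}

# seq_along() returns an empty sequence, so the loop body never runs
count_seq <- 0
for (i in seq_along(no_variables)) {
  count_seq <- count_seq + 1
}

count_colon  # 2
count_seq    # 0
```

With the 1:length() version, the loop would go on to index variables[1] and variables[0], neither of which holds a real variable name, so the mistake can be hard to spot.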
OK, now we have to replace the middle part of our loop with what we actually want to happen with the values 1, 2, and 3.
for (i in seq_along(variables)) {
  var_of_interest <- variables[i] # choose the i'th variable from our list
  col_of_interest <- gapminder[[var_of_interest]] # extract that column
  print(mean(col_of_interest, na.rm = TRUE)) # print the mean
}
## [1] 59.47444
## [1] 29601212
## [1] 7215.327
We often use the letter i to denote the value we’re iterating over, though you certainly don’t have to. It often refers to the iteration number itself, because we often want to reuse that number.
It is important for speed in R to create sufficient space for whatever you’re outputting before you begin. For example, we printed the results above, but usually we’d want to save them somewhere. Before we run the for loop, we want to create an empty vector or matrix to store these numbers. This vector will be indexed by i too, to go along with the variables.
You could use something like this, where you create an empty vector of the type of output you expect (here, numbers):
means <- vector("numeric", length(variables))
means
## [1] 0 0 0
This creates a vector of 0s. I don’t especially like this as the default for numeric vectors because it makes it hard to tell if you’ve made a mistake (you may expect some 0s in your output). I like to start off with NA:
means <- rep(NA, length(variables))
means
## [1] NA NA NA
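One detail worth knowing: a bare NA is a logical value, so rep(NA, ...) starts out as a logical vector, and R quietly converts it the first time you store a number in it. That works fine, but if you’d rather be explicit you can use NA_real_, the numeric version of NA. A minimal sketch (the object names here are made up for the example):

```r
n <- 3

means_logical <- rep(NA, n)        # starts out as a logical vector
means_numeric <- rep(NA_real_, n)  # starts out as a numeric (double) vector

type_before <- typeof(means_logical)
type_before            # "logical"
typeof(means_numeric)  # "double"

# storing a number converts the logical vector to double
means_logical[1] <- 59.5
type_after <- typeof(means_logical)
type_after             # "double"
```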
Now we just have to replace those missing values with those that we calculate.
for (i in seq_along(variables)) {
  var_of_interest <- variables[i]
  col_of_interest <- gapminder[[var_of_interest]]
  # now store each of the values in its correct position
  means[i] <- mean(col_of_interest, na.rm = TRUE)
}
means
## [1] 5.947444e+01 2.960121e+07 7.215327e+03
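To see why it helps to set up the storage first, this sketch fills a vector two ways: by growing it one element at a time with c(), and by filling a vector that was created ahead of time. The function names grow() and prealloc() and the value of n are invented for this example, and the timings depend on your machine, so none are shown.

```r
n <- 10000

# grow the vector one element at a time (a new, longer vector is built each time)
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# create the space first, then fill it in
prealloc <- function(n) {
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

# both approaches give the same answer
same_answer <- identical(grow(n), prealloc(n))
same_answer

# compare how long each takes on your machine
system.time(grow(n))
system.time(prealloc(n))
```

Try increasing n to see how the gap between the two changes.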
Notice that when we create a for loop, if we want anything to come out of it directly (i.e., be printed to the console), we have to use the print() function. But we can also look at objects that we create or fill during the loop to see the output.
We don’t have to use i, however, and what we’re iterating over doesn’t need to be a number. For example, we could iterate directly over the variable names:
for (var in variables) {
  print(var) # to show what we get
}
## [1] "lifeExp"
## [1] "pop"
## [1] "gdpPercap"
for (var in variables) {
  col_of_interest <- gapminder[[var]]
  print(mean(col_of_interest, na.rm = TRUE))
}
## [1] 59.47444
## [1] 29601212
## [1] 7215.327
We get the same answer and skip a line of code! However, since we haven’t used numbers, it just means we need to think a little harder about how to store (as opposed to print) our output.
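One possible way to store the output when looping over names is to give the storage vector names too, and then fill it by name instead of by position. This is only a sketch: toy is a tiny made-up data frame standing in for gapminder, and its values are invented.

```r
# a tiny made-up data frame standing in for gapminder (values are invented)
toy <- data.frame(
  lifeExp   = c(50, 60, 70),
  pop       = c(1e6, 2e6, 3e6),
  gdpPercap = c(1000, 2000, 6000)
)
variables <- c("lifeExp", "pop", "gdpPercap")

# an empty numeric vector whose elements are named after the variables
means <- setNames(rep(NA_real_, length(variables)), variables)

for (var in variables) {
  # use the name, rather than a number, to choose where to store the result
  means[[var]] <- mean(toy[[var]], na.rm = TRUE)
}
means
```

A side benefit is that the printed result is labeled, so you don’t have to remember which mean belongs to which variable.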
Find the mean GDP per capita within each year and store the results. (There are easier ways to do this that we’ve already seen, so if you finish the for loop, try to recall another way!)
years <- unique(gapminder$year)
means <- rep(NA, length(years))
for (i in seq_along(years)) {
  year_dat <- filter(gapminder, year == years[i])
  gdp_dat <- year_dat[["gdpPercap"]]
  means[i] <- mean(gdp_dat, na.rm = TRUE)
}
means
## [1] 3725.276 4299.408 4725.812 5483.653 6770.083 7313.166 7518.902
## [8] 7900.920 8158.609 9090.175 9917.848 11680.072
(Easier solution without a for loop)
gapminder %>%
  group_by(year) %>%
  summarise(mean(gdpPercap))
## # A tibble: 12 x 2
## year `mean(gdpPercap)`
## <int> <dbl>
## 1 1952 3725.
## 2 1957 4299.
## 3 1962 4726.
## 4 1967 5484.
## 5 1972 6770.
## 6 1977 7313.
## 7 1982 7519.
## 8 1987 7901.
## 9 1992 8159.
## 10 1997 9090.
## 11 2002 9918.
## 12 2007 11680.
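The apply family mentioned at the start of the lesson can do the same job in base R. Here is a hedged sketch using tapply() and sapply() on a small made-up data frame (toy and its values are invented for illustration; they are not the gapminder numbers).

```r
# a tiny made-up data frame standing in for gapminder (values are invented)
toy <- data.frame(
  year      = c(1952, 1952, 2007, 2007),
  gdpPercap = c(1000, 3000, 5000, 9000)
)

# tapply() splits gdpPercap by year and applies mean() to each group
year_means <- tapply(toy$gdpPercap, toy$year, mean)
year_means

# sapply() works like a for loop that collects the results for you
years <- unique(toy$year)
year_means2 <- sapply(years, function(y) mean(toy$gdpPercap[toy$year == y]))
year_means2
```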
While loops let you continue to iterate until some condition is met, instead of a set number of times. These are often used when you’re trying to get the amount of error below a certain value, for example.
Here’s an example from R for Data Science that counts how many flips it takes to get three heads in a row. Because the flips are random, the number you get will vary from run to run:
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
## [1] 3
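As a sketch of the other use mentioned above, getting the error below a certain value, here is a while loop that keeps improving a guess at the square root of 2 until the guess is close enough. The averaging update is the classic Babylonian (Newton) method; the variable names and the tolerance are choices made for this example.

```r
target <- 2          # the number whose square root we want
guess <- 1           # starting guess
tolerance <- 1e-8    # stop once the error is smaller than this
iterations <- 0

# keep going until guess^2 is within the tolerance of the target
while (abs(guess^2 - target) > tolerance) {
  guess <- (guess + target / guess) / 2  # average the guess with target / guess
  iterations <- iterations + 1
}

guess       # compare with sqrt(2)
iterations  # how many updates it took
```

Unlike a for loop, we never had to say in advance how many times to repeat the update.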
Can you calculate the mean of each of the gapminder variables within each of the years using two for loops (a for loop inside another for loop) and store them in a matrix? Make sure you don’t use the same variable to index both the year and the variable! (Often people will use i for one and j for the other, for example.)
years <- unique(gapminder$year)
variables <- c("lifeExp", "pop", "gdpPercap")
means <- matrix(NA, ncol = length(variables), nrow = length(years))
for (i in seq_along(years)) {
  year_dat <- filter(gapminder, year == years[i])
  for (j in seq_along(variables)) {
    var_of_interest <- variables[j]
    col_of_interest <- year_dat[[var_of_interest]]
    means[i, j] <- mean(col_of_interest, na.rm = TRUE)
  }
}
means
## [,1] [,2] [,3]
## [1,] 49.05762 16950402 3725.276
## [2,] 51.50740 18763413 4299.408
## [3,] 53.60925 20421007 4725.812
## [4,] 55.67829 22658298 5483.653
## [5,] 57.64739 25189980 6770.083
## [6,] 59.57016 27676379 7313.166
## [7,] 61.53320 30207302 7518.902
## [8,] 63.21261 33038573 7900.920
## [9,] 64.16034 35990917 8158.609
## [10,] 65.01468 38839468 9090.175
## [11,] 65.69492 41457589 9917.848
## [12,] 67.00742 44021220 11680.072
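The matrix above is labeled only with [1,] and [,1], so you have to remember which row is which year and which column is which variable. One option is to give the matrix row and column names when you create it. This is a sketch on a small made-up data frame (toy and its values are invented), using base R subsetting in place of filter() so that it runs without any packages.

```r
# a tiny made-up data frame standing in for gapminder (values are invented)
toy <- data.frame(
  year      = c(1952, 1952, 2007, 2007),
  lifeExp   = c(40, 50, 60, 80),
  gdpPercap = c(1000, 3000, 5000, 9000)
)
years <- unique(toy$year)
variables <- c("lifeExp", "gdpPercap")

# an empty matrix with years as row names and variables as column names
means <- matrix(
  NA_real_,
  nrow = length(years), ncol = length(variables),
  dimnames = list(as.character(years), variables)
)

for (i in seq_along(years)) {
  year_dat <- toy[toy$year == years[i], ]
  for (j in seq_along(variables)) {
    means[i, j] <- mean(year_dat[[variables[j]]], na.rm = TRUE)
  }
}
means

# the names also let you look up a single value directly
means["2007", "gdpPercap"]
```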