Today's lesson follows part of Chapter 21 in R for Data Science. Today we'll look at *iteration*. Often when you're programming, you have to do something over and over. Maybe you're running a simulation and you need to calculate a statistic over and over again in different datasets. Maybe you're doing an analysis and you need to redo it over and over again with different outcomes.

You may have heard you should avoid for loops in R because they're slow, but that's simply not the case anymore. You can use other options such as the `apply` or `map` families of functions, which may make your code easier to read and write, but for loops are often simpler to understand at first. We've already looked at one tool that helps with repetition: writing your own functions. Today we'll look at for loops and other types of loops.

A for loop follows this basic pattern:

- for all of these values of something
- do this thing (usually with one of those values)
- save that thing somewhere
- start over as long as there are more values to deal with

We'll use the gapminder dataset, which we've looked at before. Let's say we just want to compute some means. The dataset includes life expectancy (`lifeExp`), population size (`pop`), and per capita GDP (`gdpPercap`), so we'll just look at the averages across the dataset for each of the variables. This has no scientific meaning; it's just an example!

```
library(tidyverse)
library(gapminder)
data(gapminder)
# variables we're interested in
variables <- c("lifeExp", "pop", "gdpPercap")
```

We have 3 things (here, variables) to loop over. So we could create a for loop structure like this:

```
# for each of the 3 values (we'll call the one we want i)
for (i in 1:3) {
  # do this thing (usually with one of those values)
  # save that thing somewhere
} # start over as long as there are more values to deal with
```

but that requires us to know a priori that there are only three things we want to do. What if you decide to add or take away a variable? A safer way is to take the value 3 directly from the length of the set of variables. You could write this as `1:length(variables)`, but an even safer way is the `seq_along()` function, which does the same thing (it just has better behavior if there's nothing in `variables`):

```
for (i in seq_along(variables)) {
  print(i) # to see that we're really getting 1:3
}
```

```
## [1] 1
## [1] 2
## [1] 3
```

OK, now we have to replace the middle part of our loop with what we actually want to happen with the values 1, 2, and 3.

```
for (i in seq_along(variables)) {
  var_of_interest <- variables[i] # choose the i'th variable from our list
  col_of_interest <- gapminder[[var_of_interest]] # extract that column
  print(mean(col_of_interest, na.rm = TRUE)) # print the mean
}
```

```
## [1] 59.47444
## [1] 29601212
## [1] 7215.327
```

We often use the letter `i` to denote the value we're iterating over, though you certainly don't have to. It often refers to the iteration number itself, because we often want to reuse that number. It **is** important for speed in R to create sufficient space for whatever you're outputting before you begin. For example, we printed the results above, but usually we'd want to save them somewhere. Before we run the for loop, we want to create an empty vector or matrix to store these numbers. This vector will be indexed by `i` too, to go along with the variables.

You could use something like this, where you create an empty vector of the type of output you expect (here, numbers):

```
means <- vector("numeric", length(variables))
means
```

`## [1] 0 0 0`

This creates a vector of 0s. I don't especially like this as the default for numeric vectors because it makes it hard to tell if you've made a mistake (you may expect some 0s in your output). I like to start off with `NA`:

```
means <- rep(NA, length(variables))
means
```

`## [1] NA NA NA`

Now we just have to replace those missing values with those that we calculate.

```
for (i in seq_along(variables)) {
  var_of_interest <- variables[i]
  col_of_interest <- gapminder[[var_of_interest]]
  # now store each of the values in its correct position
  means[i] <- mean(col_of_interest, na.rm = TRUE)
}
means
```

`## [1] 5.947444e+01 2.960121e+07 7.215327e+03`
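
The advice about creating space before the loop is easiest to appreciate by trying the alternative. The sketch below is not from the lesson: `grow_result()` and `fill_result()` are hypothetical helper names, and `n` is an arbitrary size. Both compute the same square roots, one by growing a vector on every pass and one by filling a preallocated vector.

```r
n <- 10000  # hypothetical number of iterations, chosen only for illustration

# Growing: the vector is extended by one element on every pass
grow_result <- function(n) {
  out <- c()
  for (i in seq_len(n)) {
    out <- c(out, sqrt(i))
  }
  out
}

# Preallocating: the full-length vector exists before the loop starts
fill_result <- function(n) {
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    out[i] <- sqrt(i)
  }
  out
}

# Both approaches give the same answer
identical(grow_result(n), fill_result(n))

# Timings vary by machine and R version, so run these yourself to compare
system.time(grow_result(n))
system.time(fill_result(n))
```

The results are identical, so the only difference is how much copying R has to do along the way. Try increasing `n` to see how the gap changes on your machine.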

Notice that when we create a for loop, if we want anything to come out of it directly (i.e., be printed to the console), we have to use the `print()` function. But we can access objects that we create/fill during the loop to see output too.

We don't have to use `i`, and what we're iterating over doesn't need to be a number, however. For example, we could iterate directly over the variable names:

```
for (var in variables) {
  print(var) # to show what we get
}
```

```
## [1] "lifeExp"
## [1] "pop"
## [1] "gdpPercap"
```

```
for (var in variables) {
  col_of_interest <- gapminder[[var]]
  print(mean(col_of_interest, na.rm = TRUE))
}
```

```
## [1] 59.47444
## [1] 29601212
## [1] 7215.327
```

We get the same answer and skip a line of code! However, since we haven't used numbers, it just means we need to think a little harder about how to store (as opposed to print) our output.
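
One possible way to store results when looping over names is to give the output vector names and index it by name. This is a sketch, not the only option. The data frame `toy` is a made-up stand-in for gapminder so the example runs on its own, and it redefines `variables` and `means`.

```r
# Hypothetical toy data standing in for gapminder (values are made up)
toy <- data.frame(
  lifeExp   = c(50, 60, 70),
  pop       = c(1000, 2000, 3000),
  gdpPercap = c(500, 1500, 2500)
)
variables <- c("lifeExp", "pop", "gdpPercap")

# Preallocate as before, then label each slot with its variable name
means <- rep(NA_real_, length(variables))
names(means) <- variables

for (var in variables) {
  # index the output by name instead of by position
  means[var] <- mean(toy[[var]], na.rm = TRUE)
}
means
```

A side benefit is that the output is labeled, so you don't have to remember which position belongs to which variable.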

Find the mean GDP per capita within each year and store the results. (There are easier ways to do this that we've already seen, so if you finish the for loop, try to recall another way!)

```
years <- unique(gapminder$year)
means <- rep(NA, length(years))
for (i in seq_along(years)) {
  year_dat <- filter(gapminder, year == years[i])
  gdp_dat <- year_dat[["gdpPercap"]]
  means[i] <- mean(gdp_dat, na.rm = TRUE)
}
means
```

```
##  [1]  3725.276  4299.408  4725.812  5483.653  6770.083  7313.166  7518.902
##  [8]  7900.920  8158.609  9090.175  9917.848 11680.072
```

(Easier solution without for loop)

```
gapminder %>%
  group_by(year) %>%
  summarise(mean(gdpPercap))
```

```
## # A tibble: 12 x 2
##     year `mean(gdpPercap)`
##    <int>             <dbl>
##  1  1952             3725.
##  2  1957             4299.
##  3  1962             4726.
##  4  1967             5484.
##  5  1972             6770.
##  6  1977             7313.
##  7  1982             7519.
##  8  1987             7901.
##  9  1992             8159.
## 10  1997             9090.
## 11  2002             9918.
## 12  2007            11680.
```
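
The `apply` family mentioned at the start of the lesson offers another loop-free route in base R. The sketch below uses a small made-up data frame (`toy`) rather than gapminder so that it runs without any packages. `tapply()` and `sapply()` are base R functions.

```r
# Hypothetical toy data standing in for gapminder (values are made up)
toy <- data.frame(
  year      = c(1952, 1952, 1957, 1957),
  lifeExp   = c(40, 50, 45, 55),
  gdpPercap = c(100, 300, 200, 600)
)

# tapply() splits the first argument by the second and applies the function
year_means <- tapply(toy$gdpPercap, toy$year, mean, na.rm = TRUE)
year_means

# sapply() applies the same function to each selected column
overall_means <- sapply(toy[c("lifeExp", "gdpPercap")], mean, na.rm = TRUE)
overall_means
```

Both calls return named results, so the grouping value or column name travels with each mean.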

While loops let you continue to iterate until some condition is met, instead of a set number of times. These are often used when you're trying to get the amount of error below a certain value, for example.
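
As a sketch of the "error below a certain value" case, the loop below approximates a square root by repeatedly refining a guess until it is close enough. The target, tolerance, and starting guess are arbitrary choices for illustration, and the update rule is the classic Babylonian (Newton) method rather than anything specific to this lesson.

```r
target <- 2          # hypothetical example: approximate sqrt(2)
tolerance <- 1e-8    # stop once the error falls below this value
guess <- 1           # starting guess
iterations <- 0

# Keep refining while the squared guess is still too far from the target
while (abs(guess^2 - target) > tolerance) {
  guess <- (guess + target / guess) / 2  # average the guess with target / guess
  iterations <- iterations + 1
}

guess       # compare with sqrt(2)
iterations  # how many passes the loop needed
```

We don't know in advance how many passes are needed, which is exactly the situation where a while loop fits better than a for loop.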

Here's an example from R for Data Science to find how many tries it takes to get three heads in a row:

```
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```

`## [1] 3`
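
Because the flips are random, the number you get will usually differ from run to run. If you want a repeatable result, one option is to wrap the simulation in a function and set a seed first. The function name `flips_until_three_heads()` and the seed value are made up for this sketch.

```r
# Wrap the simulation from above in a function so it can be rerun easily
flips_until_three_heads <- function() {
  flip <- function() sample(c("T", "H"), 1)
  flips <- 0
  nheads <- 0
  while (nheads < 3) {
    if (flip() == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0
    }
    flips <- flips + 1
  }
  flips
}

set.seed(123)  # arbitrary seed; any fixed value makes the run repeatable
first_run <- flips_until_three_heads()

set.seed(123)  # resetting the seed replays the same sequence of flips
second_run <- flips_until_three_heads()

identical(first_run, second_run)
```

The smallest possible answer is 3, which happens only when the first three flips are all heads.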

Can you calculate the mean of each of the gapminder variables within each of the years using two for loops (a for loop inside another for loop) and store them in a matrix? Make sure you don't use the same variable to index both the year and the variable! (Often people will use `i` for one and `j` for the other, for example.)

```
years <- unique(gapminder$year)
variables <- c("lifeExp", "pop", "gdpPercap")
means <- matrix(NA, ncol = length(variables), nrow = length(years))
for (i in seq_along(years)) {
  year_dat <- filter(gapminder, year == years[i])
  for (j in seq_along(variables)) {
    var_of_interest <- variables[j]
    col_of_interest <- year_dat[[var_of_interest]]
    means[i, j] <- mean(col_of_interest, na.rm = TRUE)
  }
}
means
```

```
##           [,1]     [,2]      [,3]
##  [1,] 49.05762 16950402  3725.276
##  [2,] 51.50740 18763413  4299.408
##  [3,] 53.60925 20421007  4725.812
##  [4,] 55.67829 22658298  5483.653
##  [5,] 57.64739 25189980  6770.083
##  [6,] 59.57016 27676379  7313.166
##  [7,] 61.53320 30207302  7518.902
##  [8,] 63.21261 33038573  7900.920
##  [9,] 64.16034 35990917  8158.609
## [10,] 65.01468 38839468  9090.175
## [11,] 65.69492 41457589  9917.848
## [12,] 67.00742 44021220 11680.072
```
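
The matrix above is only labeled by position, so you have to remember which row is which year and which column is which variable. One possible refinement is to pass `dimnames` when creating the matrix. The sketch below uses a small made-up data frame (`toy`) in place of gapminder so it runs on its own, and it redefines `years`, `variables`, and `means`.

```r
# Hypothetical toy data standing in for gapminder (values are made up)
toy <- data.frame(
  year      = c(1952, 1952, 1957, 1957),
  lifeExp   = c(40, 50, 45, 55),
  gdpPercap = c(100, 300, 200, 600)
)
years <- unique(toy$year)
variables <- c("lifeExp", "gdpPercap")

# dimnames labels the rows with years and the columns with variable names
means <- matrix(NA_real_, nrow = length(years), ncol = length(variables),
                dimnames = list(as.character(years), variables))

for (i in seq_along(years)) {
  year_dat <- toy[toy$year == years[i], ]
  for (j in seq_along(variables)) {
    means[i, j] <- mean(year_dat[[variables[j]]], na.rm = TRUE)
  }
}
means

# Look up a single result by name rather than by position
means["1957", "gdpPercap"]
```

The loops still fill the matrix by position with `i` and `j`, but afterwards you can pull out values by year and variable name.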