Euler Problem 1 In R

6 minute read

The problem: If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

A simple solution using a for loop

A simple solution would be to collect all multiples of 3 and 5 that are less than 1000 by looping over all integers from 1 to 999. For each integer we encounter in the loop, we’ll need to check whether the integer is divisible by 3 or 5: the R modulus operator %% is handy here, because a %% b returns \(a\textrm{mod}(b)\), which if zero, indicates that \(a\) is divisible by \(b\).

sum_function <- function(nmax){
  # this is where we'll collect the sum
  output_sum <- 0
  # loop over over number between 1 and the maximum specified
  for(i in 1:(nmax - 1)){
    # if the remainder of dividing i by 3 or 5 is 0, then
    # add to the running sum
    if(i %% 3 == 0 | i %% 5 == 0) output_sum <- output_sum + i
  }
  # return the sum
  output_sum
}

sum_function accepts the argument nmax which is the postive integer below which we want to sum multiples of 3 and 5. Let’s check that this works for nmax = 10 and nmax = 1000 (we know the result of the former should be 23, since only 3, 5, 6 and 9 are eligible).

sum_function(10)
## [1] 23
sum_function(1000)
## [1] 233168

This seems ok. Although the loop is easy to understand, it is inefficient. This is because we already know that the multiples of 3 and 5 below some bound form a regular sequence, and we therefore actually know in advance that we don’t need to check most of the integers between 1 and 1000. In fact, we can avoid the loop altogether if we can construct the sequences directly, which we can do very easily in R using seq.

A better solution using seq

If we can generate the sequence of integers divisible by 3, and a sequence divisible by 5, such that all values are less than 1000, then we have what we need to solve the problem. However! We need to be careful to avoid double counting integers that are divisible by both 3 and 5 - as these will appear in both sequences, so we can’t simply add up both sequences. One strategy would be to construct the sequence of integers divisible by 3, and another sequence of integers divisible by 5, sum these, and then subtract all integers they have in common. The integers in common must be divisible by 3 and 5, which means that they must be the sequence of integers divisible by 15 - this is also easy to compute. The solution now looks like this:

sum_function2 <- function(nmax){
  # sequence starting at 3 and increasing in steps of 3
  seq_3  <- seq(3, nmax - 1, by = 3)
  # sequence starting at 5 and increasing in steps of 5
  seq_5  <- seq(5, nmax - 1, by = 5)
  # sequence starting at 15 and increasing in steps of 15
  # we don't need to do this if nmax - 1 < 15 - hence the if statement
  if(nmax - 1 > 15) seq_15 <- seq(15, nmax - 1, by = 15) else seq_15 = 0
  # add them up
  sum(seq_3) + sum(seq_5) - sum(seq_15)
}

Which gives the same result:

sum_function2(10)
## [1] 23
sum_function2(1000)
## [1] 233168

So the solution is the same, but other than looking a bit more elegant, is it actually quicker? We can quickly check the performance difference using the function microbenchmark::microbenchmark:

library(microbenchmark)
microbenchmark(sum_function(1000), sum_function2(1000))
## Unit: microseconds
##                 expr     min      lq     mean  median       uq      max
##   sum_function(1000) 473.997 512.195 675.8950 538.266 770.6480 2043.511
##  sum_function2(1000)  68.359  86.537 119.2168 106.074 131.3385  315.897
##  neval
##    100
##    100

The benchmarking is clear: the sequence approach is over 6 times faster than the loop approach! But can we do even better than this?

A quick mathematical trick

Suppose we have the sum of a sequence of \(n\) integers that are divisible by 3, we can write them as follows: \[\begin{eqnarray*} \sum_{k=1}^n 3k &=& 3 + 6 + 9 + \ldots + 3(n-1) + 3n. \end{eqnarray*}\] If we add the sequence to itself, then the result is \(2\times\) the above. The trick is to add the sequence to itself in reverse order: \[\begin{eqnarray*} 2\sum_{k=1}^n 3k = \sum_{k=1}^n 3k + \sum_{k=n}^1 3k & = & (3 + 3n) + (6 + 3(n-1)) + \ldots + (3(n-1) + 6) + (3n + 3) \\ & = & 3(n + 1) + 3(n + 1) + \ldots + 3(n + 1) + 3(n + 1) \\ & = & 3n(n+1). \end{eqnarray*}\] This means that the sum of the sequence of 3’s is \(\sum_{k=n}^1 3k = 3n(n+1)/2\). Why does that help? It means that instead of working out the sum by generating a sequence of integers that are divisible by 3 and adding them up (like we did in the previous section), we can work out the sum directly using an expression like the one above. By the same argument, similar expressions hold for sequences of 5’s and 15’s: \[\begin{eqnarray*} \sum_{k=1}^n 3k &=& 3n(n+1)/2\\ \sum_{k=1}^n 5k &=& 5n(n+1)/2\\ \sum_{k=1}^n 15k &=& 15n(n+1)/2 \end{eqnarray*}\]

The only problem remaining is the value \(n\): this is the number of values in each sequence, and it doesn’t seem like we know what this should be? In other words, how do we know how many values less than 1000 are divisible by 3? Although it isn’t obvious, it’s easy to compute - \(n\) is the number of times 3 divides 1000 without remainder - sometimes called integer division. In R we can calculate this using the binary operator %/%: for example, \(10 %/% 3\) would give 3; \(20 %/% 3\) would give 6. Putting it all together we have the function

sum_function3 <- function(nmax){
  # find the number of values in each sequence
  n_5    <- (nmax - 1) %/% 5
  n_3    <- (nmax - 1) %/% 3
  n_15   <- (nmax - 1) %/% 15
  # this is just: sum(seq_3) + sum(seq_5) - sum(seq_15)
  (3 * n_3 * (n_3 + 1) / 2) + (5 * n_5 * (n_5 + 1) / 2) - (15 * n_15 * (n_15 + 1) / 2)
}

Which also returns the same output

sum_function3(10)
## [1] 23
sum_function3(1000)
## [1] 233168

Benchmarking these with the previous solutions we have:

microbenchmark(sum_function(1000), sum_function2(1000), sum_function3(1000))
## Unit: microseconds
##                 expr     min       lq      mean   median       uq      max
##   sum_function(1000) 467.017 483.5950 554.29936 518.0600 587.2535 1444.666
##  sum_function2(1000)  67.482  75.6280 101.15819  83.9175  95.1265 1086.534
##  sum_function3(1000)   1.102   1.5765   2.53894   2.2790   3.0660    8.586
##  neval
##    100
##    100
##    100

Woop! That’s quick. Using the sequence trick is \(215\times\) faster than a loop, and \(25\times\) faster than using sequences. The difference get’s more pronounced as nmax increases (note the scale below uses logs)

library(tidyverse)
get_speed <- function(nmax){
  z <- summary(microbenchmark(sum_function(nmax), sum_function2(nmax), sum_function3(nmax)))$mean
}
spd_list     <- lapply(as.list(round(10^(seq(1, 4, length.out = 20)), 0)), get_speed)
df           <- cbind(10^(seq(2, 5, length.out = 20)), do.call("rbind", spd_list))
colnames(df) <- c("nmax", "Loop", "Sequence", "Analytical")
gather(as.tibble(df), type, value, -nmax) %>% 
  ggplot(aes(x = nmax, y = value, group = type, col = type)) +
  coord_trans(x = 'sqrt', y = 'sqrt') + 
  geom_line(size = 1.2) +
  ylab("Computer time (s)") + 
  ggtitle("Average computer time to sum all integers divisble by 3 or 5 that are less than nmax")

Updated: