This evening I was reading Norman Matloff’s excellent book, The Art of R Programming. He mentions:

If you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it’s better to allocate a large matrix in the first place.

The reason for this is that every time the matrix grows, R must allocate fresh memory for a new matrix and copy all of the existing contents into it, and this gets expensive. But just how much of a performance concern is it? To answer that question I wrote some simple example code.

# The TBind function grows the matrix one row at a time,
# allocating a new matrix with the same name n times, which
# makes it very inefficient.
TBind <- function(n) {
  t1 <- Sys.time()
  x <- rnorm(10)
  for (i in 2:n) {
    x <- rbind(x, rnorm(10))
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

# The TAllocate function instead preallocates the entire matrix,
# and then goes back in and fills in each row.
TAllocate <- function(n) {
  t1 <- Sys.time()
  x <- matrix(nrow = n, ncol = 10)
  for (i in 1:n) {
    x[i, ] <- rnorm(10)
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Check out the results when building a matrix with 10 columns and one hundred thousand rows: about 1 second versus 22 minutes! The gap is so large because each call to rbind copies every existing row before appending the new one, so the total work grows quadratically with the number of rows.

> TAllocate(1e5)
Time difference of 1.148581 secs
> TBind(1e5)
Time difference of 22.04133 mins

And how long does it take to create this matrix the proper way, without using a for loop at all?

# TEff builds the whole matrix in a single vectorized call,
# with no loop at all.
TEff <- function(n) {
  t1 <- Sys.time()
  x <- matrix(runif(10 * n), ncol = 10)
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Which gives the following result:

> TEff(1e5)
Time difference of 0.206775 secs

Thanks to Matt Leonawicz for the blog post on how to use Sys.time().
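As a side note, base R also ships with system.time(), which wraps the same start-stop pattern used above and reports user, system, and elapsed times directly. A minimal sketch of timing the preallocation approach with it (using a smaller n of 1e4, chosen here just to keep the run quick) might look like:

```r
# Time the preallocated loop with base R's system.time(),
# which returns user, system, and elapsed (wall-clock) seconds.
n <- 1e4
timing <- system.time({
  x <- matrix(nrow = n, ncol = 10)
  for (i in 1:n) {
    x[i, ] <- rnorm(10)
  }
})
elapsed <- timing[["elapsed"]]
print(elapsed)
```

This avoids managing the two Sys.time() calls by hand, at the cost of returning a proc_time object rather than a difftime.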