Efficient R – Preallocating Memory for Matrices and Vectors

This evening I was reading Norman Matloff’s excellent book, The Art of R Programming. He mentions,

If you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it’s better to allocate a large matrix in the first place.

The reason is that each call to rbind() or cbind() allocates a new, larger matrix and copies the existing contents into it, and repeating that inside a loop gets expensive. But just how much of a performance concern is this? To answer that question, I wrote a simple example.

# The TBind function grows the matrix one row at a time with rbind().
# Every call allocates a new matrix and copies the old one into it,
# which makes it very inefficient.

TBind <- function(n) {
  t1 <- Sys.time()
  x <- rnorm(10)
  # seq_len(n - 1) is empty when n is 1, whereas 2:n would count down to 1
  for (i in seq_len(n - 1)) {
    x <- rbind(x, rnorm(10))
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

# The TAllocate function instead preallocates the entire matrix,
# and then goes back in and fills in each row.

TAllocate <- function(n) {
  t1 <- Sys.time()
  # NA_real_ makes the matrix numeric from the start
  x <- matrix(NA_real_, nrow = n, ncol = 10)
  for (i in seq_len(n)) {
    x[i, ] <- rnorm(10)
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Check out the results when building a matrix with 10 columns and one hundred thousand rows: about 1 second versus 22 minutes!

> TAllocate(1e5)
Time difference of 1.148581 secs
> TBind(1e5)
Time difference of 22.04133 mins
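
The size of the gap follows from how much copying rbind() does. Here is a rough count, assuming each rbind() call copies every existing row into the newly allocated matrix. The helper name rows_copied is made up for this illustration, and the count ignores allocation and garbage collection overhead.

```r
# Rows copied when growing a matrix one row at a time with rbind():
# the call that adds row i must first copy the i - 1 rows already there.
# (rows_copied is a hypothetical helper, not part of the original code.)
rows_copied <- function(n) {
  sum(seq_len(n) - 1)  # 0 + 1 + ... + (n - 1), which equals n * (n - 1) / 2
}

rows_copied(1e5)  # 4999950000, i.e. about 5 billion row copies
```

The preallocated version writes each of the n rows exactly once, so its work grows linearly with n while the rbind() version grows quadratically.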

And how long does it take to create this matrix the proper way, without using a for loop?

TEff <- function(n) {
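  t1 <- Sys.time()
  # Note: this draws from runif() rather than rnorm() as in the functions
  # above, so the comparison is not exactly like for like.
  x <- matrix(runif(10 * n), ncol = 10)
  t2 <- Sys.time()
  return(difftime(t2, t1))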
}

Which gives the following result:

> TEff(1e5)
Time difference of 0.206775 secs
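
If you want to reproduce the comparison without waiting 22 minutes, something like the following sketch may help. It is not the code used for the timings above: the build_* helpers are hypothetical names, all three draw from rnorm(), and the timing uses base R's system.time() instead of Sys.time(). The value of n is an arbitrary small choice.

```r
# Hypothetical helpers (not from the original post). Each one returns the
# matrix it builds so that the timing can be done from outside.
build_bind <- function(n) {
  x <- matrix(rnorm(10), nrow = 1)
  for (i in seq_len(n - 1)) {
    x <- rbind(x, rnorm(10))   # reallocates and copies on every iteration
  }
  x
}

build_alloc <- function(n) {
  x <- matrix(NA_real_, nrow = n, ncol = 10)
  for (i in seq_len(n)) {
    x[i, ] <- rnorm(10)        # fills row i of the preallocated matrix
  }
  x
}

build_vec <- function(n) {
  matrix(rnorm(10 * n), ncol = 10)  # one call, no loop
}

n <- 2000  # small enough that even the rbind() version finishes quickly
t_bind  <- system.time(m_bind  <- build_bind(n))[["elapsed"]]
t_alloc <- system.time(m_alloc <- build_alloc(n))[["elapsed"]]
t_vec   <- system.time(m_vec   <- build_vec(n))[["elapsed"]]

# Elapsed seconds for each approach; exact numbers depend on the machine
print(c(bind = t_bind, alloc = t_alloc, vec = t_vec))
```

Returning the matrix from each helper also makes it easy to check that all three approaches produce a result with the same dimensions before trusting the timings.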

Thanks to Matt Leonawicz for the blog post on how to use Sys.time().
