Memory for RStudio Server on an AWS micro instance

Lately I have been in the habit of using RStudio Server hosted on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance. It’s convenient to be able to pick my work up right where I left off, whether I’m in the office or at home. At work it lets me sidestep the corporate firewall, and I can run large jobs while keeping my laptop free for other things. Most importantly, it prepares me for scaling up when it comes time for a big project. If you are interested in getting started with EC2, I suggest Louis Aslett’s excellent site.

The natural choice when starting out is the free micro instance, which includes 615 MB of memory. Depending on how you use R, this may not meet your requirements. Amazon provides instructions suggesting that long-running jobs in particular are not well suited to a micro instance.

Today, for the first time, RStudio Server showed me an error message warning that I was out of memory.

Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE) : 
  cannot popen '/usr/bin/which 'svn' 2>/dev/null', probable reason 'Cannot allocate memory'

Or if you open a shell window you’ll get this popup.

[Screenshot: out-of-memory popup in the RStudio Server shell window]

I logged out, but that did not free up the memory. I ended up killing the process from the Linux shell, and then jumping back on. After an hour of typical R work I issued the shell command to check memory:

~$ free -m

It said that I had 65 MB free. I restarted R from within RStudio Server and checked again; this time 298 MB were free.

[Screenshot: output of free -m after restarting R]
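
If you prefer to stay inside R, you can run the same check without leaving the console. This is only a convenience sketch; it shells out to the same free utility, so it works only on a Linux host such as an EC2 instance.

# Check free memory from within R by shelling out to free -m
mem <- system("free -m", intern = TRUE)
cat(mem, sep = "\n")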

So does this mean that I always have to restart R to clear my memory? According to this answer at StackExchange, yes.
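
Short of restarting, the closest you can get from within R is to drop the objects you no longer need and then trigger garbage collection. A minimal sketch (note that gc() frees memory inside R but does not always hand it back to the operating system, which is why restarting remains the reliable fix):

# Clear the global environment and run the garbage collector
rm(list = ls())
gc()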


Efficient R – Preallocating Memory for Matrices and Vectors

This evening I was reading Norman Matloff’s excellent book, The Art of R Programming. He mentions,

If you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it’s better to allocate a large matrix in the first place.

The reason is that each time the matrix grows, R has to allocate a new, larger block of memory and copy the existing contents into it, so the total amount of copying grows roughly quadratically with the number of rows. But just how much of a performance concern is this? To answer that question, I wrote some simple example code.

# The TBind function grows the matrix one row at a time with
# rbind(), so R allocates a new, larger matrix and copies the
# old contents on every iteration, which makes it very inefficient.

TBind <- function(n) {
  t1 <- Sys.time()
  x <- rnorm(10)
  for (i in 2:n) {
    x <- rbind(x, rnorm(10))
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

# The TAllocate function instead preallocates the entire matrix,
# and then goes back in and fills in each row.

TAllocate <- function(n) {
  t1 <- Sys.time()
  x <- matrix(nrow = n, ncol = 10)
  for (i in 1:n) {
    x[i, ] <- rnorm(10)
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Check out the results when building a matrix with 10 columns and one hundred thousand rows: about 1 second versus 22 minutes!

> TAllocate(1e5)
Time difference of 1.148581 secs
> TBind(1e5)
Time difference of 22.04133 mins

And how long does it take to create the same matrix the fully vectorized way, without a for loop at all?

# TEff builds the whole matrix in a single vectorized call, with no
# loop at all. (It uses runif() rather than rnorm(); the timing is comparable.)
TEff <- function(n) {
  t1 <- Sys.time()
  x <- matrix(runif(10 * n), ncol = 10)
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Which gives the following result:

> TEff(1e5)
Time difference of 0.206775 secs
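
The same principle applies to plain vectors. As a rough sketch (the TGrow and TPrealloc names below are just illustrative, not from Matloff's book), growing a vector with c() behaves like the rbind() version, while preallocating with numeric() behaves like the preallocated matrix:

# TGrow appends one element at a time with c(), copying the vector each pass.
TGrow <- function(n) {
  t1 <- Sys.time()
  x <- numeric(0)
  for (i in 1:n) {
    x <- c(x, rnorm(1))
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

# TPrealloc creates the full-length vector first and fills it in place.
TPrealloc <- function(n) {
  t1 <- Sys.time()
  x <- numeric(n)
  for (i in 1:n) {
    x[i] <- rnorm(1)
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}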

Thanks to Matt Leonawicz for the blog post on how to use Sys.time().
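
As an aside, base R also provides system.time(), which times an expression directly and reports user, system, and elapsed time, so the same comparisons could be made without the Sys.time() bookkeeping. For example:

# Time the preallocated version using base R's system.time()
system.time(TAllocate(1e5))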