R Tip: Support Vector Machines

This evening I was playing around with support vector machines in R’s e1071 package. Following the suggestion of this excellent introduction, I tried to tune the SVM and ran into this error:

> tune.mod1 <- tune(svm, fm1, data = train
, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4))
, tunecontrol = tune.control(sampling = "cross", cross = 5)
, type = "C")

Error in double(nr * (nclass - 1)) : invalid 'length' argument

What does that mean? After messing with it for a while, I saw this error:

Error in tune(svm, fm1, data = train, ranges = list(gamma = 2^(-1:1), :
Dependent variable has wrong type!

This put me on the right track. I expected that the tune function would let me keep the dependent variable as an integer vector containing only 1’s and 0’s, since svm() allows this. I also expected that a factor with two levels would work fine. For the tune function, neither is true. The dependent variable MUST BE LOGICAL!
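
For reference, here is a minimal sketch of the fix. The column name y is my own placeholder for whatever variable appears on the left of fm1; the only change from the call above is converting the 0/1 dependent variable to logical before tuning.

# Hypothetical sketch: 'y' stands in for the dependent variable in fm1;
# as.logical() maps 0 to FALSE and 1 to TRUE.
library(e1071)

train$y <- as.logical(train$y)

tune.mod1 <- tune(svm, fm1, data = train
 , ranges = list(gamma = 2^(-1:1), cost = 2^(2:4))
 , tunecontrol = tune.control(sampling = "cross", cross = 5)
 , type = "C")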

Slides from Oct 2013 Talk

On Wednesday, October 2nd, 2013, I gave a brief talk to the Bay Area R User Group on using Amazon’s EC2 cloud computing capabilities. A summary of all the speakers’ talks can be found on Revolution’s blog. You can check out my slides below:

Here is the code that scrapes the conversion rate data. To have it run every hour, open your crontab from the Linux shell with the first command below, then add the second line as an entry:

crontab -e
0 * * * * sudo R CMD BATCH /home/get_conv_rate.R
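
The script itself is linked above; as a rough, purely hypothetical sketch of the pattern (none of these names come from the actual code), a cron-driven R CMD BATCH script usually just fetches a value and appends it, with a timestamp, to a file that the plotting code reads later:

# Hypothetical skeleton of a script run hourly by cron.
# get_rate() and the output path are placeholders, not the real code.
get_rate <- function() NA_real_  # replace with the actual scraping logic

row <- data.frame(time = Sys.time(), rate = get_rate())

# Append one timestamped row per run.
write.table(row, "/home/conv_rate.csv", sep = ",",
            append = TRUE, row.names = FALSE, col.names = FALSE)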

You can find the code that produces the graph here.

Object size in R

EDIT: For a more thorough treatment of this subject, please see Hadley Wickham’s post.

Today I was thinking about factors in R. They should be more memory efficient, right? But how much more memory efficient are they compared to other classes? Here’s the scoop:

> x <- sample(1:3, 1000, replace = TRUE)

> class(x)
[1] "integer"

> object.size(x)
4040 bytes

Assuming 40 bytes are overhead, we see that each integer is stored in 4 bytes, or 32 bits per integer. If one bit stores the sign, then the maximum integer is 2^31 - 1.

> as.integer(2^31 - 1)
[1] 2147483647
> as.integer(2^31)
[1] NA

Sure enough. Back to the original train of thought:

> object.size(as.numeric(x))
8040 bytes

This means that each double precision number is stored as 8 bytes = 64 bits, as expected.

> object.size(as.factor(x))
4576 bytes

Factors have more overhead than integers, but the values are stored as the same 32-bit integers. The savings would be much larger if each value were a long character string.
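
A quick way to confirm that (my own check, not from the original post) is to look at the factor’s underlying type and its levels:

f <- as.factor(x)
typeof(f)    # "integer": the codes are stored as 32-bit integers
levels(f)    # the distinct values are kept just once, as character levels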

> object.size(as.character(x))
8184 bytes

This one is a little more mysterious. Why would a single character take up 8 bytes? I don’t have an answer. Remember x was nothing but a sample of 1:3.

> y <- as.character(x)
> y[y == 1] <- "Here is some long string"
> y[y == 2] <- "And another bunch of letters"
> y[y == 3] <- "Make it even bigger"

> head(y)
[1] "Here is some long string"     "Make it even bigger"
[3] "Here is some long string"     "And another bunch of letters"
[5] "And another bunch of letters" "And another bunch of letters"

> object.size(y)
8256 bytes

So even though we went from strings with 1 character to strings with around 20 characters, the size of the object hardly changed. It’s worth noting that in the comparison y == 1, the number 1 is coerced to the character "1".
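
As a quick illustration of that coercion (my own aside):

# Comparing a character vector with a number coerces the number to
# character first, so this is really the comparison "1" == "1":
"1" == 1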

> object.size(as.factor(y))
4648 bytes

Reassuringly, when converted to a factor the size is again consistent with 32-bit integer values plus some overhead. But what if every element is a random string of 20 characters?

> y2 <- sapply(1:1000, function(x) paste(sample(letters, 20), collapse = ""))
> head(y2)
[1] "eyrxsgakipqmdbvzcohu" "nhecigzjowqpxfuaylsv" "pndljgyvkchtbmxfiwes"
[4] "flyzmxgwoqnejpakihdt" "jotwkuysniqvmdxlrgpb" "tngmldiuvpscohryzjxf"
> object.size(y2)
80040 bytes

This is nearly 10 times larger than the case above, where the vector was made up of only 3 distinct character strings. I had thought that strings were stored in something like ASCII, with a byte per character. But with these object sizes varying by an order of magnitude, that can’t be the case.
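
My best guess at an explanation (a hedged reading of R’s internals, not something established above): on a 64-bit build each element of a character vector is an 8-byte reference to a string kept in R’s global string pool, and object.size() counts each distinct string only once. That would explain why a vector with 3 distinct long strings costs little more than the references alone, while 1000 distinct strings cost much more. A quick check along those lines:

# 1000 copies of one 50-character string: essentially just the
# 8-byte-per-element references, since the single distinct string
# is stored (and counted) only once.
object.size(rep(paste(rep("a", 50), collapse = ""), 1000))

# 1000 distinct 50-character strings: the references plus 1000
# separate strings in the pool.
object.size(replicate(1000, paste(sample(letters, 50, replace = TRUE), collapse = "")))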

Presentation Layers

Clean formatting and reporting is important.

from Lyx to LaTeX

In community college I experimented with a LaTeX GUI called Lyx to create homework for my math and physics classes. At Berkeley the math graduate student instructors turned me on to just typing in LaTeX. Once I got over the learning curve I liked it much better than Lyx. Although intimidating at first, typing in a markup language offers more speed and control.

Lately I’ve come to the realization that I love R, and I enjoy coding. I plan to do it for many years. It makes sense then to spend some time up front learning how to generate a proper report, and to share work and results with others.

markup? markdown!

I’m writing this blog post in Rstudio using markdown, following Yihui Xie’s instructions. Once I’m done I’ll publish it on WordPress and save the source .Rmd file to a GitHub repository.

My first impressions of markdown are positive. It’s a snap to add math; just check out this post from Rstudio. I plan to use it at work to generate reports in HTML. Being able to do that easily feels empowering. Thanks to all those who make it possible.
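
For example, a minimal .Rmd file (entirely my own illustration, not the source of this post) mixes prose, dollar-sign math, and R code chunks like this:

Inline math goes between dollar signs, e.g. $\hat{y} = \beta_0 + \beta_1 x$.

```{r random-walk-plot}
# An R chunk: the code runs and the plot is embedded in the HTML output.
x <- cumsum(rnorm(100))
plot(x, type = "l", main = "A random walk")
```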

EDIT: I’m unable to maintain the LaTeX typesetting when I knit from Rstudio to WordPress, even when I follow the instructions from RStudio. Does anyone know how to fix this?

Memory for Rstudio Server on AWS micro instance

Lately I have been in the habit of using Rstudio Server hosted on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2). It’s convenient to be able to continue my work right where I left off, whether I’m in the office or at home. At work it avoids the corporate firewall. I can run large jobs and still have my laptop available. Most importantly, it prepares me for scaling up when it comes time for a big project. If you are interested in getting started with EC2, I suggest Louis Aslett’s excellent site.

The natural choice when starting is the free micro instance, which includes 615 MB of memory. Depending on how you use R, this may not meet your requirements. Amazon provides instructions that suggest long running jobs in particular are not well suited for a micro instance.

Today, for the first time, I saw an error message on Rstudio Server warning me that I was out of memory.

Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE) :
  cannot popen '/usr/bin/which 'svn' 2>/dev/null', probable reason 'Cannot allocate memory'

Or if you open a shell window you’ll get this popup.

[Screenshot: out-of-memory popup from the Rstudio Server shell window]

I logged out, but that did not free up the memory. I ended up killing the process from the Linux shell, and then jumping back on. After an hour of typical R work I issued the shell command to check memory:

~$ free -m

It said that I had 65 MB free. I restarted R from within Rstudio Server and checked again; this time 298 MB were free.

[Screenshot: free -m output after restarting R, showing 298 MB free]

So does this mean that I always have to restart R to clear my memory? According to this answer at StackExchange, yes.
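
For what it’s worth, you can at least see what R itself is holding before deciding to restart; a small sketch (my own aside, not from the linked answer):

# gc() reports R's memory use (Ncells/Vcells, with Mb columns) and runs a
# garbage collection; this frees memory inside R, but as the post above
# concludes, it may not be returned to the OS without restarting R.
gc()

# The largest objects in the workspace, in bytes.
head(sort(sapply(ls(), function(nm) object.size(get(nm))), decreasing = TRUE), 5)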

Efficient R – Preallocating Memory for Matrices and Vectors

This evening I was reading Norman Matloff’s excellent book, The Art of R Programming. He mentions,

If you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it’s better to allocate a large matrix in the first place.

The reason is that each call to rbind() allocates a brand-new, larger matrix and copies all of the existing rows into it, so the cost grows with every iteration. But just how much of a performance concern is this? To answer that question I wrote some simple example code.

# The TBind function allocates memory to create
# a new matrix with the same name n times, which
# makes it very inefficient.

TBind <- function(n) {
  t1 <- Sys.time()
  x <- rnorm(10)
  for (i in 2:n) {
    x <- rbind(x, rnorm(10))
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

# The TAllocate function instead preallocates the entire matrix,
# and then goes back in and fills in each row.

TAllocate <- function(n) {
  t1 <- Sys.time()
  x <- matrix(nrow = n, ncol = 10)
  for (i in 1:n) {
    x[i, ] <- rnorm(10)
  }
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Check out the results when building a matrix with 10 columns and one hundred thousand rows: about 1 second versus 22 minutes!

> TAllocate(1e5)
Time difference of 1.148581 secs
> TBind(1e5)
Time difference of 22.04133 mins

And how much time does it take to create this matrix the proper way, without using a for loop?

# TEff builds the entire matrix in a single vectorized call.

TEff <- function(n) {
  t1 <- Sys.time()
  x <- matrix(runif(10 * n), ncol = 10)
  t2 <- Sys.time()
  return(difftime(t2, t1))
}

Which gives the following result:

> TEff(1e5)
Time difference of 0.206775 secs
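
As an aside, base R’s system.time() wraps this timing pattern and avoids the Sys.time() bookkeeping; a minimal sketch of the same comparison:

# system.time() returns user, system, and elapsed times in seconds.
n <- 1e5

# Preallocate, then fill row by row.
system.time({
  x <- matrix(nrow = n, ncol = 10)
  for (i in 1:n) x[i, ] <- rnorm(10)
})

# Build the whole matrix in one vectorized call.
system.time(x <- matrix(runif(10 * n), ncol = 10))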

Thanks to Matt Leonawicz for the blog post on how to use Sys.time().