Object size in R

EDIT: For a more thorough treatment of this subject, please see Hadley Wickham’s post.

Today I was thinking about factors in R. They should be more memory efficient, right? But how much more memory efficient are they compared to other classes? Here’s the scoop:

> x <- sample(1:3, 1000, replace = TRUE)

> class(x)
[1] "integer"

> object.size(x)
4040 bytes

Assuming 40 bytes of overhead, we see that each integer is stored in 4 bytes, or 32 bits per integer. If one bit stores the sign, then the maximum integer is 2^31 - 1.

> as.integer(2^31 - 1)
[1] 2147483647
> as.integer(2^31)
[1] NA
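R also exposes this limit directly through `.Machine`, so we don't have to rely on the sign-bit argument alone (a quick check using base R):

```r
# The largest value representable in R's 4-byte integer type
.Machine$integer.max
# [1] 2147483647

# Which is exactly 2^31 - 1
.Machine$integer.max == 2^31 - 1
# [1] TRUE
```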

Sure enough. Back to the original train of thought:

> object.size(as.numeric(x))
8040 bytes

This means that each double precision number is stored as 8 bytes = 64 bits, as expected.
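The two sizes can also be compared directly. Since integer and double vectors of the same length carry the same header, the difference in size is purely the per-element cost (a quick check):

```r
x <- sample(1:3, 1000, replace = TRUE)

# (8 - 4) bytes per element * 1000 elements = 4000 bytes
as.numeric(object.size(as.numeric(x))) - as.numeric(object.size(x))
# [1] 4000
```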

> object.size(as.factor(x))
4576 bytes

Factors have more overhead than integers, but the values are stored as the same 4-byte integers. The savings could be much greater if the values were long character strings.
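We can peek under the hood to confirm this (a quick check using base R's `typeof` and `attributes`): a factor really is an integer vector of codes, and the extra overhead is the levels and class attributes it carries.

```r
f <- as.factor(sample(1:3, 1000, replace = TRUE))

# The values are stored as integers...
typeof(f)
# [1] "integer"

# ...and the extra bytes pay for the levels and class attributes
attributes(f)
# $levels
# [1] "1" "2" "3"
#
# $class
# [1] "factor"
```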

> object.size(as.character(x))
8184 bytes

This one is a little more mysterious. Why would a single character take up 8 bytes? I don’t have an answer. Remember x was nothing but a sample of 1:3.

> y <- as.character(x)
> y[y == 1] <- "Here is some long string"
> y[y == 2] <- "And another bunch of letters"
> y[y == 3] <- "Make it even bigger"

> head(y)
[1] "Here is some long string"     "Make it even bigger"
[3] "Here is some long string"     "And another bunch of letters"
[5] "And another bunch of letters" "And another bunch of letters"

> object.size(y)
8256 bytes

So even though we went from strings with 1 character to strings of around 20 characters, the size of the object hardly changed. It's worth noting that in the comparison y == 1, the number 1 is coerced to the character "1".
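This coercion works the same way as in any comparison between a number and a character vector: the number is converted to a string before comparing (a quick check):

```r
# The numeric 1 is coerced to the character "1" before comparing
"1" == 1
# [1] TRUE

# So on a character vector, y == 1 really compares against "1"
c("1", "2", "3") == 1
# [1]  TRUE FALSE FALSE
```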

> object.size(as.factor(y))
4648 bytes

Reassuringly, when converted to a factor it's consistent with having 32-bit integers as values, plus some overhead. But what if I check the size of 1000 random strings of 20 characters each?

> y2 <- sapply(1:1000, function(x) paste(sample(letters, 20), collapse = ""))
> head(y2)
[1] "eyrxsgakipqmdbvzcohu" "nhecigzjowqpxfuaylsv" "pndljgyvkchtbmxfiwes"
[4] "flyzmxgwoqnejpakihdt" "jotwkuysniqvmdxlrgpb" "tngmldiuvpscohryzjxf"
> object.size(y2)
80040 bytes

This is nearly 10 times larger than the case above, where the vector was made up of only 3 distinct character strings. I had thought that strings were stored in something like ASCII, with a byte per character. But with the object size varying by an order of magnitude depending on how many distinct strings there are, that can't be the whole story.
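One experiment consistent with an explanation (a sketch; the interpretation — that each element of a character vector points into a pool where identical strings are stored only once — is my reading, not something `object.size` reports directly): compare 1000 copies of a single 20-character string against 1000 distinct 20-character strings.

```r
# 1000 identical 20-character strings: the string itself need
# only be stored once
shared <- rep("aaaaaaaaaaaaaaaaaaaa", 1000)
object.size(shared)

# 1000 (almost surely) distinct 20-character strings: each one
# must be stored separately
distinct <- sapply(1:1000, function(i) paste(sample(letters, 20), collapse = ""))
object.size(distinct)

# The vector of repeated strings is far smaller
as.numeric(object.size(shared)) < as.numeric(object.size(distinct))
# [1] TRUE
```

This would also account for the vectors above staying near 8000 bytes: roughly 8 bytes per element for the references, with only three unique strings needing storage.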