EDIT: For a more thorough treatment of this subject, please see Hadley Wickham’s post.
Today I was thinking about factors in R. They should be more memory efficient, right? But how much more memory efficient are they compared to other classes? Here’s the scoop:
> x <- sample(1:3, 1000, replace = TRUE)
> class(x)
[1] "integer"
> object.size(x)
4040 bytes
Assuming 40 bytes are for overhead, we see that each integer is stored in 4 bytes, or 32 bits per integer. If one bit stores the sign then the maximum integer is 2^31 - 1.
> as.integer(2^31 - 1)
[1] 2147483647
> as.integer(2^31)
[1] NA
Sure enough. Back to the original train of thought:
> object.size(as.numeric(x))
8040 bytes
This means that each double precision number is stored as 8 bytes = 64 bits, as expected.
> object.size(as.factor(x))
4576 bytes
Factors have more overhead than plain integers, but the values themselves are stored as the same 32-bit integers. The savings could be much greater if each value were some long character string.
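As a quick check (a sketch, reusing the sample from above), a factor's underlying type really is integer, with each distinct value stored just once in the levels attribute:

```r
x <- sample(1:3, 1000, replace = TRUE)
f <- as.factor(x)

typeof(f)      # "integer" -- the values are stored as integer codes
levels(f)      # "1" "2" "3" -- each distinct value stored only once
attributes(f)  # the levels and class attributes account for the extra overhead
```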
> object.size(as.character(x))
8184 bytes
This one is a little more mysterious. Why would a single character take up 8 bytes? I don’t have an answer. Remember, x was nothing but a sample of 1:3.
> y <- as.character(x)
> y[y == 1] <- "Here is some long string"
> y[y == 2] <- "And another bunch of letters"
> y[y == 3] <- "Make it even bigger"
> head(y)
[1] "Here is some long string"     "Make it even bigger"
[3] "Here is some long string"     "And another bunch of letters"
[5] "And another bunch of letters" "And another bunch of letters"
> object.size(y)
8256 bytes
So even though we went from strings with 1 character to strings with around 20 characters, the size of the object hardly changed. It’s worth noting that the number 1 was coerced to the string "1" when we did the comparison y == 1.
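A minimal illustration of that coercion, using a small hypothetical vector z rather than y:

```r
# When a character vector is compared to a number, the number is
# coerced to a character string before the comparison.
z <- c("1", "2", "3")
z == 1    # TRUE FALSE FALSE: the 1 is compared as "1"
"1" == 1  # TRUE for the same reason
```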
> object.size(as.factor(y))
4648 bytes
Reassuringly, when converted to a factor it’s consistent with having 32-bit integers as values, plus a bit for overhead. But what if each element is a random string of 20 characters?
> y2 <- sapply(1:1000, function(x) paste(sample(letters, 20), collapse = ""))
> head(y2)
[1] "eyrxsgakipqmdbvzcohu" "nhecigzjowqpxfuaylsv" "pndljgyvkchtbmxfiwes"
[4] "flyzmxgwoqnejpakihdt" "jotwkuysniqvmdxlrgpb" "tngmldiuvpscohryzjxf"
> object.size(y2)
80040 bytes
This is nearly 10 times larger than the case above, where the vector contained only 3 distinct character strings. I had thought that strings were stored in something like ASCII, with one byte per character. But with object sizes varying by an order of magnitude, that can’t be the whole story.
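One experiment consistent with this, though only a guess at the mechanism rather than a definitive account of R’s internals: if identical strings were shared rather than stored repeatedly, then a vector of 1000 copies of one string should be far smaller than a vector of 1000 distinct strings of the same length.

```r
# Guess: identical strings are stored once and shared, so repetition is cheap.
repeated <- rep(paste(letters[1:20], collapse = ""), 1000)  # 1000 copies of one 20-char string
distinct <- sapply(1:1000, function(i) paste(sample(letters, 20), collapse = ""))

object.size(repeated)  # much smaller
object.size(distinct)  # far larger: each distinct string needs its own storage
```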