In R, to create a contingency table of the counts of the combinations of two variables, I would normally resort to table(). But how fast is it? That question becomes more relevant when working on large tables and when the function has to be run very often. As the examples below show, it isn't terribly fast, and there are other, faster ways to create cross tables.
In all the examples I am using the table SPFn, which has two variables (pnvid and wdpaint) and 1.5 million records. I want to create a cross table with the counts of each combination of the two variables. This is repeated 12 times, whereby the values in the wdpaint column are shuffled in each iteration. Timing of the code was done using system.time(). A cautionary note: the two variables pnvid and wdpaint were initially declared as numeric. Changing them to integers dramatically improved the speed of all the code below (2-6 times, depending on the function used). The timings given are after changing these variables to integers.
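To illustrate the effect of the numeric-to-integer conversion, here is a minimal sketch. SPFn below is a small simulated stand-in for the real table (the column names pnvid and wdpaint are taken from the post; the actual values are made up), so the absolute timings will differ from those reported here:

```r
# Simulated stand-in for SPFn, with both columns stored as doubles
set.seed(42)
n <- 1.5e6
SPFn <- data.frame(pnvid   = as.numeric(sample.int(50L, n, replace = TRUE)),
                   wdpaint = as.numeric(sample.int(6L, n, replace = TRUE)))

# Cross table with the columns as doubles
system.time(table(SPFn$pnvid, SPFn$wdpaint))

# Convert both grouping variables to integer
SPFn$pnvid   <- as.integer(SPFn$pnvid)
SPFn$wdpaint <- as.integer(SPFn$wdpaint)

# The same cross table is now noticeably faster
system.time(table(SPFn$pnvid, SPFn$wdpaint))
```

The conversion itself is cheap, so it pays off as soon as the table is built more than once.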
With table()
The arguably easiest way to create a cross table with counts is with the table() function.
c <- vector('list', 12)
for(i in 1:12){
  c[[i]] <- table(SPFn$pnvid, sample(SPFn$wdpaint, replace=F))
}
c <- do.call('rbind', c)
It takes about 18 seconds to run. That isn't terribly fast, especially if I want to run the loop 10,000 times.
With tapply
Another way to create a cross table is with the tapply function. This is a bit more complicated, mostly to get the data in the required format.
a <- list(as.factor(SPFn$pnvid))
c <- vector('list', 12)
for(i in 1:12){
  a[[2]] <- as.factor(sample(SPFn$wdpaint, replace=F))
  c[[i]] <- tapply(SPFn$cat, a, function(x) length(x))
}
c <- do.call('rbind', c)
c[] <- sapply(c, function(x) replace(x, is.na(x), 0))
It took about the same time as table() in the example above. Note that when the grouping factors (wdpaint and pnvid) are numeric, tapply is much faster than table(). Still, the better solution is to change them to integers first, as mentioned above.
With data.table
The third approach is with data.table. It is an extension of data.frame that offers fast indexing, fast ordered joins, fast assignment, fast grouping and list columns. Creating the cross table itself is very simple; most of the code below is again needed to get the result in the right format.
c <- vector('list', 12)
for(i in 1:12){
  a <- data.table(sample(SPFn$wdpaint, replace=F), SPFn$pnvid)
  b <- a[, .N, by=list(V1, V2)]
  c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)
c[] <- sapply(c, function(x) replace(x, is.na(x), 0))
This took 2.4 seconds to run. That is 7.5 times faster than tapply or table().
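As an aside, the reshaping step with tapply and the subsequent NA replacement can also be done in one go with base R's xtabs(), which fills absent combinations with 0 automatically. A minimal sketch, using a tiny made-up stand-in for the aggregated counts b produced by the data.table step above (I have not timed this against the tapply version):

```r
# Tiny stand-in for b: one row per observed (V1, V2) combination with its count N
b <- data.frame(V1 = c(1L, 1L, 2L, 2L),     # shuffled wdpaint values
                V2 = c(10L, 11L, 10L, 11L), # pnvid values
                N  = c(3L, 5L, 2L, 7L))

# Rows = pnvid, columns = wdpaint; missing combinations become 0, not NA
xt <- xtabs(N ~ V2 + V1, data = b)
```

This avoids the explicit replace(x, is.na(x), 0) step entirely.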
Edit: below is a small modification of the code above. It gives a marginally (2%) better time:
c <- vector('list', 12)
a1 <- data.table(SPFn$wdpaint, SPFn$pnvid)
a2 <- SPFn$wdpaint
for(i in 1:12){
  a3 <- a1[, V1 := sample(a2, replace=F)]
  b <- a3[, .N, by=list(V1, V2)]
  c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)
c[] <- sapply(c, function(x) replace(x, is.na(x), 0))
What about parallelization?
R normally uses only one core. So what if you have a machine with multiple cores? The 'multicore' package provides a way of running parallel computations in R on machines with multiple cores or CPUs. Below, the same code as above is written twice (p and q). The parallel() function ensures that p and q are run as parallel jobs, each using a different core.
p <- parallel({
  c <- vector('list', 12)
  for(i in 1:12){
    a <- data.table(sample(SPFn$wdpaint, replace=F), SPFn$pnvid)
    b <- a[, .N, by=list(V1, V2)]
    c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
  }
  c <- do.call('rbind', c)
  c[] <- sapply(c, function(x) replace(x, is.na(x), 0))
  c
}, mc.set.seed=TRUE)
q <- parallel({
  f <- vector('list', 12)
  for(i in 1:12){
    d <- data.table(sample(SPFn$wdpaint, replace=F), SPFn$pnvid)
    e <- d[, .N, by=list(V1, V2)]
    f[[i]] <- tapply(e$N, list(as.factor(e$V2), as.factor(e$V1)), sum)
  }
  f <- do.call('rbind', f)
  f[] <- sapply(f, function(x) replace(x, is.na(x), 0))
  f
}, mc.set.seed=TRUE)
test <- collect(list(p, q))
test <- rbind(test[[1]], test[[2]])
More code, but also almost twice as fast, reducing the total run time to below 1.3 seconds. So we are down from 18 seconds in the first example to 1.3 seconds in the last example. Not bad.
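An alternative that avoids writing the loop body twice is mclapply(), which is also provided by 'multicore' (and nowadays ships with base R in the 'parallel' package). It maps a function over the iterations and distributes them over the cores for you. A minimal sketch, using the plain table() variant for brevity and a small simulated SPFn, since the real data is not available here:

```r
library(parallel)  # mclapply(); originally from the 'multicore' package

# Small simulated stand-in for SPFn
set.seed(1)
n <- 1e5
SPFn <- data.frame(pnvid   = sample.int(5L, n, replace = TRUE),
                   wdpaint = sample.int(4L, n, replace = TRUE))

# One iteration: cross table of pnvid against shuffled wdpaint
crosstab <- function(i) {
  table(SPFn$pnvid, sample(SPFn$wdpaint, replace = FALSE))
}

# Run the 12 iterations over 2 cores (mc.cores > 1 is not supported on Windows)
res <- mclapply(1:12, crosstab, mc.cores = 2)
res <- do.call('rbind', res)
```

The same pattern works with the data.table-based loop body; the advantage over parallel()/collect() is that the number of jobs and cores can be changed without duplicating code.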
I’ll leave it here for now, but I am sure there are other ways to further increase the speed of the code (and improve otherwise). There are many interesting ideas and detailed explanations on how to speed up code in R. Some examples are:
- Speed up the loop operation in R on Stackoverflow
- Speeding up parentheses (and lots more) in R: gives some very useful, and sometimes curious tips to speed up your code.
- Speeding up R computations: another good post with tips how to speed up your code
- In{s}a(ne)!!: another post about how the use of brackets influences the speed of your code
- Speeding up R computations Pt II: compiling and Speed up your R code using a just-in-time (JIT) compiler: compiling your code is a promising way to speed up code
- Efficient loops in R — the complexity versus speed trade-off
If you have other suggestions to speed up the code above, or other useful links, let me know in the comments below. I would also like to thank Matthew Dowle, one of the authors of the data.table package, for a number of very helpful suggestions that did much to improve the code presented here (see also this email thread on the datatable-help mailing list).