Cross tables in R, some ways to do it faster

To create a contingency table of the counts of each combination of two variables in R, I would normally resort to table(). But how fast is it? That question becomes more relevant when working on large tables, and when you have to run it very often. As the examples below show, table() isn't terribly fast, and there are other, faster ways to create cross tables.

In all the examples I am using the table SPFn, which has two variables (pnvid and wdpaint) and 1.5 million records. I want to create a cross table with the counts of each combination of the two variables. This is repeated 12 times, shuffling the values in the wdpaint column in each iteration. The code was timed using system.time(). A cautionary note: the two variables pnvid and wdpaint were initially declared as numeric. Changing them to integer dramatically improved the speed of all the code below (2-6 times, depending on the function used). The timings given are after changing these variables to integer.
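The SPFn data itself is not included in the post, so as a rough sketch of the setup, here is a hypothetical stand-in (the size matches the post, but the value ranges are invented), together with the integer conversion and a system.time() call:

# Hypothetical stand-in for SPFn: the real data is not shown in the post,
# so the value ranges below are invented for illustration only
set.seed(1)
n <- 1.5e6
SPFn <- data.frame(pnvid   = as.numeric(sample(1:50, n, replace = TRUE)),
                   wdpaint = as.numeric(sample(1:20, n, replace = TRUE)))

# Converting the numeric columns to integer speeds up all the code below
SPFn$pnvid   <- as.integer(SPFn$pnvid)
SPFn$wdpaint <- as.integer(SPFn$wdpaint)

# system.time() is used throughout to time each approach
system.time(table(SPFn$pnvid, SPFn$wdpaint))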

With table()

Arguably the easiest way to create a cross table with counts is with the table() function.

# Repeat 12 times: shuffle wdpaint and cross-tabulate against pnvid
c <- vector('list', 12)
for(i in 1:12){
  c[[i]] <- table(SPFn$pnvid, sample(SPFn$wdpaint, replace=FALSE))
}
c <- do.call('rbind', c)

It takes about 18 seconds to run. That isn't terribly fast, especially if I want to run the loop 10,000 times.

With tapply()

Another way to create a cross table is with the tapply() function. This is a bit more complicated, mostly to get the data in the required format.

# Grouping list: first element fixed, second shuffled in each iteration
a <- list(as.factor(SPFn$pnvid))
c <- vector('list', 12)
for(i in 1:12){
  a[[2]] <- as.factor(sample(SPFn$wdpaint, replace=FALSE))
  c[[i]] <- tapply(SPFn$pnvid, a, length)   # count records per combination
}
c <- do.call('rbind', c)
c[is.na(c)] <- 0   # combinations that do not occur give NA; set them to 0

It took about the same time as using table() in the example above. Note that when the grouping factors (wdpaint and pnvid) are numeric, tapply() is much faster than table(). Even so, the better solution is still to change them to integer first, as mentioned above.
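To see that effect on a toy example (made-up vectors, not the SPFn data; actual timings will differ):

# Toy benchmark comparing table() and tapply() on numeric
# versus integer grouping vectors
x_num <- as.numeric(sample(1:50, 1e6, replace = TRUE))
y_num <- as.numeric(sample(1:20, 1e6, replace = TRUE))
x_int <- as.integer(x_num)
y_int <- as.integer(y_num)

system.time(table(x_num, y_num))                        # numeric: slowest
system.time(table(x_int, y_int))                        # integer: much faster
system.time(tapply(x_int, list(x_int, y_int), length))  # counts per combination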

With data.table

The third approach is with data.table, an extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns. Creating a cross table with it is very simple; most of the code below is again about getting the result in the right format.

library(data.table)

c <- vector('list', 12)
for(i in 1:12){
  # V1 = shuffled wdpaint, V2 = pnvid
  a <- data.table(sample(SPFn$wdpaint, replace=FALSE), SPFn$pnvid)
  b <- a[, .N, by=list(V1, V2)]   # count rows per combination
  c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)
c[is.na(c)] <- 0   # combinations that do not occur give NA; set them to 0

This took 2.4 seconds to run. That is 7.5 times faster than with tapply() or table().
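As an aside, more recent versions of data.table provide dcast(), which can reshape the counts into a wide cross table directly, filling missing combinations with zeros. A sketch of that alternative (not part of the original timing comparison):

library(data.table)

# Count the combinations, then cast long to wide,
# with 0 for combinations that do not occur
a <- data.table(V1 = sample(SPFn$wdpaint, replace = FALSE), V2 = SPFn$pnvid)
b <- a[, .N, by = list(V1, V2)]
ct <- dcast(b, V2 ~ V1, value.var = "N", fill = 0)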

Edit: below is a small modification of the code above. It gives a marginally (2%) better time:

c <- vector('list', 12)
a1 <- data.table(SPFn$wdpaint, SPFn$pnvid)
a2 <- SPFn$wdpaint
for(i in 1:12){
  a3 <- a1[, V1:=sample(a2, replace=FALSE)]   # shuffle V1 in place by reference
  b <- a3[, .N, by=list(V1, V2)]
  c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
}
c <- do.call('rbind', c)
c[is.na(c)] <- 0

What about parallelization?

R normally uses only one core. So what if you have a machine with multiple cores? The 'multicore' package provides a way of running parallel computations in R on machines with multiple cores or CPUs (its functionality has since been absorbed into the base 'parallel' package). Below, the same code as above is written twice (p and q); the parallel() function runs p and q as parallel jobs, each on a different core.

p <- parallel({
  # first job: 12 iterations of the data.table approach
  c <- vector('list', 12)
  for(i in 1:12){
    a <- data.table(sample(SPFn$wdpaint, replace=FALSE), SPFn$pnvid)
    b <- a[, .N, by=list(V1, V2)]
    c[[i]] <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
  }
  c <- do.call('rbind', c)
  c[is.na(c)] <- 0
  c
}, mc.set.seed=TRUE)

q <- parallel({
  # second job: another 12 iterations, run on a second core
  f <- vector('list', 12)
  for(i in 1:12){
    d <- data.table(sample(SPFn$wdpaint, replace=FALSE), SPFn$pnvid)
    e <- d[, .N, by=list(V1, V2)]
    f[[i]] <- tapply(e$N, list(as.factor(e$V2), as.factor(e$V1)), sum)
  }
  f <- do.call('rbind', f)
  f[is.na(f)] <- 0
  f
}, mc.set.seed=TRUE)

test <- collect(list(p, q))            # wait for both jobs and gather results
test <- rbind(test[[1]], test[[2]])

More code, but also almost twice as fast, reducing the total run time to below 1.3 seconds. So we are down from 18 seconds in the first example to 1.3 seconds in the last example. Not bad.
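For what it's worth, the same idea can be written more compactly with mclapply() from the base 'parallel' package, which absorbed multicore's functionality. A sketch, assuming two cores and a hypothetical helper function crosstab_once() (not from the original post):

library(parallel)
library(data.table)

# One iteration: shuffle wdpaint, count combinations, reshape to a cross table
crosstab_once <- function(i){
  a <- data.table(sample(SPFn$wdpaint, replace = FALSE), SPFn$pnvid)
  b <- a[, .N, by = list(V1, V2)]
  ct <- tapply(b$N, list(as.factor(b$V2), as.factor(b$V1)), sum)
  ct[is.na(ct)] <- 0
  ct
}

# Spread the 12 iterations over two cores
res <- mclapply(1:12, crosstab_once, mc.cores = 2)
res <- do.call('rbind', res)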

I'll leave it here for now, but I am sure there are other ways to further increase the speed of the code (and to improve it otherwise). There are many interesting ideas and detailed explanations elsewhere on how to speed up code in R.

If you have other suggestions to speed up the code above, or other useful links, let me know in the comments below. I would also like to thank Matthew Dowle, one of the authors of the data.table package, for a number of very helpful suggestions that did a lot to improve the code presented here (see also this email thread on the datatable-help mailing list).
