A faster way to calculate MESS in R – or – a tribute to Stack overflow

Jean-Pierre Rossi introduced a function to calculate the “Multivariate Environmental Similarity Surfaces” in R. Since then, the function has become part of the dismo package, which is a package maintained by Robert J. Hijmans and which offers a whole lot of functions for species distribution modelling.

The function calculates the MESS based on a point layer (the reference points) and a set of raster layers (the environmental data layers). It works great for smaller data sets, but you may run into trouble for very large data sets. So I needed to find a way to run this faster. And… I couldn’t. It is already a pretty neat piece of code, and I could not think of a better way of doing this in R. (I am working on For an alternative to do this in GRASS, but more about that later: see here).

That is where I decided to try my luck on Stack Overflow. Stack Overflow is a programming Q & A site with a very active community, with questions often getting answered within hours. Or, in my case, within an hour! You can check out my question and the answer by Joshua Ulrich here. Below I will use the answer and expand on it to create an alternative way to run messi. Obviously, due credit goes to Joshua Ulrich as well as to Jean-Pierre for the original code.

The key part of the mess function in the dismo package is the messi function, given below:

messi <- function(p, v) {
        niinf <- length(which(v <= p))
        f <- 100 * niinf/length(v)
        if (f == 0)
            simi <- 100 * (p - min(v))/(max(v) - min(v))
        if (0 < f & f <= 50)
            simi <- 2 * f
        if (50 <= f & f < 100)
            simi <- 2 * (100 - f)
        if (f == 100)
            simi <- 100 * (max(v) - p)/(max(v) - min(v))
        return(simi)
    }

If you have an environmental layer e, this is converted in a vector and the messi function is run on every element of the vector to calculate the environmental dissimilarity. OK, I am not sure this is clear, check out the complete code otherwise. The point is that if you have a vector e the messi function is run length(e) times. It doesn’t make a large difference for small vectors, but the overhead from R function calls really starts to add up with larger vectors.

The answer on Stack Overflow was to use the findInterval function to calculate niinf (i.e., to replace the: length(which(v <= p)) in the function above. That also means the rest of the code needs to be adapted. One way is:

messi2 <- function(p,v){
  f<-100*findInterval(p,sort(v))/length(v)
  a <- length(p)
  maxv <- max(v)
  minv <- min(v)
  opt1 <- 100*(p-minv)/(maxv-minv)
  opt2 <- 2*f
  opt3 <- 2 * (100-f)
  opt4 <- 100*(maxv-p)/(maxv-minv)
  simi <- ifelse(f==0, opt1, ifelse(f<=50, opt2, ifelse(f<100, opt3,opt4)))
  return(simi)
}

So, how much faster is the new code? Let’s see:

set.seed(21)
e <- rnorm(1e5)
g <- rnorm(1e4)

system.time(o1 <- sapply(X=e, FUN=messi, v=g))
   user  system elapsed
 13.861   0.028  13.916

system.time(o2 <- messi2(e,g))
   user  system elapsed
  0.084   0.000   0.087

all.equal(o1,o2)
[1] TRUE

Not much further to say about those number…

Update: Robert Hijmans implemented the code above in the latest version of dismo.. and he managed to further improve the code, making it another 2-3 times faster!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s