VIF stepwise variable selection

Abstract

In modelling, multicollinearity among the predictor variables is a potential problem. One way to detect it is variance inflation factor (VIF) analysis. In GRASS GIS, the VIF for a set of variables can be computed using the r.vif addon. The addon also lets you select a subset of variables using a stepwise selection procedure, in which variables are removed one at a time until the highest VIF value is below a user-defined threshold. In this post I introduce the addon and give some examples of how to use it.

Introduction

Multicollinearity

Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. It causes several problems. The standard errors of the affected coefficients tend to be large. The redundancy in the explanatory variables may result in model overfitting. Coefficient estimates may change erratically in response to small changes in the model or the data. And if the pattern of multicollinearity in new data differs from that in the data used to fit the model, extrapolation may introduce large errors in the predictions (O’Brien 2007; Dormann et al. 2012). Predictive uncertainty caused by multicollinearity thus poses a challenge for predictive environmental niche or species distribution modelling, especially when models are used to predict the distribution of a species under novel conditions (e.g., future climates or new areas).

Variance inflation factor

One way to detect multicollinearity is variance inflation factor (VIF) analysis (Graham 2003). The VIF is widely used as a measure of the degree of multicollinearity of one independent variable with the other independent variables in a regression model. If we have explanatory variables X1, X2, X3, … Xn, the VIF for explanatory variable X1 can be calculated by running an ordinary least squares regression with X1 as a function of all the other explanatory variables X2 … Xn. The VIF is then computed following Equation 1.

$$\mathrm{VIF} = \frac{1}{1 - R^2}$$           (Equation 1)

where R2 is the coefficient of determination of that regression. This can be repeated for each of the explanatory variables. The size of the VIF indicates the magnitude of the multicollinearity. The square root of the VIF shows how much larger the standard error is compared with what it would be if that variable were uncorrelated with the other predictor variables in the model. Thus, with a VIF of 10 for variable Xi, the standard error of the coefficient of that variable is √10 ≈ 3.2 times larger than it would be if Xi were uncorrelated with the other predictor variables. No formal cut-off value or method exists to determine when a VIF is too large. As a rule of thumb, VIF values in excess of 5 or 10 are often considered an indication that multicollinearity may be a problem (Neter et al. 1989; Menard 1995; Mason et al. 2003).
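To get a feel for how quickly the VIF grows as R2 approaches 1, the small shell sketch below applies Equation 1 to a few R2 values. It only illustrates the arithmetic and does not use GRASS; the helper name vif_from_r2 is made up for this post.

```shell
# Hypothetical helper (not part of GRASS or r.vif): apply Equation 1
# to a given R2 and print the VIF and its square root.
vif_from_r2() {
    LC_ALL=C awk -v r2="$1" 'BEGIN {
        vif = 1 / (1 - r2)
        printf "%.2f %.2f\n", vif, sqrt(vif)
    }'
}

# R2 must be below 1; at R2 = 1 the VIF is undefined (division by zero).
for r2 in 0.5 0.8 0.9; do
    echo "R2=$r2 -> VIF and sqrt(VIF): $(vif_from_r2 "$r2")"
done
```

An R2 of 0.9 already corresponds to a VIF of 10, the upper rule-of-thumb value mentioned above.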

To select a set of variables with sufficiently low multicollinearity, a stepwise selection routine can be used that removes the variables causing loss of precision in the parameter estimates, starting with the variable with the largest VIF (Craney & Surles 2002). First, the VIF values are computed for the full set of explanatory variables (X1 … Xn), and the variable with the highest VIF is removed. Next, the VIF values are computed again for the reduced set of variables. This is repeated until the largest VIF is smaller than a user-defined maximum VIF threshold.

Selecting meaningful variables

An automated selection procedure based on the VIF yields a set of variables with low multicollinearity. However, these may not be the most relevant variables for every species. Ideally, the selection of predictor variables should be based on the species’ eco-physiological tolerances (Guisan & Zimmermann 2000). If you know, or have an idea about, specific habitat requirements of your target species, but you would also like to include other (uncorrelated) variables, you can tell r.vif to retain your target variable. The function then ignores the VIF score of that variable in each round. If it happens to have the highest VIF in a round, the function removes the variable with the second highest VIF instead (see example 3).
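The rule for a single round can be sketched in a few lines of shell. This is not the code used by r.vif, just an illustration of the rule described above: take the variable with the highest VIF, unless it is on the list of retained variables. The helper name drop_candidate is hypothetical, and the input rows are four of the finite VIF values reported in Example 1 below, used only to exercise the rule. Values reported as inf are not handled by this sketch.

```shell
# Hypothetical helper, not part of r.vif: read "name vif" lines on stdin and
# print the variable that would be removed this round. $1 is an optional
# comma-separated list of variables to retain.
drop_candidate() {
    LC_ALL=C awk -v retain="$1" '
        BEGIN { n = split(retain, r, ","); for (i = 1; i <= n; i++) keep[r[i]] = 1 }
        # Skip retained variables; track the largest VIF among the others.
        !($1 in keep) && (best == "" || $2 + 0 > max) { max = $2 + 0; best = $1 }
        END { print best }
    '
}

# Four of the VIF values reported in Example 1, used only to exercise the rule.
TABLE="bio1 1745.20
bio10 2923.98
bio11 3378.38
bio12 177.43"

echo "$TABLE" | drop_candidate ""        # no retained variables
echo "$TABLE" | drop_candidate "bio11"   # bio11 is retained
```

In a full stepwise run the VIF values would have to be recomputed for the reduced set after each removal, which is what r.vif does for you.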

Computing VIF in GRASS GIS

The r.vif addon

The r.vif addon for GRASS GIS was created to compute the variance inflation factor for a user-defined set of variables. Like any function in GRASS GIS, the addon can be run from the command line interface (CLI) or using a graphical user interface (GUI; Figure 1). The different options are explained in the manual page, which is accessible in the GUI under the tab ‘Manual’ or can be viewed online. The source code of the addon can be viewed and downloaded from the OSGeo GRASS Trac site.

Figure 1
Figure 1. The graphical user interface (GUI) of the r.vif addon for GRASS GIS.

As a minimum input, the module requires the names of the raster layers representing the explanatory variables of a model. With these, the addon computes the variance inflation factor (VIF) for each of the variables. Results are written to the command output (if using the GUI) or the console (if using the CLI). In addition, the user can opt to have the results written to an output text file.

The user can also provide a maximum VIF (maxvif), in which case the addon will run the stepwise procedure described above. The results of each round are printed to the console and written to the user-defined output file as well. A third option is to retain one or more variables in the stepwise selection. This can be useful if one or more variables are known to be important determinants of the dependent variable. For example, for a species known to be sensitive to low temperatures, one may want to include the mean minimum temperature of the coldest month (bio 6). If the user opts to retain this variable, it will be kept in each round of the stepwise procedure. If that variable happens to have the highest VIF, the variable with the next highest VIF will be removed instead.

Examples

Sample data set

The data used in the examples below are the 19 bioclimatic raster layers downloaded from http://worldclim.org. These variables are derived from the monthly temperature and rainfall values of the Worldclim dataset (Hijmans et al. 2005) in order to generate more biologically meaningful variables. For the examples below, they are imported into the North Carolina (NC) sample GRASS database, following the steps outlined in the tutorial on how to import and reproject data.

Example 1 – computing the VIF

Because the bioclimatic variables are all derived from the same baseline data, multicollinearity is likely to be a problem. How serious the problem is can be examined by computing the VIF for each of the 19 bioclimatic variables. Note that the script below first lists all bioclimatic variables and assigns them to the variable ‘MAPS’ using the g.list function. This way there is no need to enter all 19 variables as input in the r.vif function (a good example of the convenience of working on the command line rather than using the GUI).

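# List all raster layers whose names start with "bio", separated by commas
MAPS=`g.list type=raster pattern="bio*" sep=,`
# Compute the VIF for all layers and write the results to example_1.csv
r.vif maps=$MAPS file=example_1.csv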

The results of the run are written to the console or, if using the GUI, the command output. In addition, the results are (optionally) written to a comma delimited file. In this case, this is the file example_1.csv.

variable   vif  sqrtvif 
bio1   1745.20    41.78
bio10  2923.98    54.07
bio11  3378.38    58.12
bio12   177.43    13.32
bio13    74.42     8.63
bio14    33.43     5.78
bio15   135.52    11.64
bio16    95.77     9.79
bio17   120.80    10.99
bio18    39.61     6.29
bio19    60.99     7.81
bio2     86.83     9.32
bio3     31.08     5.57
bio4    356.63    18.88
bio5       inf      inf
bio6       inf      inf
bio7       inf      inf
bio8      6.93     2.63
bio9      5.70     2.39

The results show that multicollinearity is indeed a serious problem. The regression models of bio 5, bio 6 and bio 7 against the other variables even give a nearly perfect fit (R2 = 1), which results in an undefined VIF (denoted by inf in the table with results). VIF values are very high for bio 1, bio 10 and bio 11 as well, indicating that these variables can be linearly predicted from the other variables with a high level of accuracy.
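The inf values follow directly from Equation 1. Assuming the standard WorldClim definitions, bio 7 (temperature annual range) is computed as bio 5 (maximum temperature of the warmest month) minus bio 6 (minimum temperature of the coldest month). Each of the three can therefore be written as an exact linear combination of the other two, so that, up to numerical precision:

```latex
\mathrm{bio7} = \mathrm{bio5} - \mathrm{bio6}
\;\Rightarrow\; R^2 = 1
\;\Rightarrow\; \mathrm{VIF} = \frac{1}{1 - R^2} = \frac{1}{0} \quad \text{(undefined, reported as inf)}
```

Removing any one of the three breaks this exact relationship, after which the other two get finite VIF values.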

Example 2 – variable selection

So, the next step is to select a set of variables with low multi-collinearity. This can be done by setting the maxvif parameter. This will tell the function to run a stepwise selection procedure as explained above. In the example below, the maximum VIF is set to 10.

MAPS=`g.list type=raster pattern="bio*" sep=,`
r.vif maps=$MAPS file=example_2.csv maxvif=10

An excerpt of the output is given below. It shows that 13 rounds were needed, i.e., 12 variables had to be removed before the remaining variables all had a VIF < 10. These are the 7 variables listed at the end.

...
...

VIF round 13
--------------------------------------
variable   vif  sqrtvif
bio1      2.82     1.68
bio14     2.89     1.70
bio18     3.79     1.95
bio2      1.26     1.12
bio4      3.45     1.86
bio8      4.67     2.16
bio9      2.50     1.58
 
selected variables are: 
--------------------------------------
bio1, bio14, bio18, bio2, bio4, bio8, bio9

If you have provided an output file name, the results are also written to a comma-delimited file. It contains the same information as printed to screen, but it can easily be imported and used in other programs like LibreOffice, Excel or R. An example is given below. The first column gives the round, the second the variable removed in the previous round, and the remaining columns give the variable names with the corresponding VIF and square root of the VIF in that round.

Figure 2
Figure 2. The comma delimited file with results, opened in LibreOffice.

Example 3 – retaining a variable

Now what if you are modelling the potential distribution of a species, and you know from other sources that the species is intolerant of frost and sensitive to extended dry periods? That means that, for example, the mean minimum temperature of the coldest month (bio 6) and the precipitation of the driest quarter (bio 17) are likely to be important variables. They are, however, not included in the set of selected variables in the previous example. If you want to include specific variables in your analysis, the r.vif function offers the option to retain one or more variables.

MAPS=`g.list type=raster pattern="bio*" sep=,`
# Keep bio6 and bio17 in every round of the stepwise selection
r.vif maps=$MAPS file=example_3.csv maxvif=10 retain=bio6,bio17

The output is similar to that in example 2, but this time you can see that bio 6 and bio 17 are included in the selected variables. The number of variables is the same, but that may not always be the case.

...
...

VIF round 13
--------------------------------------
variable      vif  sqrtvif
bio17     3.44     1.86
bio18     4.11     2.03
bio2      1.30     1.14
bio4      4.00     2.00
bio6      3.55     1.88
bio8      4.55     2.13
bio9      2.41     1.55

selected variables are: 
--------------------------------------
bio17, bio18, bio2, bio4, bio6, bio8, bio9

Statistics are written to example_3.csv

Example 4 – using the output in other functions

What if you are less interested in the (potentially long) output, but rather want to use the list of selected variables in another function? In that case you can set the ‘s’ flag, which makes the function print only the list of selected variables to screen. This makes it easier to parse the list in a script, or to pipe it to another function.

In the example below, r.vif prints the list of selected variables, which is captured in the variable SELECT. This variable is then used as input in the i.group function to create a group of layers.

MAPS=`g.list type=raster pattern="bio*" sep=,`
# -s: print only the list of selected variables
SELECT=`r.vif -s maps=$MAPS maxvif=10`
i.group group=mygroup input=$SELECT

Now, we can run i.group with the l flag to list all layers that are part of the newly created image group ‘mygroup’. It should come as no surprise to you that these are the layers bio1, bio14, bio18, bio2, bio4, bio8, and bio9.

i.group -l group=mygroup
group <mygroup> references the following raster maps
-------------
<bio1@species> <bio14@species> <bio18@species> <bio2@species>
<bio4@species> <bio8@species> <bio9@species>
-------------
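If you want to process the selected layers one by one rather than pass them on as a single list, you can split the list in the shell. The sketch below assumes that the ‘s’ flag prints a comma-separated list without spaces, as the i.group call above suggests. To keep it runnable outside a GRASS session, the variable SELECT is filled with the layer names from example 2 instead of with the output of r.vif.

```shell
# Stand-in for the output of `r.vif -s`; assumed here to be a
# comma-separated list without spaces (values taken from example 2).
SELECT="bio1,bio14,bio18,bio2,bio4,bio8,bio9"

# Replace the commas by spaces and loop over the layer names.
count=0
for layer in $(echo "$SELECT" | tr ',' ' '); do
    count=$((count + 1))
    # In a GRASS session you could call a module here, e.g. r.info map="$layer"
    echo "processing $layer"
done
echo "$count layers selected"
```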

References

  • Craney, T.A., & Surles, J.G. 2002. Model-Dependent Variance Inflation Factor Cutoff Values. Quality Engineering 14: 391–403.
  • Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J.R.G., Gruber, B., Lafourcade, B., Leitão, P.J., Münkemüller, T., McClean, C., Osborne, P.E., Reineking, B., Schröder, B., Skidmore, A.K., Zurell, D., & Lautenbach, S. 2012. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. doi: 10.1111/j.1600-0587.2012.07348.x
  • Graham, M.H. 2003. Confronting multicollinearity in ecological multiple regression. Ecology 84: 2809–2815.
  • Guisan, A., & Zimmermann, N.E. 2000. Predictive habitat distribution models in ecology. Ecological Modelling 135: 147–186. doi: 10.1016/S0304-3800(00)00354-9
  • Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., & Jarvis, A. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965–1978.
  • Mason, R.L., Gunst, R.F., & Hess, J.L. 2003. Statistical design and analysis of experiments: with applications to engineering and science. John Wiley & Sons.
  • Menard, S. 1995. Applied logistic regression analysis: Sage university series on quantitative applications in the social sciences. Thousand Oaks, CA: Sage.
  • Neter, J., Wasserman, W., & Kutner, M.H. 1989. Applied linear regression models. Irwin Homewood, IL.
  • O’Brien, R.M. 2007. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity 41: 673–690.
