A common technique for estimating the accuracy of a predictive model is k-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k sub-samples with an approximately equal number of records. One of these sub-samples is retained as the validation data for testing the model, and the remaining k - 1 sub-samples are combined and used as training data. The process is then repeated k times, with each sub-sample used exactly once as the validation data (Table 1).
The k evaluation results can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
Functions for modelling and machine learning in, e.g., R and Python’s Scikit-learn often contain built-in cross-validation routines. But it is also fairly easy to build such a routine yourself. This tutorial shows how to build a k-fold cross-validation routine in GRASS GIS, e.g., to evaluate the predictive performance of two interpolation techniques: inverse distance weighting and bilinear spline interpolation.
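To make the procedure concrete, here is a minimal sketch in plain Python. It is not the GRASS GIS routine from the tutorial. The function names (`k_fold_indices`, `cross_validate`) and the toy "predict the training mean" model are assumptions made up for illustration, and RMSE is just one possible evaluation statistic.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return one RMSE per fold, each fold serving once as validation data."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, val in enumerate(folds):
        # Training data: all folds except the one held out for validation.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        sq_errors = [(predict(model, xs[j]) - ys[j]) ** 2 for j in val]
        scores.append(mean(sq_errors) ** 0.5)
    return scores


# Toy model (hypothetical): always predict the mean of the training values.
def fit_mean(xs, ys):
    return mean(ys)


def predict_mean(model, x):
    return model


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_rmse = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print("RMSE per fold:", fold_rmse)
print("Mean RMSE:", mean(fold_rmse))
```

Passing `fit` and `predict` as arguments keeps the partitioning logic independent of the model, so two techniques can be compared on the same folds by reusing the same `seed`.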
After four months of development, the new update release GRASS GIS 7.2.1 is available. It provides more than 150 stability fixes and manual improvements compared to the first stable release version 7.2.0. An overview of new features in this release series is available at New Features in GRASS GIS 7.2. See the original announcement on the GRASS GIS website.
GRASS GIS offers some useful but basic plotting options for raster data. However, for plotting data from attribute tables and for more advanced graphs, we need to use other software tools. In this tutorial I explore some of the possibilities offered by Pandas plot() and how we can further tune plots using the matplotlib pyplot library.
In this post I show how to import an attribute table of a vector layer in a GRASS GIS database into a Pandas data frame. Pandas stands for Python Data Analysis Library, which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For people familiar with R, the Pandas data frame is an object similar to the R data frame. Data frames resemble the most common way in which spreadsheets are used, with the data presented in rectangular form, columns holding variables and rows holding observations. An important characteristic is that the data frame, like a spreadsheet, can hold different types of data in different columns: numbers, character data, dates and so on.
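As a rough illustration of the idea, the sketch below reads a table from an SQLite database into a data frame with `pandas.read_sql_query`. It assumes the attribute table lives in SQLite (the default attribute backend in GRASS GIS 7) and uses an in-memory database as a stand-in; the table name `sample_points` and its columns are hypothetical, and the post itself may take a different route.

```python
import sqlite3

import pandas as pd

# Stand-in for a GRASS GIS attribute table stored in SQLite.
# Table and column names are hypothetical, chosen only for illustration.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sample_points (cat INTEGER, landuse TEXT, elevation REAL)"
)
con.executemany(
    "INSERT INTO sample_points VALUES (?, ?, ?)",
    [(1, "forest", 812.5), (2, "grassland", 455.0), (3, "forest", 930.2)],
)
con.commit()

# Read the whole table into a data frame; each column keeps its own type.
df = pd.read_sql_query("SELECT * FROM sample_points", con)
con.close()

print(df.dtypes)
print(df.groupby("landuse")["elevation"].mean())
```

With a real GRASS GIS database, the connection would point at the mapset's SQLite file instead of `":memory:"`, and the table name would normally match the vector layer.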
I just updated the r.vif add-on. The add-on lets you run a step-wise variance inflation factor (VIF) procedure. As explained in more detail here, the VIF can be used to detect multicollinearity in a set of explanatory variables. The step-wise selection procedure provides a way to select a set of variables with sufficiently low multicollinearity.
The update should make the computation of VIF much faster. For very large raster layers it is possible to have the VIF computed based on a random subset of raster cells. There is also a low-memory option, which allows one to run the add-on with much larger data sets but, as explained in the r.vif manual page, also runs considerably slower.
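For readers unfamiliar with the procedure, here is a minimal NumPy sketch of a step-wise VIF selection. It is not the r.vif code. It assumes the common formulation in which the VIF of a variable is 1 / (1 - R²) from regressing it on all the others, and in which the variable with the highest VIF is dropped at each step until every VIF is at or below a threshold. The function names, the threshold of 10, and the synthetic variables are all illustrative assumptions.

```python
import numpy as np


def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus all other explanatory variables.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def stepwise_vif(X, names, max_vif=10.0):
    """Drop the variable with the highest VIF until all are <= max_vif."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= max_vif:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names


# Synthetic data (hypothetical): var_c is almost exactly var_a + var_b.
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.01, size=200)
X = np.column_stack([a, b, c])

print("Initial VIF:", vif(X))
kept = stepwise_vif(X, ["var_a", "var_b", "var_c"])
print("Retained variables:", kept)
```

For raster layers, each column of `X` would hold the cell values of one layer; computing on a random subset of cells, as the add-on allows, corresponds to sampling rows of `X` before calling `stepwise_vif`.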