GRASS and Pandas – from attribute table to pandas dataframe

Introduction

In this post I show how to import an attribute table of a vector layer in a GRASS GIS database into a Pandas data frame. Pandas stands for Python Data Analysis Library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For people familiar with R, the Pandas data frame is an object similar to the R data frame. They are a lot like the most common way in which spreadsheets are used, with the data presented in rectangular form with columns holding variables and rows holding observations. An important characteristic is that the data frame, like a spreadsheet, can hold different types of data in different columns: numbers, character data, dates and so on.

I use the North Carolina data set, which is a demo data set in GRASS GIS format that can be downloaded from the GRASS GIS website. I will use the vector layer with the bus routes of the NC State University bus service Wolfline (busroutesall) as example. I first compute the length of the line segments for each route and subsequently upload the values to the attribute table of the vector layer. Next, I compute the total length per route and import the attribute table into a Pandas data frame. I show how this can be done using different approaches.

busroutesall
Routes of the NC State University bus service Wolfline (red lines). Background layer is the Openstreet Map (provider OSM-WMS Uni Heidelberg).

Compute the route lengths

The next screen shots show how you can compute the length of each of the route segments (click on the images to open a larger version). The values will be written to a new column in the attribute table of the vector layer with bus routes.

Using the GUI

screenshot1

Step 1: Copy the original layer ‘busroutesall’ to ‘busroutes_nc’ using g.copy.

screenshot2

Step 2: Add a column ‘length’ to the attribute table of busroutes_nc using v.db.addcolumn.

screenshot3

Step 3: upload the length of the segments to the newly created column using v.to.db.

On the command line

The 3 steps shown above can also be carried out on the command line:

g.copy vector=busroutesall,busroutes_nc
v.db.addcolumn map=busroutes_nc columns="length double precision"
v.to.db map=busroutes_nc@PERMANENT option=length columns=length

Using Python

If you are writing a python script, you may want to implement the steps above as part of your python script. GRASS GIS offers different API’s (application programming interface) that provide direct access to the many functions of GRASS GIS. Below I use the GRASS Python Scripting Library. More information on the use of GRASS GIS and Python can be found here. If you are new to using Python scripts in GRASS, make sure to check it out first. Note that you need to run the GRASS Python script from inside a GRASS session, i.e. either from the command line or from the Python shell embedded in the wxGUI.

In the script below, first the required library is imported, viz. the grass.script library. Next, the steps 1-3 explained above are run again, but this time as a python script. For those of you familiar with rgrass7 addon in R, which offers a convenient access to GRASS functions in R, you’ll notice that the syntax is very similar.

# import libraries
import grass.script as gs
import pandas as pd
import sqlite3

# Compute length of route segments.
gs.run_command("g.copy", vector=["busroutesall", "busroutes_nc"])
gs.run_command("v.db.addcolumn", map="busroutes_nc", columns="length double precision")
gs.run_command("v.to.db", map="busroutes_nc", option="length", columns="length ")

Examining the results

You can use the Attribute Table Manager to inspect the results. As you will see, you have now a new column with holds for each line segment the length (in meters). The problem is that we do not want to have the length per segment, but per route. So what we need to do is to aggregate the data, summing up the lenght of each segments of each of the routes. This is what we I’ll do in the next section.

screenshot4

Import the attribute table in Pandas

Now, let’s read in the attribute table into a Pandas data frame. Again, first step is to import the required libraries, which in this case include the grass.script library as well as the pandas and sqlite3 libraries.

Next, you use the convenient GRASS command db.databases to get the path to the sqlite database that holds the attribute table. You can of course provide the path directly (in this case that would be ~/data/GRASSdb/nc_spm_08_grass7/PERMANENT/sqlite), but this is much easier. You use the path to connect to the database.

Now, you can use Panda’s read_sql_query function to aggregate the data and read in the results using a SQL query. The advantage, especially if you are dealing with very large data tables, is that the data is aggregated in the SQLite database first, and you are only reading in the results. After you are ready, make sure to close the connection to the database.

# import libraries
import grass.script as gs
import pandas as pd
import sqlite3

# Read in the attribute table
sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath)
sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE"
df = pd.read_sql_query(sqlstat, con)
con.close()

To check the data types of the columns in your dataframe, use the dtype function. This will show you that the data types for the columns cat, ROUTE and length are integer, object and float respectively. Note that the data type object is assigned to a column if it has mixed types (numbers and strings). The shape function will show you the dimensions of the table.

df.dtypes
Out[28]:
cat         int64
ROUTE      object
length    float64
dtype: object

df.shape
Out[5]: (13, 2)

So let’s have a look at the data. You can view the data by printing the data to the console using the print command, or if you want to show a few lines only, the head command. If you examine the whole table you will see that the names of the routes in the column ROUTE indeed consist of both numbers and strings.

df.head()
Out[7]:
  ROUTE  totallength
0  None   534.347153
1     1  9926.863490
2    10  7209.341378
3    11  8686.935003
4     2  8933.922698

As you can see, there are one or more segments without busroute number. We can use the dropna() function in Pandas to remove all rows with in at least one of the column a missing value (this includes both NaN or None).

df = df.dropna()
df.head()

Out[9]:
  ROUTE  totallength
1     1  9926.863490
2    10  7209.341378
3    11  8686.935003
4     2  8933.922698
5     3  7756.842336

Alternative approaches

The approach above is not the only one possible. You can for example also use the GRASS function db.select to query the database (instead of the read_sql_query function we used above). Disadvantage of this approach is that it requires a bit more effort to get the data into a Dataframe. But if you do not want to introduce the extra dependency on Pandas, e.g., in a GRASS addon, this approach may be preferably, or if you need the data as a list rather than Dataframe.

# Aggregate data using SQL group by and read into Python
sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE"
df = gs.read_command("db.select", sql=sqlstat)

# Convert string in list
df = [z.split('|') for z in df.split('\n')]
colnames = df.pop(0)

# Convert list to panda dataframe and remove rows with empty / nodata
# Note that the replace statement replaces empty strings for nodata,
# so we can then use dropna() to remove those rows.
df = pd.DataFrame(df, columns=colnames)
df['ROUTE'].replace('', pd.np.nan, inplace=True)
df = df.dropna()

A second alternative is to read in the whole table using the read_sql_query function , and then use Pandas pivot_table function to aggregate the values in the column ‘length’ by ROUTE. Although it makes less sense in this particular case, you may for example use this approach when you need to do a whole range of different computations on the attribute table. The pivot_table function is ideal for number crunching and data aggregation, offering similar (and more) functionality as the pivot table in Excel and Libreoffice.

# Read in the attribute table
sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '')
con = sqlite3.connect(sqlpath)
sqlstat="SELECT * FROM busroutes_nc"
df = pd.read_sql_query(sqlstat, con)
con.close()

# Remove rows with nodata and aggregate data
df = df.dropna()
df.pivot_table(index='ROUTE', values='length', aggfunc="sum")
print(df)

ROUTE
1      9926.863490
10     7209.341378
11     8686.935003
2      8933.922698
3      7756.842336
4      8972.997628
5      8216.070478
6     10359.835495
7      5858.153859
7a     5942.113479
8      8453.168063
9      8542.797914
Name: length, dtype: float64

Other sources

Besides the various links in this post, make sure to check out this very nice tutorial on using Panda’s dataframe by Datacamp. And see this tutorial by Datacarpentry if you want to know more about how to deal with data types and formats in Python.

Advertisements

3 thoughts on “GRASS and Pandas – from attribute table to pandas dataframe

  1. Thanks for another good post. Have you seen geopandas (https://github.com/geopandas/geopandas)? I’ve used it a lot to work with shapefiles, and I know it can read postgis databases as well. As an alternative to your method, I’m pretty sure you could do all of this, including the route length calculation, within geopandas. Good to see how to do it in GRASS too though. Thanks again.

    1. Hi Jared. Thanks for the tip. I have seen, but not yet tried out Geopandas. This post was part of my exploration of how I can use Panda dataframes in a GRASS GIS script I am writing (nothing to do with the example, that was just a quick example using the standard NC dataset). In this script I need to read in some attribute data, so no need for spatial capabilities, that part is taken care of by GRASS functions. But I certainly thing geopandas offers some interesting tools that might come in handy. I would be curious how it compares to the sp and associated packages in R.

  2. Pingback: GRASS and Pandas – from attribute table to pandas dataframe – GeoNe.ws

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s