In this post I show how to import an attribute table of a vector layer in a GRASS GIS database into a Pandas data frame. Pandas stands for Python Data Analysis Library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For people familiar with R, the Pandas data frame is an object similar to the R data frame. They are a lot like the most common way in which spreadsheets are used, with the data presented in rectangular form with columns holding variables and rows holding observations. An important characteristic is that the data frame, like a spreadsheet, can hold different types of data in different columns: numbers, character data, dates and so on.
I use the North Carolina data set, which is a demo data set in GRASS GIS format that can be downloaded from the GRASS GIS website. I will use the vector layer with the bus routes of the NC State University bus service Wolfline (busroutesall) as example. I first compute the length of the line segments for each route and subsequently upload the values to the attribute table of the vector layer. Next, I compute the total length per route and import the attribute table into a Pandas data frame. I show how this can be done using different approaches.
Compute the route lengths
The next screen shots show how you can compute the length of each of the route segments (click on the images to open a larger version). The values will be written to a new column in the attribute table of the vector layer with bus routes.
Using the GUI
Step 1: Copy the original layer ‘busroutesall’ to ‘busroutes_nc’ using g.copy.
Step 2: Add a column ‘length’ to the attribute table of busroutes_nc using v.db.addcolumn.
Step 3: upload the length of the segments to the newly created column using v.to.db.
On the command line
The 3 steps shown above can also be carried out on the command line:
g.copy vector=busroutesall,busroutes_nc v.db.addcolumn map=busroutes_nc columns="length double precision" v.to.db map=busroutes_nc@PERMANENT option=length columns=length
If you are writing a python script, you may want to implement the steps above as part of your python script. GRASS GIS offers different API’s (application programming interface) that provide direct access to the many functions of GRASS GIS. Below I use the GRASS Python Scripting Library. More information on the use of GRASS GIS and Python can be found here. If you are new to using Python scripts in GRASS, make sure to check it out first. Note that you need to run the GRASS Python script from inside a GRASS session, i.e. either from the command line or from the Python shell embedded in the wxGUI.
In the script below, first the required library is imported, viz. the grass.script library. Next, the steps 1-3 explained above are run again, but this time as a python script. For those of you familiar with rgrass7 addon in R, which offers a convenient access to GRASS functions in R, you’ll notice that the syntax is very similar.
# import libraries import grass.script as gs import pandas as pd import sqlite3 # Compute length of route segments. gs.run_command("g.copy", vector=["busroutesall", "busroutes_nc"]) gs.run_command("v.db.addcolumn", map="busroutes_nc", columns="length double precision") gs.run_command("v.to.db", map="busroutes_nc", option="length", columns="length ")
Examining the results
You can use the Attribute Table Manager to inspect the results. As you will see, you have now a new column with holds for each line segment the length (in meters). The problem is that we do not want to have the length per segment, but per route. So what we need to do is to aggregate the data, summing up the lenght of each segments of each of the routes. This is what we I’ll do in the next section.
Import the attribute table in Pandas
Now, let’s read in the attribute table into a Pandas data frame. Again, first step is to import the required libraries, which in this case include the grass.script library as well as the pandas and sqlite3 libraries.
Next, you use the convenient GRASS command db.databases to get the path to the sqlite database that holds the attribute table. You can of course provide the path directly (in this case that would be ~/data/GRASSdb/nc_spm_08_grass7/PERMANENT/sqlite), but this is much easier. You use the path to connect to the database.
Now, you can use Panda’s read_sql_query function to aggregate the data and read in the results using a SQL query. The advantage, especially if you are dealing with very large data tables, is that the data is aggregated in the SQLite database first, and you are only reading in the results. After you are ready, make sure to close the connection to the database.
# import libraries import grass.script as gs import pandas as pd import sqlite3 # Read in the attribute table sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '') con = sqlite3.connect(sqlpath) sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE" df = pd.read_sql_query(sqlstat, con) con.close()
To check the data types of the columns in your dataframe, use the dtype function. This will show you that the data types for the columns cat, ROUTE and length are integer, object and float respectively. Note that the data type object is assigned to a column if it has mixed types (numbers and strings). The shape function will show you the dimensions of the table.
df.dtypes Out: cat int64 ROUTE object length float64 dtype: object df.shape Out: (13, 2)
So let’s have a look at the data. You can view the data by printing the data to the console using the print command, or if you want to show a few lines only, the head command. If you examine the whole table you will see that the names of the routes in the column ROUTE indeed consist of both numbers and strings.
df.head() Out: ROUTE totallength 0 None 534.347153 1 1 9926.863490 2 10 7209.341378 3 11 8686.935003 4 2 8933.922698
As you can see, there are one or more segments without busroute number. We can use the dropna() function in Pandas to remove all rows with in at least one of the column a missing value (this includes both NaN or None).
df = df.dropna() df.head() Out: ROUTE totallength 1 1 9926.863490 2 10 7209.341378 3 11 8686.935003 4 2 8933.922698 5 3 7756.842336
The approach above is not the only one possible. You can for example also use the GRASS function db.select to query the database (instead of the read_sql_query function we used above). Disadvantage of this approach is that it requires a bit more effort to get the data into a Dataframe. But if you do not want to introduce the extra dependency on Pandas, e.g., in a GRASS addon, this approach may be preferably, or if you need the data as a list rather than Dataframe.
# Aggregate data using SQL group by and read into Python sqlstat="SELECT ROUTE, sum(length) as totallength FROM busroutes_nc group by ROUTE" df = gs.read_command("db.select", sql=sqlstat) # Convert string in list df = [z.split('|') for z in df.split('\n')] colnames = df.pop(0) # Convert list to panda dataframe and remove rows with empty / nodata # Note that the replace statement replaces empty strings for nodata, # so we can then use dropna() to remove those rows. df = pd.DataFrame(df, columns=colnames) df['ROUTE'].replace('', pd.np.nan, inplace=True) df = df.dropna()
A second alternative is to read in the whole table using the read_sql_query function , and then use Pandas pivot_table function to aggregate the values in the column ‘length’ by ROUTE. Although it makes less sense in this particular case, you may for example use this approach when you need to do a whole range of different computations on the attribute table. The pivot_table function is ideal for number crunching and data aggregation, offering similar (and more) functionality as the pivot table in Excel and Libreoffice.
# Read in the attribute table sqlpath = gs.read_command("db.databases", driver="sqlite").replace('\n', '') con = sqlite3.connect(sqlpath) sqlstat="SELECT * FROM busroutes_nc" df = pd.read_sql_query(sqlstat, con) con.close() # Remove rows with nodata and aggregate data df = df.dropna() df.pivot_table(index='ROUTE', values='length', aggfunc="sum") print(df) ROUTE 1 9926.863490 10 7209.341378 11 8686.935003 2 8933.922698 3 7756.842336 4 8972.997628 5 8216.070478 6 10359.835495 7 5858.153859 7a 5942.113479 8 8453.168063 9 8542.797914 Name: length, dtype: float64
Besides the various links in this post, make sure to check out this very nice tutorial on using Panda’s dataframe by Datacamp. And see this tutorial by Datacarpentry if you want to know more about how to deal with data types and formats in Python.