Import census data
A good way to start with geospatial data analysis is to practice on census data, which gives a granular picture of the people and households in a country.
In this tutorial, we are going to use a data set from the UK Data Service that provides the number of cars or vans in the UK. The link to the data set is here.
I’ll start with a dataset that contains no geographic information:
Each row in the data set corresponds to a specific output area, which is the lowest geographic level at which census data is provided in the UK. There are three columns: the geographic code, the country, and the number of cars or vans owned by one or more members of a household.
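As a minimal sketch of this step, the table can be loaded with pandas. The rows below are made up and stand in for the real file, and the column name n_cars_vans is a hypothetical label for the vehicle count (only geo_code and Country appear in this article):

```python
import io

import pandas as pd

# Synthetic stand-in for the downloaded CSV; codes and counts are made up.
csv_text = """geo_code,Country,n_cars_vans
E00000001,England,120
W00000001,Wales,95
N00000001,Northern Ireland,210
"""

# With the real file, pass its path to pd.read_csv instead of the buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the two codes are text, the count is numeric
```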
If we wanted to visualize the map right now, we wouldn’t be able to because we don’t have the necessary geographic information. We need one more step before showing the potential of GeoPandas.
Add geometry to census data
To visualize our census data, we need to add a column that stores geographic information. The process for adding geographic information, for example, adding latitude and longitude for each city, is called geocoding.
In this case, it’s not just one coordinate pair: each output area is described by several coordinate pairs that are connected and closed, forming its boundary. We need to download the shapefile from this link, which provides the boundary for each output area.
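To make the idea of a closed boundary concrete, here is a small sketch with shapely, the geometry library GeoPandas builds on. The square is a made-up shape, not a real output area:

```python
from shapely.geometry import Polygon

# Four corner coordinates of a made-up square "output area".
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
boundary = Polygon(corners)

# shapely closes the ring by repeating the first coordinate at the end.
ring = list(boundary.exterior.coords)
print(len(ring))            # one more than the number of corners
print(ring[0] == ring[-1])  # first and last coordinate coincide
print(boundary.area)
```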
Once the dataset is imported, we can merge these two tables using their common field, geo_code:
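The merge could look roughly like the sketch below. Both tables are made up, n_cars_vans is a hypothetical column name, and the boundaries frame stands in for what gpd.read_file would return for the shapefile:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Made-up census rows (no geometry).
census = pd.DataFrame({
    "geo_code": ["A1", "A2", "A3"],
    "Country": ["Northern Ireland"] * 3,
    "n_cars_vans": [150, 220, 90],
})

# Made-up boundaries, one square per output area.
boundaries = gpd.GeoDataFrame(
    {"geo_code": ["A1", "A2", "A3"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2)],
)

# A left join keeps every census row; validate raises an error if a
# geo_code appears more than once on either side and would duplicate rows.
df = census.merge(boundaries, on="geo_code", how="left", validate="one_to_one")

print(len(df) == len(census))  # the row count should be unchanged
```

Passing validate is optional; it is used here only as a cheap guard against accidental row duplication.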
After verifying that the number of rows did not change after the left join, we need to check for null values in the new column:
df.geometry.isnull().sum()
# 0
Fortunately, there are no null values, so we can convert our data frame to a GeoDataFrame using the GeoDataFrame class, passing the geometry column as the geodataframe’s geometry:
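A minimal sketch of that conversion, on a made-up merged table (the column n_cars_vans is again a hypothetical name):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Made-up merged table: attributes plus a column of shapely polygons.
df = pd.DataFrame({
    "geo_code": ["A1", "A2"],
    "n_cars_vans": [150, 220],
    "geometry": [box(0, 0, 1, 1), box(1, 0, 2, 1)],
})

# Tell GeoPandas which column holds the geometries.
gdf = gpd.GeoDataFrame(df, geometry="geometry")

print(type(gdf).__name__)
print(gdf.geometry.geom_type.unique())
```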
Now, geographic and non-geographic information are combined in a single table. All geographic information is contained in a single field, called geometry. As in a normal dataframe, we can print the information of this geodataframe:
From the output, we can see that our geodataframe is an instance of geopandas.GeoDataFrame and that the geometry column is encoded using the geometry type. For a better understanding, we can also show the type of the geometry column in the first row:
type(gdf.geometry[0])
# shapely.geometry.polygon.Polygon
It is important to know that there are three common classes of geometric objects: points, lines, and polygons. In our case, we are dealing with polygons, which makes sense since they are the boundaries of the output areas. The data set is now ready, and we can start creating good visualizations.
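The three classes can be tried out directly in shapely. The coordinates below are arbitrary and only illustrate how the objects differ:

```python
from shapely.geometry import LineString, Point, Polygon

point = Point(1.0, 2.0)
line = LineString([(0, 0), (1, 1), (2, 0)])
polygon = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

# Points have neither length nor area, lines have only length,
# and polygons have both a perimeter (length) and an area.
for geom in (point, line, polygon):
    print(geom.geom_type, geom.length, geom.area)
```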
Create a map with GeoPandas
Now we have all the ingredients to visualize the map with GeoPandas. One drawback of GeoPandas is that it slows down on large amounts of data, and we have more than 200 thousand rows, so we will focus only on the Northern Ireland census data:
gdf_ni = gdf.query('Country == "Northern Ireland"')
To create a map, you just need to call the plot() method on the GeoDataFrame:
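A minimal sketch of such a call, using a few made-up squares in place of the Northern Ireland output areas:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Made-up output areas.
gdf_demo = gpd.GeoDataFrame(
    {"geo_code": ["A1", "A2", "A3"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2)],
)

# plot() draws every polygon on a matplotlib axis.
fig, ax = plt.subplots(figsize=(6, 6))
gdf_demo.plot(ax=ax, edgecolor="black", facecolor="lightgrey")
ax.set_axis_off()
```

Creating the axis explicitly is a design choice: it makes it easy to layer more data on the same map later.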
We would also like to see how the number of cars/vans is distributed within Northern Ireland by coloring each output area based on its count:
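Coloring by a numeric column could be sketched as follows. The data is made up, n_cars_vans is a hypothetical column name, and the color map is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Made-up output areas with a hypothetical vehicle-count column.
gdf_demo = gpd.GeoDataFrame(
    {"n_cars_vans": [150, 220, 90, 400]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 1, 2), box(1, 1, 2, 2)],
)

# column selects the values to color by; legend=True adds a color bar.
fig, ax = plt.subplots(figsize=(6, 6))
gdf_demo.plot(column="n_cars_vans", cmap="viridis", legend=True, ax=ax)
ax.set_axis_off()
```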
From this map, we can see that most of the areas have around 200 vehicles, except for small areas marked in green.
Extract centroid of geometry
Suppose we want to change the geometry and have the coordinates at the center of the output areas, instead of the polygons. This is possible by using the gdf.geometry.centroid property to calculate the centroid of each output area:
gdf_ni['centroid'] = gdf_ni.geometry.centroid
gdf_ni.sample(3)
If we display the data frame information again, we can notice that both geometry and centroid are encoded as geometry types.
The best way to understand what we actually got is to visualize both the geometry and the centroid columns on a single map. To plot the centroids, it is necessary to change the active geometry using the set_geometry() method.
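One way to overlay both layers is sketched below on made-up squares. The colors and marker size are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Made-up output areas.
gdf_demo = gpd.GeoDataFrame(
    {"geo_code": ["A1", "A2"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
)
gdf_demo["centroid"] = gdf_demo.geometry.centroid

fig, ax = plt.subplots(figsize=(6, 6))
# First layer: the polygons (still the active geometry).
gdf_demo.plot(ax=ax, facecolor="none", edgecolor="black")
# Second layer: set_geometry returns a copy whose active geometry is the
# centroid column, so plot() now draws points on the same axis.
centroids = gdf_demo.set_geometry("centroid")
centroids.plot(ax=ax, color="red", markersize=20)
```

Because set_geometry returns a new object by default, gdf_demo itself keeps the polygons as its active geometry.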
Create more complex maps
There are some advanced features to display more details on the map without creating any additional column. We have shown the number of cars or vans in each output area before, but the result was more confusing than informative. It would be better to create a categorical feature based on our numeric column. With GeoPandas, we can skip that step and plot it directly. By specifying the argument scheme='equal_interval', we can create car/van classes based on equal intervals.
The map didn’t change much, but you can see that the legend is much clearer compared to the previous version. A better way to visualize the map would be to color it based on the levels created using quantiles:
Now it is possible to detect more variability within the map, since each level contains a roughly equal number of areas. It is worth noting that most of the areas belong to the last two levels, corresponding to the largest number of vehicles. In the first visualization, 200 vehicles seemed like a low number, but in fact a number of outliers with very high counts stretched the color scale and distorted our interpretation.
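The difference between the two schemes can be illustrated with plain pandas on made-up counts containing one outlier. This is only an illustration of the binning idea, not what GeoPandas runs internally (it relies on mapclassify):

```python
import numpy as np
import pandas as pd

# Made-up vehicle counts with a single large outlier.
counts = [10, 20, 30, 40, 50, 1000]

# Equal intervals: three bins of the same width between min and max.
equal_codes = pd.cut(counts, bins=3, labels=False)
equal_sizes = np.bincount(equal_codes, minlength=3)

# Quantiles: three bins holding about the same number of values.
quantile_codes = pd.qcut(counts, q=3, labels=False)
quantile_sizes = np.bincount(quantile_codes, minlength=3)

print(equal_sizes)     # the outlier pushes almost everything into one bin
print(quantile_sizes)  # the values are spread evenly across the bins
```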
At this point, we would also like to have a background map to put our results in context. The most popular way to do this is the contextily library, which fetches basemap tiles. These tiles use the Web Mercator coordinate reference system (EPSG:3857), so we need to convert our data to this CRS. The code to render the map remains the same, except for one extra line that adds the basemap with contextily:
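A sketch of both steps is shown below. The two squares near Belfast are made up, n_cars_vans is a hypothetical column name, and plot_with_basemap is a helper introduced only for this illustration. It needs the contextily package and network access, so the call is left commented out:

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up squares near Belfast, given in longitude/latitude (EPSG:4326).
gdf_demo = gpd.GeoDataFrame(
    {"n_cars_vans": [150, 220]},
    geometry=[box(-5.95, 54.58, -5.93, 54.60), box(-5.93, 54.58, -5.91, 54.60)],
    crs="EPSG:4326",
)

# Reproject to Web Mercator so the polygons line up with the tiles.
gdf_web = gdf_demo.to_crs(epsg=3857)


def plot_with_basemap(gdf):
    """Draw a choropleth and add web tiles behind it (needs network access)."""
    # Imported here so the reprojection above runs without contextily.
    import contextily as cx

    ax = gdf.plot(column="n_cars_vans", alpha=0.6, legend=True, figsize=(8, 8))
    cx.add_basemap(ax)  # the one extra line: fetches and draws the tiles
    return ax


# plot_with_basemap(gdf_web)
```

A partially transparent fill (alpha) is used so the basemap stays visible under the polygons.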
That’s great! Now we have a more professional and detailed map!
Final thoughts
This was an introductory tutorial to start practicing with geospatial data using Python. GeoPandas is a Python library specialized in working with vector data. It is easy and intuitive to use, since its properties and methods are similar to those of Pandas, but it becomes very slow as the amount of data grows, particularly when plotting.
Another downside is that it depends on the Fiona library to read and write vector data formats: if Fiona doesn’t support a format, GeoPandas can’t support it either. One solution is to combine GeoPandas for data manipulation with QGIS for displaying the map, or to try other Python libraries for visualization, like Folium. Do you know of other alternatives? Suggest them in the comments.
The code can be found here. I hope you found the article useful. Have a nice day!