Tuesday, 1 September 2020

Working with Open Data

A short post this time around, with loads of graphs and a very current and relevant topic, but with absolutely nothing to do with Structural Engineering. 

I've been meaning to do something with open data for a while now, and all the discussions about covid triggered me again.


Online data

It didn't even require a lot of research, as it happens. Searching for "covid open data" got me a link to a (Dutch) government website with an overview of publicly available covid data sources. From there it was just a few clicks to the European Centre for Disease Prevention and Control (ECDC), which keeps a day-to-day overview of covid-related data.

Sidenote: during this search I stumbled on the website Our World in Data. They have interesting data stories on all kinds of topics, of which https://ourworldindata.org/coronavirus and https://ourworldindata.org/coronavirus-testing are just two (covid-related) examples with some very cool infographics! Don't expect me to produce similar kinds of graphs in this blog (yet)! ;-)

Retrieving the data

For the Python scripting I'm using JupyterLab again, just like in my previous (Structural Engineering related) blog. Retrieving the data itself requires very little effort: import a few data packages and read a URL, going straight from an online CSV data file to a Pandas dataframe:
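Since the original screenshot isn't shown here, this is a minimal sketch of what such a retrieval step could look like. The URL constant and the function name load_covid_data are my own assumptions (check the ECDC site for the current download link), and the two sample rows are made up so the snippet also runs offline:

```python
import io

import pandas as pd

# Assumed location of the ECDC daily CSV export; verify the link before use.
ECDC_CSV_URL = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"


def load_covid_data(source=ECDC_CSV_URL):
    """Read a CSV (URL, file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Offline demo with two made-up rows in an assumed ECDC-like layout.
sample = io.StringIO(
    "dateRep,cases,deaths,countriesAndTerritories\n"
    "01/09/2020,10,1,Netherlands\n"
    "31/08/2020,8,0,Netherlands\n"
)
df = load_covid_data(sample)
print(df.head())
```

Calling load_covid_data() without arguments would fetch the online file instead, assuming the URL is still valid.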


As you can see, it's easy as pie to get to this data. Of course, it will need some work to be presentable :-)

Converting the data

The data types of the different columns are not usable yet, and some columns are not required (in my case). The dateRep column is converted to an actual Date/Time data type, some columns are dropped (using an array, since that'll keep the script kinda flexible) and others are renamed for ease-of-use (who came up with the original column names?):
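A rough sketch of those three steps follows. Only dateRep is named in the text; the other column names, the list of dropped columns, the new names and the day/month/year date format are assumptions about the ECDC file layout, and the rows are made up:

```python
import pandas as pd

# Made-up rows in an assumed ECDC-like layout.
raw = pd.DataFrame({
    "dateRep": ["31/08/2020", "01/09/2020"],
    "day": [31, 1],
    "month": [8, 9],
    "year": [2020, 2020],
    "cases": [8, 10],
    "deaths": [0, 1],
    "countriesAndTerritories": ["Netherlands", "Netherlands"],
    "geoId": ["NL", "NL"],
})

# Keeping these in a list and a dict makes it easy to adjust them later.
DROP_COLUMNS = ["day", "month", "year", "geoId"]
RENAME_COLUMNS = {"dateRep": "date", "countriesAndTerritories": "country"}


def tidy(df):
    """Parse the date column, drop unused columns and rename the rest."""
    out = df.copy()
    # Assumes dates are written as day/month/year.
    out["dateRep"] = pd.to_datetime(out["dateRep"], format="%d/%m/%Y")
    # errors="ignore" skips columns that are not present in the file.
    out = out.drop(columns=DROP_COLUMNS, errors="ignore")
    return out.rename(columns=RENAME_COLUMNS)


clean = tidy(raw)
print(clean.dtypes)
```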



As you can see in the bottom part of the screenshot above, this results in a nice little table with data to work with. Lastly, to be able to display results per country, I've grouped it by country:
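For illustration, grouping by country could look something like this. The column names date, country and cases are the assumed names after renaming, and the numbers are made up:

```python
import pandas as pd

# Made-up, already cleaned data.
clean = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-08-31", "2020-09-01", "2020-08-31", "2020-09-01"]
    ),
    "country": ["Netherlands", "Netherlands", "Belgium", "Belgium"],
    "cases": [8, 10, 5, 7],
})

# Sort by date first so every group is in chronological order.
by_country = clean.sort_values("date").groupby("country")

# Pull out one country, or aggregate over all of them.
nl = by_country.get_group("Netherlands")
totals = by_country["cases"].sum()
print(totals)
```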



Plotting the data

Finally we get to plot the data. I wrote the plotting code in such a way that I can easily show my own selection of data. This is done by means of the "data_plot", "c_plot" and "emphasize" parameters in the code below:
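A possible shape for such a plot function is sketched below. The three parameter names come from the text, but their meaning here is my guess: data_plot is the column to plot, c_plot the list of countries, and emphasize the countries drawn with a bold line. The function name plot_countries, the column names and the demo numbers are all made up:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_countries(df, data_plot, c_plot, emphasize=()):
    """Plot column `data_plot` over time for each country in `c_plot`."""
    fig, ax = plt.subplots(figsize=(10, 5))
    for country in c_plot:
        subset = df[df["country"] == country].sort_values("date")
        # Emphasized countries get a thicker line.
        width = 3.0 if country in emphasize else 1.0
        ax.plot(subset["date"], subset[data_plot],
                linewidth=width, label=country)
    ax.set_xlabel("date")
    ax.set_ylabel(data_plot)
    ax.legend()
    return ax


# Made-up demo data.
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-08-31", "2020-09-01", "2020-08-31", "2020-09-01"]
    ),
    "country": ["Netherlands", "Netherlands", "Belgium", "Belgium"],
    "cases": [8, 10, 5, 7],
})
ax = plot_countries(demo, "cases", ["Netherlands", "Belgium"],
                    emphasize=["Netherlands"])
```

Returning the axes object keeps the door open for tweaking titles or limits afterwards.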


Now I can, for instance, plot the 14-day cumulative number of COVID cases for the Netherlands, Belgium, France and Spain, with an emphasis (bold line) on the Netherlands and Belgium:


Or we could calculate our own 7-day and 14-day rolling averages of daily confirmed cases for some of the countries that were hit hardest by the current pandemic (ignore the last few data steps; I think I still need to study the rolling functions in Pandas a bit):
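One way to compute such rolling averages per country is sketched here; it is not necessarily how the original script does it. The function name add_rolling_means, the column names and the demo numbers are assumptions. Sorting by date first matters, because the rolling window simply follows row order:

```python
import pandas as pd


def add_rolling_means(df, column="cases", windows=(7, 14)):
    """Add per-country rolling means of `column` for each window length."""
    out = df.sort_values(["country", "date"]).copy()
    for window in windows:
        # transform keeps the original index, so the result lines up
        # with the rows of `out`. Incomplete windows give NaN.
        out[f"{column}_avg_{window}d"] = (
            out.groupby("country")[column]
               .transform(lambda s, w=window: s.rolling(w).mean())
        )
    return out


# Made-up demo data: 14 days for two countries.
dates = pd.date_range("2020-08-01", periods=14, freq="D")
demo = pd.DataFrame({
    "date": list(dates) * 2,
    "country": ["Netherlands"] * 14 + ["Belgium"] * 14,
    "cases": list(range(1, 15)) + [2] * 14,
})
averaged = add_rolling_means(demo)
print(averaged.tail())
```

With the default settings the first six rows of each country have no 7-day average yet; passing min_periods=1 to rolling would fill those in with partial-window means instead.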



As you can see, the combination of Python (in JupyterLab) and open data is amazing! Almost real-time (open) data at your disposal with just a few lines of code.

Not sure how to use it in my daily structural engineering work yet, but we'll find a good use for it! :-) If you have good suggestions, feel free to post them below.

FYI, some other interesting open data links I came across:

Link to my Github gist for this script: