In 2018, I was working for the French maritime affairs administration, part of the Ministry of Ecology. France has the second largest sea territory in the world. Every year, the sea rescue organisation carries out around 13,000 missions, saving ~5,600 people and assisting 14,500 more. These search and rescue and assistance missions lead to the engagement of 10,500 vessels and 1,500 helicopters or planes.
One of our goals was to improve knowledge of, and collaboration among, the numerous actors involved in the sea rescue process. One of the first proposals to achieve that goal was to open the sea search and rescue data: to make raw data available for everyone to use, without constraints. In July 2018, we published more than 250,000 sea rescue operations carried out since 1985 on France’s official open data platform, data.gouv.fr.
To me, this dataset can be considered a “high quality open dataset”. I will now explain in further detail how it was made.
Raw data
Even on a politically sensitive subject like sea rescue, we chose to publish raw data: the alert, the ships involved, weather details, precise location, the vessels, vehicles or helicopters engaged, as well as what happened to the various people involved. One row per operation, in 4 tables, totalling 120 columns. It gives enough information, even for professionals and agencies, while taking into account national security and data privacy.
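To give an idea of the shape of the published files, here is a sketch of the layout. The table and column names below are purely illustrative, not the dataset’s actual identifiers.

```python
# Hypothetical layout: 4 tables keyed by a shared operation identifier.
# All names here are illustrative, not the dataset's actual identifiers.
TABLES = {
    "operations": ["operation_id", "alert_datetime", "latitude", "longitude",
                   "sea_state", "wind_force"],
    "vessels": ["operation_id", "vessel_type", "flag"],
    "resources": ["operation_id", "resource_type", "duration_minutes"],
    "outcomes": ["operation_id", "people_category", "result"],
}
```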
Designing for reusers
The original database schema was made of close to 30 tables. Acquiring the data from the various actors involved in sea rescue was challenging because of differences in technical vocabularies. We needed to build a simple schema that merged the information and that everyone could easily understand.
We ended up with just 4 tables, without losing crucial information. Out of the 4 tables, 3 contain raw data, as filled in by agents during and after operations. The last table builds on data already available in the other tables and makes it more convenient to use: we perform common aggregates and filters and add convenience columns, splitting dates, converting units, adding bank holidays, sunset times etc.
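As an illustration, here is a minimal sketch of the kind of derivation behind this convenience table, assuming pandas and illustrative column names (the bank holiday and sunset joins are left out):

```python
import pandas as pd

# Illustrative column names: this is a sketch, not the dataset's schema.
operations = pd.read_csv("operations.csv", parse_dates=["alert_datetime"])

stats = operations[["operation_id", "alert_datetime", "wind_speed_knots"]].copy()

# Split the alert date into parts that are easier to filter and group on.
stats["year"] = stats["alert_datetime"].dt.year
stats["month"] = stats["alert_datetime"].dt.month
stats["weekday"] = stats["alert_datetime"].dt.day_name()

# Convert units into something more familiar to most readers.
stats["wind_speed_kmh"] = stats["wind_speed_knots"] * 1.852

# Bank holiday flags and sunset times come from joins with reference
# datasets, left out of this sketch.
stats.to_csv("operations_stats.csv", index=False)
```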
As publishers, we used this dataset a lot for our own analyses. The same dataset is used internally to prepare reports, investigate new regulations and plan prevention actions. We matured the documentation and schema by training people to perform queries on the dataset: people asked for clarifications, suggested new columns or reported unclear documentation. Working in close collaboration with sea rescue experts helped us a lot to improve the dataset’s quality.
Open source processes
In the end, our data is made available online under an open licence. But what about the code written for extraction and transformation? We thought it was important to make this code open source, so that people can see how we build the dataset, report bugs or suggest improvements. The code is published on GitHub. We benefited from this choice: people got in touch with us through this medium, and knowing the code was publicly available, we felt more accountable.
In this repository, we publish the code written to extract the original data from an Oracle database, transform it, add columns, join it with other datasets and prepare the final files, which end up on the open data platform.
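In spirit, the extraction step looks like the sketch below, assuming the python-oracledb driver and pandas; the connection details, table name and file name are placeholders, not our actual setup:

```python
import oracledb  # assuming the python-oracledb driver
import pandas as pd

# Connect to the operational Oracle database (credentials are placeholders).
connection = oracledb.connect(user="reader", password="secret", dsn="db-host/service")

# Extract one source table, normalise it and write the CSV that will be
# uploaded to the open data platform.
frame = pd.read_sql("SELECT * FROM source_operations", connection)  # hypothetical table
frame.columns = [column.lower() for column in frame.columns]
frame.to_csv("operations.csv", index=False)
```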
Documentation
Good open data comes with documentation, right? We tried to follow this principle and went a bit further. In the web documentation, we explain how sea rescue works in France and how agents are asked to fill in forms when a situation is unclear; changes to the dataset are linked to the relevant code commits; and we document schemas, tables, unique values in key columns, sample queries etc.
We felt all these pages were important and useful for dealing with the complexity of the data, which reflects the reality of sea rescue operations.
Software engineering and open data
We tried to apply modern software engineering practices to our open data work. For us, this means version control, pull requests with reviews, tests, pipelines, monitoring, continuous integration and data quality checks. For example, tests make it impossible to add a column without documentation or without an end-to-end extraction/transformation test in place. Before publishing new data, we also perform general quality tests to prevent serious regressions.
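As an example, a guardrail of this kind can be written as a simple test; the file paths and documentation format below are assumptions, not our actual repository layout:

```python
import pandas as pd

# Run with pytest. Every column shipped in the final CSV must be
# described in the documentation, otherwise the build fails.
def test_every_column_is_documented():
    documented = set(pd.read_csv("docs/columns.csv")["column_name"])
    published = set(pd.read_csv("operations.csv", nrows=0).columns)
    missing = published - documented
    assert not missing, f"Undocumented columns: {sorted(missing)}"
```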
Thanks to this, we are able to publish this dataset daily with confidence, just the day after an operation took place.
We believe publishing an accidentology open dataset daily, at country level, without any human intervention, is quite unique.
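The publishing step itself can be as small as an authenticated upload to the platform. A minimal sketch, assuming data.gouv.fr’s resource upload API; the route, identifiers and key below are placeholders:

```python
import requests

# Placeholders: substitute the real dataset and resource identifiers.
API = "https://www.data.gouv.fr/api/1"
DATASET_ID = "dataset-id"
RESOURCE_ID = "resource-id"

# Replace the resource's file with today's freshly built CSV.
with open("operations.csv", "rb") as f:
    response = requests.post(
        f"{API}/datasets/{DATASET_ID}/resources/{RESOURCE_ID}/upload/",
        files={"file": f},
        headers={"X-API-KEY": "secret-api-key"},
    )
response.raise_for_status()
```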
Interactive map
Not everyone is comfortable using CSV files with hundreds of thousands of lines. Most people want to apply a few filters (specific operation types, people involved, date, zone) and see common stats. Since sea rescue operations are geographic data, it made sense to offer an interactive map. We decided to make this interactive map available on the Internet, without restrictions.
It makes it much easier for people to take a quick look at the data. They can apply filters, explore, and then download the exact dataset they see on their screen for further investigation.
As with the raw data, the map is open source and documentation is available. When people export data from this app, the schema is the same as the open data dataset’s.
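Conceptually, an export from the map is therefore equivalent to filtering the raw files yourself; a sketch with illustrative column names:

```python
import pandas as pd

# What the map's export boils down to: the same rows and columns as the
# open data files, restricted to the user's filters (names illustrative).
operations = pd.read_csv("operations.csv", parse_dates=["alert_datetime"])

subset = operations[
    (operations["alert_datetime"].dt.year == 2017)
    & (operations["operation_type"] == "rescue")
]
subset.to_csv("export.csv", index=False)  # same schema as the raw dataset
```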