High-quality open data: the making of the French sea rescue operations dataset


In 2018, I was working at the French directorate for maritime affairs, part of the Ministry of Ecology. France has the second largest maritime territory in the world and carries out around 13,000 sea rescue missions per year, saving ~5,600 people and assisting 14,500 more. These search and rescue and assistance missions lead to the engagement of 10,500 vessels and 1,500 helicopters or planes each year.

The SNSM (national society for sea rescue) and Marine Nationale out at sea

One of our goals was to improve knowledge of, and collaboration among, the numerous actors involved in the sea rescue process. One of the first proposals to achieve that goal was to open the sea search and rescue data, making raw data available for everyone to use, without constraints. In July 2018, we published more than 250,000 sea rescue operations carried out since 1985 on France’s official open data platform, data.gouv.fr.

To me, this dataset can be considered a “high-quality open dataset”. I will now explain in more detail how it was made.

Raw data

Even on a politically sensitive subject like sea rescue, we chose to publish raw data: the alert, the ships involved, weather conditions, the precise location, the vessels, vehicles or helicopters engaged, as well as what happened to the various people involved. One row per operation, across 4 tables totalling 120 columns. This gives enough information, even for professionals and agencies, while taking national security and data privacy into account.

Designing for reusers

The original database schema was made of close to 30 tables. Acquiring the data from the various actors involved in sea rescue was challenging because of differences in technical vocabularies. We needed to build a simple schema, merging the information into a form that everyone could easily understand.

We ended up with just 4 tables, without losing crucial information. Out of the 4 tables, 3 contain raw data, as filled in by agents during and after operations. The last table takes data already available in the other tables and makes it more convenient to use: we perform common aggregates and filters, and add convenience columns by splitting dates, converting units, adding bank holidays, sunset times etc.
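To give a concrete idea of what these convenience columns look like, here is a minimal sketch in Python with pandas. It is not the actual pipeline code: the column names and the hard-coded bank holidays are hypothetical, for illustration only.

import pandas as pd

# Hypothetical column names, for illustration only.
operations = pd.DataFrame({
    "alert_datetime": pd.to_datetime(["2018-07-14 15:30:00", "2018-12-25 09:05:00"]),
    "distance_nm": [12.0, 3.5],  # distance in nautical miles
})

# Split the alert timestamp into convenience columns.
operations["year"] = operations["alert_datetime"].dt.year
operations["month"] = operations["alert_datetime"].dt.month
operations["weekday"] = operations["alert_datetime"].dt.day_name()

# Convert nautical miles to kilometres (1 NM = 1.852 km).
operations["distance_km"] = operations["distance_nm"] * 1.852

# Flag bank holidays (hard-coded here; a calendar table is more realistic).
bank_holidays = {pd.Timestamp("2018-07-14"), pd.Timestamp("2018-12-25")}
operations["is_bank_holiday"] = operations["alert_datetime"].dt.normalize().isin(bank_holidays)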

As publishers, we used this dataset a lot for our own analyses. The same dataset is used internally to prepare reports, investigate new regulations and prepare prevention actions. We matured the documentation and schema by training people to perform queries on this dataset: people asked for clarifications, suggested new columns or reported unclear documentation. Working in close collaboration with the sea rescue experts helped us a lot to improve the quality of the dataset.

Open source processes

Our data ends up available online, under an open licence. But what about the code written for extraction and transformation? We thought it was important to make this code open source, so that people can see how we build the dataset, report bugs or suggest improvements. The code is published on GitHub. We benefited from this choice: people got in touch with us through this medium and, knowing the code was publicly available, we felt more accountable.

In this repository, we publish the code written to extract the original data from an Oracle database, transform it, add columns, join it with other datasets and prepare the final files, which end up on the open data platform.
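As a rough illustration of what such an extract-transform-load step can look like, here is a minimal sketch in Python with SQLAlchemy and pandas. The connection string, table and file names are hypothetical; the real pipeline is in the GitHub repository.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Oracle connection string.
engine = create_engine("oracle+cx_oracle://user:password@host:1521/?service_name=SAR")

# Extract: read the raw operations table from Oracle.
raw = pd.read_sql("SELECT * FROM operations", engine)

# Transform: rename columns to the published schema and join reference data.
fleet = pd.read_csv("fleet_reference.csv")  # hypothetical reference dataset
published = raw.rename(columns={"OPERATION_ID": "operation_id"})
published = published.merge(fleet, on="operation_id", how="left")

# Load: write the final file destined for the open data platform.
published.to_csv("operations.csv", index=False)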

Documentation

Good open data comes with documentation, right? We tried to follow this principle and went a bit further. In the web documentation, we explain how sea rescue works in France and how people are asked to fill in forms when the situation is unclear; changes to the dataset are linked to the relevant code commits; and we document schemas, tables, unique values in key columns, sample queries etc.

We felt all these pages were important and useful to deal with the complexity of the data, reflecting the reality of sea rescue operations.

UML schema describing how tables fit together

Software engineering and open data

We tried to apply modern software engineering practices to our open data work. For us, this means: version control, pull requests with reviews, tests, pipelines, monitoring, continuous integration, data quality checks. For example, tests make it impossible to add an extra column without documentation or without an end-to-end extraction/transformation test in place. Before publishing new data, we also perform general quality tests to prevent serious regressions.
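For instance, a check like the following can enforce that every published column is documented. This is a minimal sketch assuming pytest and a YAML documentation file, with hypothetical file names; the real checks live in the project repository.

import pandas as pd
import yaml

def test_every_column_is_documented():
    # Hypothetical file names: the published CSV and its documentation file.
    published = pd.read_csv("operations.csv", nrows=0)
    with open("documentation.yml") as f:
        documented = set(yaml.safe_load(f)["columns"])
    undocumented = set(published.columns) - documented
    assert not undocumented, f"Columns missing documentation: {undocumented}"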

One of the transformation pipelines, using Apache Airflow

Thanks to this, we are able to confidently publish this dataset daily, just one day after a mission took place.

We believe it is quite unique to publish an accidentology open dataset on a daily basis, at a country level, without any human intervention needed.
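For readers unfamiliar with Apache Airflow, a daily pipeline like this boils down to a DAG of extract, transform and publish tasks. The sketch below uses the Airflow 1.x API with hypothetical task bodies; it is not the actual DAG.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical task bodies standing in for the real extraction,
# transformation and publication code.
def extract():
    pass

def transform():
    pass

def publish():
    pass

dag = DAG(
    "sea_rescue_operations",
    start_date=datetime(2018, 7, 1),
    schedule_interval="@daily",  # run once a day
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
publish_task = PythonOperator(task_id="publish", python_callable=publish, dag=dag)

# Tasks run in sequence: extract, then transform, then publish.
extract_task >> transform_task >> publish_task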

Interactive map

Not everyone is comfortable using CSV files with hundreds of thousands of lines. Most people want to apply some filters (specific operation types, people involved, date, zone) and see common stats. As sea rescue operations are geographic data, it made sense to offer an interactive map. We decided to make this interactive map available on the Internet, without restrictions.

Interactive map of French sea rescue operations

It makes it much easier for people to take a quick look at the data. People can apply filters, explore, and then download exactly the dataset they see on their screen for further investigation.

As with the raw data, the map is open source and documentation is available. When people export from this app, the schema is the same as the open data dataset.
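Reusers who prefer code over the map can apply the same kind of filters directly on the published CSV. Here is a minimal sketch with pandas, where the column names and filter values are hypothetical:

import pandas as pd

operations = pd.read_csv("operations.csv", parse_dates=["alert_datetime"])

# Hypothetical filter: operations during summer 2018 in a given zone.
subset = operations[
    (operations["alert_datetime"].dt.year == 2018)
    & (operations["alert_datetime"].dt.month.isin([6, 7, 8]))
    & (operations["zone"] == "Atlantique")
]
print(len(subset), "operations match the filters")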

 

Joining the public sector as a digital professional


Recently, as part of my job, I had the chance to run a survey aimed at digital professionals who are interested in joining the public service at some point in their career. The civil service is often criticised by these professionals: waterfall methods, lengthy processes, misunderstood professions, unattractive salaries, little remote work. That is the price to pay for a job that serves the general interest and gives meaning to your work. Ben Balter wrote a very relevant blog post on this subject: 19 reasons why technologists don’t want to work at your government agency.

Discussions are under way inside the administration on how to attract the necessary talent. Professionals who came from the private sector regularly point out the areas that deserve attention. This survey was an opportunity to show that this topical subject is being addressed and that the administration listens to the people most concerned: the professionals who want to join the public service one day or another. You can find the survey, its main results and the raw data of the submissions received in a blog post on Etalab.

Laravel package for the recommendation system Easyrec


Over a few days in August, I coded a Laravel wrapper for the recommendation system Easyrec. If you want to display most viewed items, best rated items, related items or something like that, a recommendation system is the perfect way to go. If you are familiar with machine learning techniques, you know that building a good recommendation system is very difficult. Remember that Netflix offered $1,000,000 to whoever could improve its collaborative filtering algorithm.

Easyrec provides a REST API that you can call for free. This is very convenient if you don’t have a lot of data but still want to use a recommender system for your web service.
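Outside Laravel, you can also call this REST API directly. Here is a minimal sketch in Python with the requests library, assuming the public demo instance; the API key and tenant ID are placeholders you receive when registering your website with Easyrec.

import requests

# Placeholder credentials, obtained when registering a tenant with Easyrec.
params = {
    "apikey": "YOUR_API_KEY",
    "tenantid": "YOUR_TENANT_ID",
    "numberOfResults": 10,
    "timeRange": "WEEK",
}

# Retrieve the 10 most viewed items of the last week, as JSON.
response = requests.get("http://demo.easyrec.org/api/1.0/json/mostvieweditems", params=params)
print(response.json())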

Features overview

After registering the service provider and the alias in your app/config/app.php file, it will be super easy to use! Here is a little overview of the available functions:

Easyrec::view(42, 'Post 42', 'http://example.com/posts/42', 200, null, null, 'POST'); // User #200 has viewed post #42
Easyrec::rate(42, 8, 'Book 42', 'http://example.com/books/42', 200, null, null, 'BOOK'); // User #200 has rated 8/10 book #42
Easyrec::mostViewedItems(10, 'WEEK', 'BOOK'); // Retrieves the 10 most viewed books for the last week
Easyrec::bestRatedItems(10, 'MONTH', 'POST'); // Retrieves the 10 best rated posts for the last month

Documentation and download

Of course the package is available via Composer. The full documentation can be found on GitHub: github.com/AntoineAugusti/laravel-easyrec.

Do not hesitate to open an issue if something is not working as expected. Pull requests are also very welcome!

A study of Teen Quotes data


Teen Quotes?

Teen Quotes is a project I have been working on since November 2010. In a few words, Teen Quotes is a website gathering quotes from teenagers’ daily lives about their interests: school, friends, first loves, first heartbreaks. Teen Quotes is entirely in English, has already recorded more than 1.5M visitors and is present on Twitter, Facebook, the web, the mobile web and the App Store.

Yum, data!

Prompted by a project for my studies at INSA de Rouen, I decided to take the time to carry out a more complete analysis of the data associated with Teen Quotes. Since Teen Quotes is already an open project (most of its source code is freely available), it seemed logical for this data study to be public.

Without further ado, here are the links to this study: