Collecting and storing data without infrastructure

Over the last few months, I’ve repeatedly used a pattern to collect and store data without any infrastructure. It comes in handy when you need to build a dataset by running a script on a regular basis and saving the results to a file. Often, I need to scrape data from a third-party source (a webpage, an API, etc.) on a regular basis (up to multiple times per hour) and accumulate the resulting data over time in a tabular file.

If you already have infrastructure in place, typically a workflow management platform like Apache Airflow, I recommend using it instead. But if your need is relatively simple and isolated, consider this approach.

My pattern is to leverage Git and a continuous integration platform. The cloud version of this pattern uses GitHub and GitHub Actions. The Git repository holds the script that fetches and stores the data, and you also version your tabular file (a fancy name for a CSV) in it. You then schedule jobs on your favourite continuous integration platform to run this script, which takes care of computing the data and storing it. After that, all you need to do is git commit and git push to version your data over time in the Git repository.
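As a rough illustration, a scheduled GitHub Actions workflow for this pattern could look like the sketch below. The file path, the `fetch.py` script and the `data.csv` file are placeholder names, not taken from a real repository.

```yaml
# .github/workflows/collect.yml — a minimal sketch; fetch.py and data.csv
# are placeholder names for your own script and tabular file.
name: Collect data

on:
  schedule:
    - cron: "0 * * * *"  # run at the top of every hour

permissions:
  contents: write  # allow the workflow to push commits back to the repository

jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install requests
      - run: python fetch.py  # fetches the data and appends it to data.csv
      - name: Commit and push the updated CSV
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add data.csv
          git diff --cached --quiet || git commit -m "Update data"
          git push
```

The `git diff --cached --quiet || git commit` line simply skips the commit when the fetched data hasn’t changed, so runs that produce nothing new don’t fail.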

From here, you have:

  • your code to compute data and store it;
  • a continuous integration configuration file to run your code on a regular basis;
  • a log of your continuous integration runs;
  • a versioned history of your tabular file, with the latest version always up to date on your master branch.

If you’re using GitHub and are okay with exposing your code and data, you can have all of this for free! You get access to compute and storage without thinking about infrastructure. If you don’t want to expose your code or your data, you can run this in a private repository or on your own infrastructure.

Now that your data lives in a Git repository, nothing prevents you from publishing it over HTTPS or building a simple read-only API on top of your files. The simplest way to do this is to leverage GitHub Pages, or you can use Netlify to serve a more advanced JSON API, as I’ve written before.
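As a small illustration, once the CSV sits in a public repository it is already reachable over HTTPS through the raw file URL, so any consumer can read the latest version directly. The repository path and file name below are placeholders.

```python
# A minimal sketch of consuming the published CSV over HTTPS.
# The raw URL below is a placeholder; point it at your own repository.
import csv
import io

import requests

RAW_CSV = "https://raw.githubusercontent.com/<user>/<repo>/master/data.csv"

response = requests.get(RAW_CSV, timeout=30)
response.raise_for_status()

rows = list(csv.DictReader(io.StringIO(response.text)))
print(f"{len(rows)} rows, most recent: {rows[-1]}")
```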

Example repository

If you’re interested in this pattern and want to see how to get it running, I recently made a repository doing exactly this. A Python script fetches data from a JSON endpoint and appends the data to a CSV file. The value lies in the resulting time series, if you run it often and over multiple weeks. Take a look at the repository, especially the GitHub Actions workflow and the Python script.
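To give a sense of what such a script involves, here is a minimal sketch of the fetch-and-append step. It is not the repository’s actual code; the endpoint URL and the field it reads are hypothetical.

```python
# fetch.py — a minimal sketch of the fetch-and-append step; the endpoint
# and the "value" field are hypothetical, not the example repository's.
import csv
from datetime import datetime, timezone
from pathlib import Path

import requests

ENDPOINT = "https://example.com/stats.json"  # hypothetical JSON endpoint
CSV_PATH = Path("data.csv")


def main() -> None:
    response = requests.get(ENDPOINT, timeout=30)
    response.raise_for_status()
    payload = response.json()

    row = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "value": payload["value"],  # hypothetical field from the endpoint
    }

    # Append one row per run; write the header only when the file is new.
    is_new = not CSV_PATH.exists()
    with CSV_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    main()
```

Each scheduled run adds one row, so the CSV naturally grows into the time series mentioned above.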

PS: Git may not be your best choice when storing large data files. In that case, take a look at DVC – Data Version Control.