Discovering content caching with NGINX and proxy_cache

Lately, I’ve had the chance to learn how to cache HTTP responses from proxied servers using NGINX and its proxy_cache configuration directives. The web application in question shows property sales in France, data that was recently released as open data. It covers 15 million real estate sales (houses, flats, land, forests, etc.) over 5 years. The application, built by Etalab, received national press coverage because it is hosted on an official government domain and was introduced by the French Minister of Public Action and Accounts.

The web application consists of a Python backend talking to a PostgreSQL database, and a standard map interface with filters and several zoom levels. You can see a demo of the first version of the application in video and browse the code created by Marion Paclot. NGINX handles about 2,500 requests/minute during the day, with peaks of 5,000 to 6,000 requests/minute; the analytics are publicly available. Since people mainly browse their own neighbourhood, it’s important to keep densely populated areas in cache.

The goal was to handle this load with a single server with 8 cores and 32 GB of RAM. NGINX makes this possible thanks to its proxy cache: we can serve the application with a load average of 1 to 3, an average RAM usage of 3 GB and a proxy cache size of 8 GB. You’ll find the commented NGINX configuration below.

# Define a cache of up to 10 GB, with a 10 MB shared memory zone (dvf)
# holding the cache keys. Entries that have not been accessed for 2 days
# are deleted.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=dvf:10m max_size=10g inactive=2d use_temp_path=off;
# Key used by default in this configuration, for example:
# md5(GETapp.dvf.etalab.gouv.fr/api/mutations/75101/000AI/from=01-01-2014&to=30-06-2018)
proxy_cache_key "$request_method$host$request_uri";

# Rate limiting by IP: a 50 MB shared memory zone to track client
# addresses, limited to 10 requests per second
limit_req_zone $binary_remote_addr zone=hit_per_ip:50m rate=10r/s;

server {
  server_name app.dvf.etalab.gouv.fr;
  root /srv/dvf/static;

  # Serve GeoJSON files directly, with a browser cache of 30 days
  location ~ ^/(cadastre|donneesgeo) {
    expires 30d;
    access_log off;
    add_header Cache-Control "public";
  }

  # Cache static files (index.html / *.js) for only 30s in the proxy
  # cache and 1 minute in the browser cache, so that changes can be
  # deployed quickly.
  location / {
    expires 1m;
    add_header Cache-Control "public";

    proxy_cache dvf;
    proxy_cache_valid 200 30s;
    # Change the default cache key to drop the query parameters. News
    # sites often add query parameters we are not interested in to
    # links to the index, and these could bust our cache.
    proxy_cache_key "$request_method$host$request_filename";
    include includes/dvf_proxy.conf;
  }

  # The API returns the sales that happened in a specific geographic area
  # between 2 dates.
  # This is where we need to talk to Python + PostgreSQL. Keep API
  # responses in cache for 1 day and set the browser cache to 12 hours.
  # Allow a burst of up to 50 requests above the limit of 10 requests /
  # second. With nodelay, these requests are served immediately instead
  # of being queued; anything beyond the burst is rejected with a 429.
  location /api {
    expires 12h;
    add_header Cache-Control "public";

    limit_req zone=hit_per_ip burst=50 nodelay;
    limit_req_status 429;

    proxy_cache dvf;
    proxy_cache_valid 200 1d;
    include includes/dvf_proxy.conf;
  }

  listen 443 ssl http2; # managed by Certbot
  ssl_certificate /etc/letsencrypt/live/app.dvf.etalab.gouv.fr/fullchain.pem; # managed by Certbot
  ssl_certificate_key /etc/letsencrypt/live/app.dvf.etalab.gouv.fr/privkey.pem; # managed by Certbot
  include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
  ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}
server {
  if ($host = app.dvf.etalab.gouv.fr) {
    return 301 https://$host$request_uri;
  } # managed by Certbot

  server_name app.dvf.etalab.gouv.fr;
  listen 80;
  return 404; # managed by Certbot
}
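
The effect of burst=50 combined with nodelay in the /api location can be easier to see with a small model. The sketch below is a simplified Python simulation of the leaky-bucket accounting behind limit_req for a single client IP. It is not NGINX's actual implementation (which works in integer milliseconds), and the LimitReq class and its allow method are hypothetical names used only for illustration.

```python
class LimitReq:
    """Simplified model of `limit_req ... burst=N nodelay` for one client IP."""

    def __init__(self, rate, burst):
        self.rate = rate      # allowed requests per second (10r/s above)
        self.burst = burst    # extra requests tolerated above the rate
        self.excess = 0.0     # requests currently counted above the rate
        self.last = None      # time of the last accepted request, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` is accepted."""
        if self.last is None:
            # First request from this client: nothing to account for yet.
            self.last = now
            return True
        elapsed = max(0.0, now - self.last)
        # The bucket drains at `rate` per second; each request adds one.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # NGINX would answer with limit_req_status (429)
        self.excess = excess
        self.last = now
        return True


limiter = LimitReq(rate=10, burst=50)
# 60 requests arriving at the same instant from one IP
accepted = sum(limiter.allow(0.0) for _ in range(60))
print(accepted, "accepted,", 60 - accepted, "rejected")
```

In this model, 51 of the 60 simultaneous requests pass (one within the rate plus the burst of 50) and the other 9 are rejected. The burst allowance then refills at 10 requests per second.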

And the includes/dvf_proxy.conf file, which proxies requests to Gunicorn, the Python server:

add_header X-Proxy-Cache $upstream_cache_status;
    
add_header X-Frame-Options SAMEORIGIN;
add_header Content-Security-Policy "frame-ancestors 'self'";
add_header X-Content-Security-Policy "frame-ancestors 'self'";

proxy_redirect off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_pass http://127.0.0.1:8000;
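
When debugging, it can help to find the file NGINX wrote for a given request. According to the proxy_cache_path documentation, the file name is the MD5 of the cache key, and with levels=1:2 the file sits in a one-character directory and then a two-character directory, both taken from the end of that hash. The Python sketch below illustrates that mapping for the "$request_method$host$request_uri" key set at the top of the configuration. cache_file_path is a hypothetical helper, not part of NGINX.

```python
import hashlib
import posixpath


def cache_file_path(cache_dir, key, levels=(1, 2)):
    """Return the path where the cache entry for `key` is expected on disk."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    parts = []
    end = len(digest)
    for width in levels:
        # Each directory level takes its characters from the end of the hash.
        parts.append(digest[end - width:end])
        end -= width
    return posixpath.join(cache_dir, *parts, digest)


# Key built as "$request_method$host$request_uri" (example from the config)
key = ("GET" "app.dvf.etalab.gouv.fr"
       "/api/mutations/75101/000AI/from=01-01-2014&to=30-06-2018")
print(cache_file_path("/var/cache/nginx", key))
```

The printed path can then be compared with the files under /var/cache/nginx, for instance to check whether a response that shows up as a MISS in the X-Proxy-Cache header was stored at all.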