Entur to InfluxDB: Real-time Public Transport Data
Project status – open source 🎉
The full source code is now publicly available on GitHub:
The repository contains the Docker‑Compose stack, all Python utilities, and a detailed README.md that explains how to spin up the whole pipeline locally.
Why I built EnturPulse
I needed a large, complex, real‑world dataset to test the performance of InfluxDB 3.x, and the time‑series data Entur publishes fit the bill. The PostgreSQL component was a necessary extra: it repairs faulty data and enriches it as much as possible before it is inserted into InfluxDB.
Public transport data from Entur is perfect because it mixes two very different data streams:
| Data type | Description |
|---|---|
| Live vehicle locations | A constant, high‑frequency Mini‑SIRI feed (≈ every 15 seconds). |
| Static timetable dump | A massive, deeply relational NeTEx archive that describes the entire Norwegian network. |
Data quality – the elephant in the room
The biggest surprise was how messy the source data is. I tried to repair it by adding a PostgreSQL layer that enriches the live points with static schedule information, but the improvements were modest.
- The OperatorRefValid field is frequently missing or malformed – see the official quality dashboard: https://data-quality.entur.no/SIRI_ET
- Many line, route, and journey references are broken, leading to massive foreign‑key violations during the NeTEx import.
- Even after extensive cleaning, a large fraction of records are dropped because they cannot be linked to a valid parent entity.
Below is a snippet from a typical import run that shows the amount of data that gets filtered out:
--- Filtering data against valid parent records ---
> Routes: Kept 13054 of 22596 (removed 9542 with missing lines).
> JourneyPatterns: Kept 16980 of 28517 (removed 11537 with missing routes).
> PassengerAssignments: Kept 211961 of 211961 (removed 0 with missing quays).
Despite the effort, the PostgreSQL enrichment step only manages to enrich ~2540 live points per poll, and many of those still contain missing or inaccurate fields.
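As a rough illustration of what that enrichment step does, here is a minimal sketch of the per‑point lookup against the static tables. The table and column names are placeholders of mine, not necessarily the schema used in the repository:

```python
# Minimal sketch of the enrichment lookup (hypothetical table/column names).
import psycopg2

def enrich_points(points, dsn="dbname=enturpulse user=postgres host=localhost"):
    """Attach static line information to live points via their line reference."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for point in points:
                cur.execute(
                    "SELECT name, transport_mode FROM lines WHERE line_id = %s",
                    (point.get("line_id"),),
                )
                row = cur.fetchone()
                if row:
                    point["line_name"], point["transport_mode"] = row
                # Points with broken references simply stay un-enriched.
    finally:
        conn.close()
    return points
```

In practice a single joined query, or a dictionary of static data loaded once per poll, would be far cheaper than one round trip per point, but the idea is the same.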
What I actually spent time on
| Task | Time spent (approx.) | Comments |
|---|---|---|
| Building the Docker‑Compose stack & automation | 30% | Getting all containers to start without manual intervention was a nightmare. |
| Learning BigQuery & pulling static data | 20% | First time I ever used BigQuery – the learning curve was steep but rewarding. |
| Writing parsers for the massive NeTEx XML files | 25% | 4725 XML files, 472MB of compressed data – a lot of edge‑case handling. |
| Debugging InfluxDB 3.x token creation (no env‑var support) | 15% | Had to strip ANSI escape codes from the binary output to extract the token. |
| Actual data ingestion & visualization | 10% | The fun part, but it feels like the tip of the iceberg. |
The token‑extraction nightmare
InfluxDB 3.x does not allow setting the admin token via environment variables. The influxdb-token-init service prints a coloured line like:
Token: 1234567890abcdef...
Because of the embedded ANSI colour codes, a naïve awk extraction fails. The final workaround looks like this:
# FIX: First, strip all ANSI escape codes that add color/style.
# Then, use awk to reliably find the line and extract the token value.
RAW=$(./influxdb3 token create --admin)
TOKEN=$(printf "%s\n" "${RAW}" | sed -e 's/\x1b\[[0-9;]*m//g' | awk -F 'Token: ' '/Token: / {print $2}')
Live data collection (Mini‑SIRI)
The live‑collector service polls Entur’s Mini‑SIRI endpoint at a fixed interval (the run below sleeps 60 seconds between polls):
Fetching live data from https://api.entur.io/realtime/v1/rest/vm?maxSize=100000...
Successfully fetched data.
Found 2540 vehicle activities.
Successfully parsed 2540 data points.
PostgreSQL credentials found. Attempting data enrichment...
Connected to PostgreSQL database.
Successfully enriched 2540 data points.
PostgreSQL connection closed.
Connecting to InfluxDB database: 'enturpulse'...
Writing 2540 enriched points to InfluxDB...
Successfully wrote enriched data to InfluxDB.
[Thu Jul 31 16:32:29 UTC 2025] Sleeping for 60 seconds...
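The loop behind that log is conceptually simple. Below is a rough sketch of the fetch‑and‑parse step, assuming the standard SIRI‑VM element names; the repository’s parser handles more fields and edge cases than this:

```python
# Rough sketch of fetching and parsing the SIRI-VM feed (assumed element names).
import requests
import xml.etree.ElementTree as ET

URL = "https://api.entur.io/realtime/v1/rest/vm?maxSize=100000"

def fetch_vehicle_activities():
    # Entur asks clients to identify themselves via the ET-Client-Name header.
    resp = requests.get(URL, headers={"ET-Client-Name": "example-enturpulse"}, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)

    points = []
    # The "{*}" prefix ignores the SIRI XML namespace (Python 3.8+).
    for activity in root.iterfind(".//{*}VehicleActivity"):
        journey = activity.find("{*}MonitoredVehicleJourney")
        if journey is None:
            continue
        loc = journey.find("{*}VehicleLocation")
        points.append({
            "vehicle_id": getattr(journey.find("{*}VehicleRef"), "text", None),
            "line_id": getattr(journey.find("{*}LineRef"), "text", None),
            "latitude": getattr(loc.find("{*}Latitude"), "text", None) if loc is not None else None,
            "longitude": getattr(loc.find("{*}Longitude"), "text", None) if loc is not None else None,
        })
    return points
```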
Each point is written to the bus_location measurement with tags such as vehicle_id, operator_ref, line_id, journey_id, and fields like latitude, longitude, delay_seconds, etc.
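Writing such a point with the official influxdb3-python client looks roughly like this; host, token, and the example values are placeholders, not the repository’s exact code:

```python
# Sketch of writing one enriched point to the bus_location measurement
# with the influxdb3-python client (placeholder host/token/values).
from influxdb_client_3 import InfluxDBClient3, Point

client = InfluxDBClient3(
    host="http://localhost:8181",          # assumed local InfluxDB 3.x instance
    token="YOUR_ADMIN_TOKEN",
    database="enturpulse",
)

point = (
    Point("bus_location")
    .tag("vehicle_id", "VYX:Vehicle:12345")        # example values only
    .tag("operator_ref", "VYX:Operator:1")
    .tag("line_id", "VYX:Line:23")
    .tag("journey_id", "VYX:ServiceJourney:23-101")
    .field("latitude", 59.9139)
    .field("longitude", 10.7522)
    .field("delay_seconds", 95)
)

client.write(point)   # write() also accepts a list of points, so a whole poll can go in one call
```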
Timetable import (NeTEx)
The timetable‑importer runs once a day (usually between 00:00 and 04:00) and pulls the latest aggregated NeTEx zip from Entur’s public bucket:
Starting EnturPulse Timetable Importer Service...
[Fri Aug 1 00:46:07 CEST 2025] Current time is between 00:00 and 04:00. Importing yesterday's data.
[Fri Aug 1 00:46:07 CEST 2025] Running initial import for date 20250731...
PRODUCTION MODE: Downloading from 'https://storage.googleapis.com/download/storage/v1/b/sarpanit-production/o/netex%2Frb_norway-aggregated-netex-20250731.zip?alt=media' for date 20250731
Download complete.
--- Starting extraction from 4725 XML files ---
Processing file 4725/4725: MOR_MOR-Line-257_257_Skoleruter-Sykkylven.xml ...
--- Extraction Complete ---
Connected to PostgreSQL database: enturpulse
--- Checking database for existing parent records ---
--- Attempting to repair missing references from BigQuery ---
> Searching for 2504 missing lines...
> Could not find any of the missing lines in BigQuery.
> Searching for 10079 missing routes...
> Could not find any of the missing routes in BigQuery.
--- Filtering data against valid parent records ---
> Routes: Kept 13054 of 22596 (removed 9542 with missing lines).
> JourneyPatterns: Kept 16980 of 28517 (removed 11537 with missing routes).
> PassengerAssignments: Kept 211961 of 211961 (removed 0 with missing quays).
--- Writing final, cleaned NeTEx data to PostgreSQL ---
Successfully processed 16980 records for journey_patterns via fast path.
Foreign key violation in journeys. Switching to resilient row‑by‑row insertion...
Successfully inserted 298003 records into journeys (skipped 53186 invalid records).
Foreign key violation in route_sequences. Switching to resilient row‑by‑row insertion...
Successfully inserted 361015 records into route_sequences (skipped 187113 invalid records).
...
[Fri Aug 1 00:49:14 CEST 2025] Initial import finished.
----------------------------------------------------
Scheduler is now running. Next import is scheduled.
View logs with 'docker logs <container_name>'.
----------------------------------------------------
The importer cleans the data, drops orphaned rows, and finally stores the relational model in PostgreSQL for later enrichment of live points.
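The two interesting moves in that log are filtering child records against the set of valid parent IDs, and dropping from a fast bulk insert to row‑by‑row insertion when a foreign‑key violation appears. A simplified sketch, with hypothetical table and column names:

```python
# Simplified sketch of the import strategy: filter against valid parents first,
# try a fast bulk insert, and fall back to row-by-row inserts on FK violations.
import psycopg2
import psycopg2.extras

def filter_children(children, valid_parent_ids, parent_key):
    """Keep only rows whose parent reference exists in the database."""
    kept = [row for row in children if row[parent_key] in valid_parent_ids]
    print(f"Kept {len(kept)} of {len(children)} "
          f"(removed {len(children) - len(kept)} with missing parents).")
    return kept

def insert_resilient(conn, table, columns, rows):
    """Fast path first; on a foreign-key violation, insert row by row and skip offenders."""
    sql = (f"INSERT INTO {table} ({', '.join(columns)}) "
           f"VALUES ({', '.join(['%s'] * len(columns))})")
    values = [tuple(row[c] for c in columns) for row in rows]
    try:
        with conn.cursor() as cur:
            psycopg2.extras.execute_batch(cur, sql, values)
        conn.commit()
        return len(values), 0
    except psycopg2.errors.ForeignKeyViolation:
        conn.rollback()
        inserted = skipped = 0
        with conn.cursor() as cur:
            for value in values:
                try:
                    cur.execute(sql, value)
                    conn.commit()
                    inserted += 1
                except psycopg2.errors.ForeignKeyViolation:
                    conn.rollback()
                    skipped += 1
        return inserted, skipped
```

Per‑row SAVEPOINTs inside a single transaction would give the same resilience without a commit per row; the sketch simply mirrors the “fast path / resilient path” split visible in the log.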
Lessons learned & next steps
- Data quality matters – Even with heavy cleaning, the underlying feed contains many inconsistencies that limit the usefulness of joins.
- Automation is a double‑edged sword – Docker‑Compose made deployment reproducible, but the lack of proper env‑var support in InfluxDB 3.x forced a lot of brittle scripting.
- BigQuery is powerful but unfamiliar – Pulling static reference data from the public Entur dataset was a steep learning curve, but it paid off by giving me a “ground truth” for validation.
- Visualization is still a work in progress – I have a few Grafana dashboards showing live vehicle locations, but heatmaps and on‑time performance analyses are still pending.
Future work will focus on:
- Building more robust data‑quality pipelines (e.g., fuzzy matching for missing OperatorRefs).
- Adding a caching layer to avoid re‑downloading the same NeTEx archive when nothing has changed.
- Experimenting with InfluxDB 3.x’s SQL‑based joins to see if we can push more of the enrichment logic into the time‑series engine (a rough sketch follows this list).
- Extending Grafana dashboards with heatmaps, delay histograms, and route‑level KPIs.
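For the SQL‑join idea specifically, the experiment could look something like the query below, assuming the static line data were also loaded into InfluxDB as a (hypothetical) line_metadata table. Whether such a join performs acceptably is exactly what remains to be tested:

```python
# Sketch of pushing enrichment into InfluxDB 3.x via SQL
# (line_metadata is a hypothetical table; host/token are placeholders).
from influxdb_client_3 import InfluxDBClient3

client = InfluxDBClient3(host="http://localhost:8181", token="YOUR_ADMIN_TOKEN", database="enturpulse")

sql = """
SELECT b.time, b.vehicle_id, b.delay_seconds, m.line_name
FROM bus_location AS b
JOIN line_metadata AS m ON b.line_id = m.line_id
WHERE b.time >= now() - INTERVAL '15 minutes'
"""

# query() returns a PyArrow table; convert to pandas for a quick look.
print(client.query(sql).to_pandas().head())
```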
Did I have fun?
EnturPulse turned out to be more about tooling and data‑wrangling than pure analytics. I spent more time writing parsers, fixing token extraction, and learning BigQuery than visualising the data. Still, the project gave me a deep appreciation for the challenges of real‑time public‑transport analytics and a solid foundation for anyone who wants to experiment with InfluxDB 3.x and open transport data.
Feel free to fork the repo, open issues, or just stare at the code and wonder why the 23 bus is always late. 🚍💨