A production-style data engineering pipeline built during the Lufthansa Industry Solutions Data Intelligence Program.
The project ingests operational and reference data from the Lufthansa OpenAPI, processes it through a Bronze → Silver → Gold medallion architecture on Databricks, and provisions infrastructure using Terraform.
This project was an opportunity to explore how modern data platforms are designed and orchestrated in real-world environments.
The focus was not the dashboard itself but the systems behind scalable data ingestion and transformation pipelines, including:
- distributed processing with PySpark,
- infrastructure orchestration with Terraform,
- medallion architectures,
- data quality handling,
- and production-style workflow automation.
One of the most valuable aspects of the project was adapting to unfamiliar tooling and building a working pipeline collaboratively within a short timeframe.
Pipeline architecture:
Lufthansa OpenAPI
↓
Ingestion Layer
(Python Scripts)
↓
Bronze Layer (Raw JSON)
↓
Silver Layer (Typed / Cleaned)
↓
Gold Layer (Analytics / Enriched)
↓
Dashboard & Aggregations
Infrastructure and orchestration are managed through:
- Databricks
- PySpark
- Terraform
- Spark Declarative Pipelines
Key features:
- Medallion architecture (Bronze / Silver / Gold)
- Distributed data processing with PySpark
- Infrastructure-as-Code with Terraform
- Automated ingestion and transformation workflows
- Data quality validation and quarantine handling
- Delta tables and structured pipelines
- API pagination and retry strategies
Some of the more interesting engineering challenges included:
When a paginated API request failed mid-sweep, the ingestion logic recursively isolated the problematic record using binary search while preserving valid data batches.
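The bisection idea can be sketched as follows. This is a minimal illustration, not the project's actual code: `fetch_range` is a hypothetical callable that fetches `limit` records starting at `offset` and raises when the upstream request fails, and `fake_fetch` is a stand-in that simulates one bad record.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]


def fetch_with_isolation(
    fetch_range: Callable[[int, int], List[Record]],
    offset: int,
    limit: int,
) -> Tuple[List[Record], List[int]]:
    """Fetch `limit` records from `offset`, bisecting the range on failure.

    Returns (records that were fetched, offsets that fail on their own).
    """
    try:
        return fetch_range(offset, limit), []
    except Exception:  # a real client would catch a narrower error type
        if limit == 1:
            # A single-record request still fails: this is the bad record.
            return [], [offset]
        half = limit // 2
        left_ok, left_bad = fetch_with_isolation(fetch_range, offset, half)
        right_ok, right_bad = fetch_with_isolation(
            fetch_range, offset + half, limit - half
        )
        return left_ok + right_ok, left_bad + right_bad


def fake_fetch(offset: int, limit: int) -> List[Record]:
    """Hypothetical stand-in for the API: any range containing record 5 fails."""
    if offset <= 5 < offset + limit:
        raise RuntimeError("upstream error")
    return [{"id": i} for i in range(offset, offset + limit)]


records, bad_offsets = fetch_with_isolation(fake_fetch, 0, 8)
```

Each failing range is split in half until the failure is pinned to a single offset, so every sub-range that succeeds is kept rather than discarded along with the bad record.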
The pipeline implemented exponential backoff handling for rate limits and transient API failures.
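A minimal sketch of exponential backoff with jitter, under stated assumptions: `TransientError` is a hypothetical marker for retryable failures such as rate-limit or server errors, and the attempt count and delay values are illustrative rather than taken from the project.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, server error)."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without actually waiting.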
The Lufthansa API occasionally returned single objects instead of arrays, requiring normalization logic to maintain consistent downstream schemas.
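One way to handle this is a small coercion helper applied wherever a field may be either an object or an array. The sketch below is illustrative: `as_list` and the payload keys are hypothetical names, not the project's actual helpers or the API's exact schema.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Coerce an 'object or array' field into a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    # A lone object becomes a one-element list.
    return [value]


# Illustrative payload in which a single flight arrives as a bare object.
payload = {"Flights": {"Flight": {"number": "LH400"}}}
flights = as_list(payload["Flights"]["Flight"])
```

Applying the helper at ingestion time means downstream layers always see an array and never need to branch on the payload shape.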
Silver-layer transformations handled type casting, deduplication, and routing of invalid records into quarantine tables.
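The routing logic can be sketched in plain Python. The project itself does this in PySpark; plain Python is used here only so the example runs without a Spark session. The field names `flight_number` and `scheduled_departure` and the deduplication key are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Row = Dict[str, object]


def to_silver(raw_rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Cast, deduplicate, and split rows into (clean, quarantined)."""
    clean: Dict[Tuple[object, object], Row] = {}
    quarantine: List[Row] = []
    for row in raw_rows:
        try:
            typed: Row = {
                "flight_number": str(row["flight_number"]),
                "scheduled_departure": datetime.fromisoformat(
                    str(row["scheduled_departure"])
                ),
            }
        except (KeyError, ValueError) as exc:
            # Keep the raw row plus a reason instead of dropping it silently.
            quarantine.append({**row, "quarantine_reason": repr(exc)})
            continue
        # Deduplicate on the (assumed) business key; the last occurrence wins.
        key = (typed["flight_number"], typed["scheduled_departure"])
        clean[key] = typed
    return list(clean.values()), quarantine


raw = [
    {"flight_number": "LH400", "scheduled_departure": "2024-05-01T10:30"},
    {"flight_number": "LH400", "scheduled_departure": "2024-05-01T10:30"},
    {"flight_number": "LH401", "scheduled_departure": "not-a-date"},
    {"scheduled_departure": "2024-05-01T11:00"},
]
clean_rows, quarantined_rows = to_silver(raw)
```

Quarantined rows keep their original fields alongside a reason, so they can be inspected and replayed once the upstream issue is understood.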
Tech stack:
- Python
- PySpark
- Databricks
- Terraform
- Delta Tables
- Spark Declarative Pipelines
Project structure:
DataIntelligenceLHIND/
├── config.yaml # Local dev credentials (git-ignored)
├── pyproject.toml # Python project config (uv)
├── src/
│   ├── ingestion/
│   │   ├── fetch_departures_from_airport.py # Hub departure sweeper (primary ops script)
│   │   ├── fetch_all_references.py # Toggleable reference data ingestion
│   │   └── fetch_flights_on_route.py # Ad-hoc route-based flight lookup
│   ├── bronze/
│   │   ├── ops_ingestion.py # Auto Loader → ops_flights DLT table
│   │   └── ref_ingestion.py # Auto Loader → ref_* DLT tables (5 types)
│   ├── silver/
│   │   ├── silver_operations.py # Flatten + deduplicate flights; quarantine
│   │   └── silver_references.py # Flatten + deduplicate reference data; quarantine
│   ├── gold/
│   │   └── gold_flight_analytics.py # Enriched fact table (flights + dimensions)
│   └── utils/
│       └── helpers.py # LufthansaClient: auth, pagination, retry, save
└── terraform/
Prerequisites:
- Python 3.11+
- Databricks workspace
- Terraform
- Lufthansa OpenAPI credentials
Create a local config.yaml file:
client_id: "your_client_id"
client_secret: "your_client_secret"
Install dependencies:
uv sync
Run the primary ingestion script:
uv run src/ingestion/fetch_departures_from_airport.py
Provision the infrastructure:
cd terraform
terraform init
terraform apply
The project was designed around Databricks workflows and Spark Declarative Pipelines, so a fully reproducible deployment requires access to a configured Databricks environment.
Possible future improvements:
- Real-time streaming ingestion
- Expanded monitoring and observability
- Automated schema evolution handling
- More advanced analytics layers
- Broader operational datasets
MIT License