Lufthansa Data Intelligence Pipeline

A production-style data engineering pipeline built during the Lufthansa Industry Solutions Data Intelligence Program.

The project ingests operational and reference data from the Lufthansa OpenAPI, processes it through a Bronze → Silver → Gold medallion architecture on Databricks, and provisions infrastructure using Terraform.

Motivation

This project was an opportunity to explore how modern data platforms are designed and orchestrated in real-world environments.

The focus was not on building a dashboard itself, but on understanding the systems behind scalable data ingestion and transformation pipelines, including:

distributed processing with PySpark,
infrastructure orchestration with Terraform,
medallion architectures,
data quality handling,
and production-style workflow automation.

One of the most valuable aspects of the project was adapting to unfamiliar tooling and building a working pipeline collaboratively within a short timeframe.

Architecture Overview

Lufthansa OpenAPI
        ↓
   Ingestion Layer
    (Python Scripts)
        ↓
 Bronze Layer (Raw JSON)
        ↓
 Silver Layer (Typed / Cleaned)
        ↓
 Gold Layer (Analytics / Enriched)
        ↓
 Dashboard & Aggregations

Infrastructure and orchestration are managed through:

Databricks
PySpark
Terraform
Spark Declarative Pipelines

Key Engineering Concepts

Medallion architecture (Bronze / Silver / Gold)
Distributed data processing with PySpark
Infrastructure-as-Code with Terraform
Automated ingestion and transformation workflows
Data quality validation and quarantine handling
Delta tables and structured pipelines
API pagination and retry strategies

Interesting Implementation Details

Some of the more interesting engineering challenges included:

Poison Pill Handling

When a paginated API request failed mid-sweep, the ingestion logic recursively isolated the problematic record using binary search while preserving valid data batches.

Retry & Backoff Logic

The pipeline implemented exponential backoff handling for rate limits and transient API failures.

Schema Normalization

The Lufthansa API occasionally returned single objects instead of arrays, requiring normalization logic to maintain consistent downstream schemas.

Deduplication & Quarantine

Silver-layer transformations handled type casting, deduplication, and routing of invalid records into quarantine tables.

Technology Stack

Python
PySpark
Databricks
Terraform
Delta Tables
Spark Declarative Pipelines

Repository Structure

DataIntelligenceLHIND/
├── config.yaml                          # Local dev credentials (git-ignored)
├── pyproject.toml                       # Python project config (uv)
├── src/
│   ├── ingestion/
│   │   ├── fetch_departures_from_airport.py  # Hub departure sweeper (primary ops script)
│   │   ├── fetch_all_references.py           # Toggleable reference data ingestion
│   │   └── fetch_flights_on_route.py         # Ad-hoc route-based flight lookup
│   ├── bronze/
│   │   ├── ops_ingestion.py             # Auto Loader → ops_flights DLT table
│   │   └── ref_ingestion.py             # Auto Loader → ref_* DLT tables (5 types)
│   ├── silver/
│   │   ├── silver_operations.py         # Flatten + deduplicate flights; quarantine
│   │   └── silver_references.py         # Flatten + deduplicate reference data; quarantine
│   ├── gold/
│   │   └── gold_flight_analytics.py     # Enriched fact table (flights + dimensions)
│   └── utils/
│       └── helpers.py                   # LufthansaClient — auth, pagination, retry, save
└── terraform/

Local Setup

Prerequisites

Python 3.11+
Databricks workspace
Terraform
Lufthansa OpenAPI credentials

Create a local config.yaml file:

client_id: "your_client_id"
client_secret: "your_client_secret"

Running the Pipeline

Install dependencies

uv sync

Run ingestion locally

uv run src/ingestion/fetch_departures_from_airport.py

Deploy infrastructure

cd terraform
terraform init
terraform apply

Notes

The project was designed around Databricks workflows and Spark Declarative Pipelines, so a fully reproducible deployment requires access to a configured Databricks environment.

Future Improvements

Real-time streaming ingestion
Expanded monitoring and observability
Automated schema evolution handling
More advanced analytics layers
Broader operational datasets

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
src		src
terraform		terraform
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
config.yaml		config.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lufthansa Data Intelligence Pipeline

Motivation

Architecture Overview

Key Engineering Concepts

Interesting Implementation Details

Poison Pill Handling

Retry & Backoff Logic

Schema Normalization

Deduplication & Quarantine

Technology Stack

Repository Structure

Local Setup

Prerequisites

Running the Pipeline

Install dependencies

Run ingestion locally

Deploy infrastructure

Notes

Future Improvements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Lufthansa Data Intelligence Pipeline

Motivation

Architecture Overview

Key Engineering Concepts

Interesting Implementation Details

Poison Pill Handling

Retry & Backoff Logic

Schema Normalization

Deduplication & Quarantine

Technology Stack

Repository Structure

Local Setup

Prerequisites

Running the Pipeline

Install dependencies

Run ingestion locally

Deploy infrastructure

Notes

Future Improvements

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages