Skip to content

fredch16/DataIntelligenceLHIND

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

96 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Lufthansa Data Intelligence Pipeline

A production-style data engineering pipeline built during the Lufthansa Industry Solutions Data Intelligence Program.

The project ingests operational and reference data from the Lufthansa OpenAPI, processes it through a Bronze → Silver → Gold medallion architecture on Databricks, and provisions infrastructure using Terraform.


Motivation

This project was an opportunity to explore how modern data platforms are designed and orchestrated in real-world environments.

The focus was not on building a dashboard itself, but on understanding the systems behind scalable data ingestion and transformation pipelines, including:

  • distributed processing with PySpark,
  • infrastructure orchestration with Terraform,
  • medallion architectures,
  • data quality handling,
  • and production-style workflow automation.

One of the most valuable aspects of the project was adapting to unfamiliar tooling and building a working pipeline collaboratively within a short timeframe.


Architecture Overview

Lufthansa OpenAPI
        ↓
   Ingestion Layer
    (Python Scripts)
        ↓
 Bronze Layer (Raw JSON)
        ↓
 Silver Layer (Typed / Cleaned)
        ↓
 Gold Layer (Analytics / Enriched)
        ↓
 Dashboard & Aggregations

Infrastructure and orchestration are managed through:

  • Databricks
  • PySpark
  • Terraform
  • Spark Declarative Pipelines

Key Engineering Concepts

  • Medallion architecture (Bronze / Silver / Gold)
  • Distributed data processing with PySpark
  • Infrastructure-as-Code with Terraform
  • Automated ingestion and transformation workflows
  • Data quality validation and quarantine handling
  • Delta tables and structured pipelines
  • API pagination and retry strategies

Interesting Implementation Details

Some of the more interesting engineering challenges included:

Poison Pill Handling

When a paginated API request failed mid-sweep, the ingestion logic recursively isolated the problematic record using binary search while preserving valid data batches.

Retry & Backoff Logic

The pipeline implemented exponential backoff handling for rate limits and transient API failures.

Schema Normalization

The Lufthansa API occasionally returned single objects instead of arrays, requiring normalization logic to maintain consistent downstream schemas.

Deduplication & Quarantine

Silver-layer transformations handled type casting, deduplication, and routing of invalid records into quarantine tables.


Technology Stack

  • Python
  • PySpark
  • Databricks
  • Terraform
  • Delta Tables
  • Spark Declarative Pipelines

Repository Structure

DataIntelligenceLHIND/
├── config.yaml                          # Local dev credentials (git-ignored)
├── pyproject.toml                       # Python project config (uv)
├── src/
│   ├── ingestion/
│   │   ├── fetch_departures_from_airport.py  # Hub departure sweeper (primary ops script)
│   │   ├── fetch_all_references.py           # Toggleable reference data ingestion
│   │   └── fetch_flights_on_route.py         # Ad-hoc route-based flight lookup
│   ├── bronze/
│   │   ├── ops_ingestion.py             # Auto Loader → ops_flights DLT table
│   │   └── ref_ingestion.py             # Auto Loader → ref_* DLT tables (5 types)
│   ├── silver/
│   │   ├── silver_operations.py         # Flatten + deduplicate flights; quarantine
│   │   └── silver_references.py         # Flatten + deduplicate reference data; quarantine
│   ├── gold/
│   │   └── gold_flight_analytics.py     # Enriched fact table (flights + dimensions)
│   └── utils/
│       └── helpers.py                   # LufthansaClient — auth, pagination, retry, save
└── terraform/

Local Setup

Prerequisites

  • Python 3.11+
  • Databricks workspace
  • Terraform
  • Lufthansa OpenAPI credentials

Create a local config.yaml file:

client_id: "your_client_id"
client_secret: "your_client_secret"

Running the Pipeline

Install dependencies

uv sync

Run ingestion locally

uv run src/ingestion/fetch_departures_from_airport.py

Deploy infrastructure

cd terraform
terraform init
terraform apply

Notes

The project was designed around Databricks workflows and Spark Declarative Pipelines, so a fully reproducible deployment requires access to a configured Databricks environment.


Future Improvements

  • Real-time streaming ingestion
  • Expanded monitoring and observability
  • Automated schema evolution handling
  • More advanced analytics layers
  • Broader operational datasets

License

MIT License

About

Production-style data pipeline built with Databricks, PySpark, Terraform, and medallion architecture concepts.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors