💵 Real-Time Fraud Detection Pipeline

A production-grade fraud detection system built on Databricks. Ingests raw credit card transactions, processes them through a Bronze → Silver → Gold medallion architecture, trains unsupervised anomaly detection models, and scores live transactions in real time — firing email alerts the moment a suspicious transaction is detected.

⚙️ How It Works

Kaggle CSV
    │
    ▼
Bronze Table  ──►  Silver Table  ──►  Gold Tables
(raw ingest)      (clean + FE)       (KPIs + model training)
                                              │
                              K-Means + Statistical models trained
                                              │
                                              ▼
                                  UC Volume (live_transactions)
                                              │
                                     Auto Loader (Files)
                                              │
                                              ▼
                                    Silver Transformations
                                    (filters + scaling)
                                              │
                                              ▼
                                ┌───────────────────────────┐
                                │     Real-Time Inference   │
                                │  K-Means  +  Statistical  │
                                │    →  is_fraud_alert      │
                                └─────────────┬─────────────┘
                                              │
                                        foreachBatch()
                                    ┌─────────┴─────────┐
                                    ▼                   ▼
                             Delta Append           Email Alert
                           live_fraud_alerts       (Gmail SMTP)
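
The inference box above combines two signals into `is_fraud_alert`. The following is a minimal pure-Python sketch of that combination, assuming the K-Means score is the Euclidean distance to the nearest centroid and the statistical score is a z-score on the amount. The thresholds, the output keys `kmeans_distance` and `z_score`, and the function names are hypothetical; the notebooks do this work in PySpark, and their actual cut-offs are not stated in this README.

```python
import math

# Hypothetical thresholds; the notebooks' actual values are not stated in this README.
DISTANCE_THRESHOLD = 3.0
Z_THRESHOLD = 3.0


def nearest_centroid_distance(features, centroids):
    """Euclidean distance from a feature vector to its closest centroid."""
    return min(math.dist(features, centroid) for centroid in centroids)


def z_score(value, mean, std):
    """Standard score of a value; 0.0 when the spread is zero."""
    return 0.0 if std == 0 else (value - mean) / std


def score_transaction(features, amount, centroids, amount_mean, amount_std):
    """Return both raw scores and the flags derived from them."""
    distance = nearest_centroid_distance(features, centroids)
    z = z_score(amount, amount_mean, amount_std)
    is_anomaly_kmeans = distance > DISTANCE_THRESHOLD
    is_anomaly_statistical = abs(z) > Z_THRESHOLD
    return {
        "kmeans_distance": distance,
        "z_score": z,
        "is_anomaly_kmeans": is_anomaly_kmeans,
        "is_anomaly_statistical": is_anomaly_statistical,
        # A transaction is alerted when at least one model flags it.
        "is_fraud_alert": is_anomaly_kmeans or is_anomaly_statistical,
    }
```

Keeping the raw scores next to the boolean flags lets downstream consumers (the log block and the email table) report why a row was flagged, not just that it was.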

🔄 Pipeline Phases

| Phase | Notebook | Output |
|---|---|---|
| 1. Ingest | `fraud-detection-analysis.ipynb` | `bronze_transactions` |
| 2. Clean & Engineer | `fraud-detection-analysis.ipynb` | `silver_transactions` |
| 3. Analytics & Modelling | `fraud-detection-analysis.ipynb` | 5× Gold tables, trained models |
| 4. Visualisation | `fraud-detection-analysis.ipynb` | 4× SQL views for dashboards |
| 5. Real-Time Streaming | `fraud-detection-streaming-only.ipynb` | `live_fraud_alerts` + email alerts |

The same Silver feature engineering is applied in both the batch and streaming paths:

- `amount_category`: zero / small / medium / large / very_large
- Quality filters
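
As an illustration of the `amount_category` feature, here is a minimal sketch of the bucketing logic in plain Python. The five category names come from this README; the boundary values and the function name are assumptions made for the example, since the notebooks' real cut-offs are not listed here.

```python
# Hypothetical upper bounds (inclusive) for each bucket; not taken from the notebooks.
SMALL_MAX = 10.0
MEDIUM_MAX = 100.0
LARGE_MAX = 1000.0


def amount_category(amount: float) -> str:
    """Map a transaction amount to one of the five category labels."""
    if amount < 0:
        # Assumes the quality filters have already dropped negative amounts.
        raise ValueError("amount must be non-negative")
    if amount == 0:
        return "zero"
    if amount <= SMALL_MAX:
        return "small"
    if amount <= MEDIUM_MAX:
        return "medium"
    if amount <= LARGE_MAX:
        return "large"
    return "very_large"
```

Defining the boundaries once and reusing the same function (or the same Spark expression) in both notebooks is what keeps batch and stream features consistent.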

🚨 Alert System

| Channel | Behaviour |
|---|---|
| Driver Log | Every batch logs a summary line. Fraud batches print a full `🚨 FRAUD ALERT` block with scores |
| Gmail Email | HTML email sent via SMTP for any batch containing a row with `is_fraud_alert = True` |

Email includes: timestamp, batch ID, alert count, and a per-transaction table showing Amount, Category, and which model(s) triggered with raw scores (distance / z-score).
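
A rough sketch of how such an email could be assembled with the standard library is shown below. The shape of each alert dict (keys `amount`, `amount_category`, `kmeans_distance`, `z_score`, plus the two flag columns), the subject line, and the function names are assumptions for illustration; the notebook's actual template may differ.

```python
import html
import smtplib
from datetime import datetime, timezone
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_alert_email(batch_id, alerts, sender, recipients):
    """Build an HTML alert message; `alerts` is a list of dicts (hypothetical shape)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    rows = []
    for alert in alerts:
        # List every model that flagged the row, with its raw score.
        triggered = []
        if alert["is_anomaly_kmeans"]:
            triggered.append(f"K-Means (distance {alert['kmeans_distance']:.2f})")
        if alert["is_anomaly_statistical"]:
            triggered.append(f"Statistical (z-score {alert['z_score']:.2f})")
        rows.append(
            "<tr>"
            f"<td>{alert['amount']:.2f}</td>"
            f"<td>{html.escape(alert['amount_category'])}</td>"
            f"<td>{html.escape(', '.join(triggered))}</td>"
            "</tr>"
        )
    body = (
        f"<p>Time: {timestamp}<br>Batch ID: {batch_id}<br>Alerts: {len(alerts)}</p>"
        "<table border='1'><tr><th>Amount</th><th>Category</th><th>Triggered by</th></tr>"
        + "".join(rows)
        + "</table>"
    )
    message = MIMEMultipart("alternative")
    message["Subject"] = f"Fraud alert: {len(alerts)} transaction(s) in batch {batch_id}"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message.attach(MIMEText(body, "html"))
    return message


def send_alert_email(message, sender, password):
    """Send the message through Gmail's SMTP endpoint over SSL."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(message)
```

Separating message construction from sending makes the HTML easy to inspect without network access. Gmail accounts with two-factor authentication generally need an app password rather than the account password for SMTP login.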

🎨 Threat Risk Levels

| Level | Trigger |
|---|---|
| SAFE | `is_fraud_alert = False`; neither model flagged |
| ALERT | `is_anomaly_statistical = True` only |
| HIGH | `is_anomaly_kmeans = True` only |
| CRITICAL | Both models flagged simultaneously |
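
The four levels follow directly from the two model flags. A small sketch of that mapping is given here; the function name `threat_level` is hypothetical.

```python
def threat_level(is_anomaly_kmeans: bool, is_anomaly_statistical: bool) -> str:
    """Translate the two model flags into a threat risk level."""
    if is_anomaly_kmeans and is_anomaly_statistical:
        return "CRITICAL"  # both models agree
    if is_anomaly_kmeans:
        return "HIGH"      # K-Means only
    if is_anomaly_statistical:
        return "ALERT"     # statistical model only
    return "SAFE"          # neither model flagged
```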

📁 Repository Structure

fraud-detection/
│
├── README.md
├── LICENSE
├── dashboard-query
├── fraud-detection-analysis.ipynb         ← ETL pipeline: ingest → medallion → model training
│
└── fraud-detection-streaming-only.ipynb   ← Real-time scoring + alerts (auto-triggered via jobs)

🛠️ Tech Stack

| Layer | Technology |
|---|---|
| Platform | Databricks (Serverless) |
| Streaming | PySpark Structured Streaming, Auto Loader |
| Storage | Delta Lake, Unity Catalog Volumes |
| ML | PySpark MLlib (VectorAssembler, StandardScaler, K-Means) |
| Alerting | Gmail SMTP (`smtplib`), HTML email |
| Dataset | Kaggle Credit Card Fraud |

🚀 Setup & Usage

Prerequisites: Databricks workspace with Unity Catalog enabled, Runtime 16.0+.

1. Run `fraud-detection-analysis.ipynb` top to bottom. This creates all medallion tables and trains both models.

2. In Cell 5 of `fraud-detection-streaming-only.ipynb`, fill in `SENDER_EMAIL`, `SENDER_PASSWORD`, and `RECIPIENT_EMAILS`.

3. Run Cells 1–6 of `fraud-detection-streaming-only.ipynb` in a single session.

4. In Databricks Jobs, create a File Arrival trigger pointing to:

/Volumes/workspace/default/fraud_detection/live_transactions/

5. Test the stream: drop a CSV containing a single fraud row into the volume path above to trigger the full pipeline.
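
One way to produce such a test file is sketched below. The helper name, the file-name pattern, and the idea of passing the row as a dict are assumptions; the row's columns must match whatever schema the streaming notebook expects, which this README does not spell out.

```python
import csv
import uuid
from pathlib import Path

# Volume path from step 4; a local directory can stand in when testing off-platform.
LIVE_DIR = "/Volumes/workspace/default/fraud_detection/live_transactions/"


def drop_test_transaction(row, directory=LIVE_DIR):
    """Write one transaction as a uniquely named CSV file and return its path."""
    target = Path(directory) / f"test_txn_{uuid.uuid4().hex}.csv"
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
    return target
```

A unique file name per drop matters because file-based ingestion tracks files it has already processed, so reusing a name may cause the new file to be skipped.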

🗄️ Output Tables

| Table | Description |
|---|---|
| `bronze_transactions` | Raw data + ingestion metadata |
| `silver_transactions` | Cleaned + feature-engineered |
| `gold_kmeans_anomalies` | K-Means scores on historical data |
| `gold_statistical_anomalies` | Z-score results on historical data |
| `gold_model_performance` | Precision/recall comparison of both models |
| `gold_risk_scoring` | Fraud probability by amount category |
| `gold_temporal_patterns` | Fraud rate by hour of day |
| `live_fraud_alerts` | Real-time fraud transactions (streaming output) |

📜 License

MIT License — see LICENSE for details.

Built by @itsbk13
