itsbk13/Real-Time-Transaction-Fraud-Detection-System

💵 Real-Time Fraud Detection Pipeline

Databricks · Python · Delta Lake · PySpark

A production-grade fraud detection system built on Databricks. It ingests raw credit card transactions, processes them through a Bronze → Silver → Gold medallion architecture, trains unsupervised anomaly detection models, and scores live transactions in real time, sending an email alert as soon as a suspicious transaction is detected.


⚙️ How It Works

Kaggle CSV
    │
    ▼
Bronze Table  ──►  Silver Table  ──►  Gold Tables
(raw ingest)      (clean + FE)       (KPIs + model training)
                                              │
                              K-Means + Statistical models trained
                                              │
                                              ▼
                                  UC Volume (live_transactions)
                                              │
                                     Auto Loader (Files)
                                              │
                                              ▼
                                    Silver Transformations
                                    (filters + scaling)
                                              │
                                              ▼
                                ┌───────────────────────────┐
                                │     Real-Time Inference   │
                                │  K-Means  +  Statistical  │
                                │    →  is_fraud_alert      │
                                └─────────────┬─────────────┘
                                              │
                                        foreachBatch()
                                    ┌─────────┴─────────┐
                                    ▼                   ▼
                             Delta Append           Email Alert
                           live_fraud_alerts       (Gmail SMTP)
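
The inference step in the diagram can be read as two independent checks whose results are combined into `is_fraud_alert`. The sketch below illustrates that logic in plain Python rather than PySpark. The thresholds, the centroid values, and the helper names are assumptions made for illustration and are not taken from the notebooks.

```python
import math

# Hypothetical thresholds for illustration; the notebooks derive their own.
DISTANCE_THRESHOLD = 3.0   # max allowed distance to the nearest centroid
Z_SCORE_THRESHOLD = 3.0    # max allowed |z-score| of the amount


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a feature vector to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def z_score(value, mean, std):
    """Standard score of a value; 0.0 when the spread is zero."""
    return 0.0 if std == 0 else (value - mean) / std


def score_transaction(features, amount, centroids, amount_mean, amount_std):
    """Run both checks and flag the transaction if either one fires."""
    distance = nearest_centroid_distance(features, centroids)
    z = z_score(amount, amount_mean, amount_std)
    is_anomaly_kmeans = distance > DISTANCE_THRESHOLD
    is_anomaly_statistical = abs(z) > Z_SCORE_THRESHOLD
    return {
        "distance": distance,
        "z_score": z,
        "is_anomaly_kmeans": is_anomaly_kmeans,
        "is_anomaly_statistical": is_anomaly_statistical,
        "is_fraud_alert": is_anomaly_kmeans or is_anomaly_statistical,
    }
```

Keeping the raw `distance` and `z_score` next to the boolean flags mirrors what the alert email reports for each transaction.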

🔄 Pipeline Phases

| Phase | Notebook | Output |
| --- | --- | --- |
| 1: Ingest | fraud-detection-analysis.ipynb | bronze_transactions |
| 2: Clean & Engineer | fraud-detection-analysis.ipynb | silver_transactions |
| 3: Analytics & Modelling | fraud-detection-analysis.ipynb | 5× Gold tables, trained models |
| 4: Visualisation | fraud-detection-analysis.ipynb | 4× SQL views for dashboards |
| 5: Real-Time Streaming | fraud-detection-streaming-only.ipynb | live_fraud_alerts + email alerts |

The same Silver feature engineering is applied in both the batch and streaming paths:

  • amount_category: zero / small / medium / large / very_large
  • Quality filters
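
As an illustration of the `amount_category` feature, here is a minimal sketch of a bucketing function. The five category names come from the list above; the cut-off values are hypothetical, since this README does not state the boundaries the notebooks use.

```python
# Hypothetical upper bounds (exclusive) for each bucket.
SMALL_LIMIT = 10.0
MEDIUM_LIMIT = 100.0
LARGE_LIMIT = 1000.0


def amount_category(amount: float) -> str:
    """Map a transaction amount to one of the five category labels."""
    if amount < 0:
        # Assumed to be removed earlier by the quality filters.
        raise ValueError("amount must be non-negative")
    if amount == 0:
        return "zero"
    if amount < SMALL_LIMIT:
        return "small"
    if amount < MEDIUM_LIMIT:
        return "medium"
    if amount < LARGE_LIMIT:
        return "large"
    return "very_large"
```

Defining the function once and calling it from both the batch and streaming code is one way to keep the two paths consistent.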

🚨 Alert System

| Channel | Behaviour |
| --- | --- |
| Driver Log | Every batch logs a summary line. Fraud batches print a full 🚨 FRAUD ALERT block with scores. |
| Gmail Email | HTML email sent via SMTP on any is_fraud_alert = True batch. |

Email includes: timestamp, batch ID, alert count, and a per-transaction table showing Amount, Category, and which model(s) triggered with raw scores (distance / z-score).
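
The email described above could be assembled roughly as follows, using only the Python standard library. This is a sketch, not the notebook's code: the function names, the dictionary keys on each alert, and the subject line are assumptions. The Gmail host and port used in `send_via_gmail` are the standard SMTP-over-SSL settings.

```python
import html
import smtplib
from datetime import datetime, timezone
from email.message import EmailMessage


def triggered_models(alert: dict) -> str:
    """Name the model(s) that flagged a transaction, with their raw scores."""
    parts = []
    if alert.get("is_anomaly_kmeans"):
        parts.append(f"K-Means (distance={alert['distance']:.2f})")
    if alert.get("is_anomaly_statistical"):
        parts.append(f"Statistical (z-score={alert['z_score']:.2f})")
    return ", ".join(parts)


def build_alert_email(batch_id, alerts, sender, recipients) -> EmailMessage:
    """Build an HTML alert email with one table row per flagged transaction."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    rows = "".join(
        "<tr>"
        f"<td>{alert['amount']:.2f}</td>"
        f"<td>{html.escape(str(alert['amount_category']))}</td>"
        f"<td>{html.escape(triggered_models(alert))}</td>"
        "</tr>"
        for alert in alerts
    )
    body = (
        f"<p>Time: {timestamp}<br>Batch ID: {batch_id}<br>"
        f"Alerts: {len(alerts)}</p>"
        "<table border='1'><tr><th>Amount</th><th>Category</th>"
        f"<th>Triggered by</th></tr>{rows}</table>"
    )
    msg = EmailMessage()
    msg["Subject"] = f"FRAUD ALERT: batch {batch_id}, {len(alerts)} transaction(s)"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Plain-text fallback first, then the HTML alternative.
    msg.set_content(f"{len(alerts)} fraud alert(s) in batch {batch_id} at {timestamp}.")
    msg.add_alternative(body, subtype="html")
    return msg


def send_via_gmail(msg: EmailMessage, sender: str, password: str) -> None:
    """Send the message through Gmail's SMTP-over-SSL endpoint."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)
```

Separating message construction from sending makes it possible to inspect the generated HTML without contacting the mail server.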

🎨 Threat Risk Levels

| Level | Trigger |
| --- | --- |
| SAFE | is_fraud_alert = False (no models flagged) |
| ALERT | is_anomaly_statistical = True only |
| HIGH | is_anomaly_kmeans = True only |
| CRITICAL | Both models flagged simultaneously |
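
The mapping from the two model flags to a risk level can be sketched as a small function. The function name is hypothetical; the four levels and their triggers follow the table above.

```python
def risk_level(is_anomaly_kmeans: bool, is_anomaly_statistical: bool) -> str:
    """Translate the two model flags into a threat risk level."""
    if is_anomaly_kmeans and is_anomaly_statistical:
        return "CRITICAL"   # both models flagged the transaction
    if is_anomaly_kmeans:
        return "HIGH"       # K-Means only
    if is_anomaly_statistical:
        return "ALERT"      # statistical model only
    return "SAFE"           # neither model flagged it
```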

📁 Repository Structure

fraud-detection/
│
├── README.md
├── LICENSE
├── dashboard-query
├── fraud-detection-analysis.ipynb         ← ETL pipeline: ingest → medallion → model training
│
└── fraud-detection-streaming-only.ipynb   ← Real-time scoring + alerts (auto-triggered via jobs)

🛠️ Tech Stack

| Layer | Technology |
| --- | --- |
| Platform | Databricks (Serverless) |
| Streaming | PySpark Structured Streaming, Auto Loader |
| Storage | Delta Lake, Unity Catalog Volumes |
| ML | PySpark MLlib (VectorAssembler, StandardScaler, K-Means) |
| Alerting | Gmail SMTP (smtplib), HTML email |
| Dataset | Kaggle Credit Card Fraud |

🚀 Setup & Usage

Prerequisites: Databricks workspace with Unity Catalog enabled, Runtime 16.0+.

1. Run fraud-detection-analysis.ipynb top to bottom — creates all medallion tables and trains both models.

2. In fraud-detection-streaming-only.ipynb Cell 5, fill in SENDER_EMAIL, SENDER_PASSWORD, and RECIPIENT_EMAILS.

3. Run fraud-detection-streaming-only.ipynb Cells 1–6 in a single session.

4. In Databricks Jobs, create a File Arrival trigger pointing to:

/Volumes/workspace/default/fraud_detection/live_transactions/

To test the stream, drop a CSV file containing a random fraud row into that folder; the file arrival triggers the full pipeline.
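
One way to produce such a test file is a small helper that writes a single-row CSV under a unique name. This is a sketch under assumptions: the helper name is hypothetical, and the row passed in must use whatever column names the streaming notebook expects, which this README does not list.

```python
import csv
import uuid
from pathlib import Path

# Volume path from the setup steps above.
LIVE_DIR = "/Volumes/workspace/default/fraud_detection/live_transactions/"


def drop_test_csv(row: dict, directory: str = LIVE_DIR) -> Path:
    """Write one transaction as a uniquely named CSV file and return its path."""
    target = Path(directory) / f"test_txn_{uuid.uuid4().hex}.csv"
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()   # header row taken from the dict keys
        writer.writerow(row)
    return target
```

A unique file name per call matters because each new file is treated as a new arrival, so reusing a name might not trigger another run.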


🗄️ Output Tables

| Table | Description |
| --- | --- |
| bronze_transactions | Raw data + ingestion metadata |
| silver_transactions | Cleaned + feature-engineered |
| gold_kmeans_anomalies | K-Means scores on historical data |
| gold_statistical_anomalies | Z-score results on historical data |
| gold_model_performance | Precision/recall comparison of both models |
| gold_risk_scoring | Fraud probability by amount category |
| gold_temporal_patterns | Fraud rate by hour of day |
| live_fraud_alerts | Real-time fraud transactions (streaming output) |
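
For reference, the precision and recall figures compared in `gold_model_performance` follow the standard definitions. The sketch below computes them in plain Python from true labels and model flags; the function name is hypothetical, and the notebook may compute these metrics differently, for example with Spark SQL aggregations.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # fraud, flagged
    fp = sum(1 for t, p in pairs if not t and p)      # legitimate, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # fraud, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Calling it once per model with the same `y_true` gives the side-by-side comparison the table describes.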

📜 License

MIT License — see LICENSE for details.


Built by @itsbk13
