A production-grade fraud detection system built on Databricks. It ingests raw credit card transactions, processes them through a Bronze → Silver → Gold medallion architecture, trains unsupervised anomaly detection models, and scores live transactions in real time, sending an email alert as soon as a suspicious transaction is detected.
- How It Works
- Pipeline Phases
- Alert System
- Repository Structure
- Tech Stack
- Setup & Usage
- Output Tables
- License
Kaggle CSV
│
▼
Bronze Table ──► Silver Table ──► Gold Tables
(raw ingest) (clean + FE) (KPIs + model training)
│
K-Means + Statistical models trained
│
▼
UC Volume (live_transactions)
│
Auto Loader (Files)
│
▼
Silver Transformations
(filters + scaling)
│
▼
┌───────────────────────────┐
│ Real-Time Inference │
│ K-Means + Statistical │
│ → is_fraud_alert │
└─────────────┬─────────────┘
│
foreachBatch()
┌─────────┴─────────┐
▼ ▼
Delta Append Email Alert
live_fraud_alerts (Gmail SMTP)
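The inference step in the diagram combines two detectors into a single is_fraud_alert flag. The following is a minimal plain-Python sketch of that combination, not the notebook code (which runs on PySpark). The function name `score_transaction`, the thresholds `z_threshold` and `distance_threshold`, and the choice to compute the z-score on the amount alone are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Score:
    z_score: float
    distance: float
    is_anomaly_statistical: bool
    is_anomaly_kmeans: bool
    is_fraud_alert: bool


def score_transaction(features, amount, centroids, amount_mean, amount_std,
                      z_threshold=3.0, distance_threshold=5.0):
    """Score one transaction with both detectors (thresholds are assumed)."""
    # Statistical model: standard deviations between the amount and the mean.
    z_score = (amount - amount_mean) / amount_std if amount_std > 0 else 0.0

    # K-Means model: Euclidean distance to the nearest cluster centroid.
    distance = min(math.dist(features, c) for c in centroids)

    is_stat = abs(z_score) > z_threshold
    is_kmeans = distance > distance_threshold
    # A transaction raises an alert when at least one model flags it.
    return Score(z_score, distance, is_stat, is_kmeans, is_stat or is_kmeans)
```

In this sketch the centroids, mean, and standard deviation stand in for whatever the batch notebook learned from historical data; the streaming side only applies them.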
| Phase | Notebook | Output |
|---|---|---|
| 1: Ingest | fraud-detection-analysis.ipynb | bronze_transactions |
| 2: Clean & Engineer | fraud-detection-analysis.ipynb | silver_transactions |
| 3: Analytics & Modelling | fraud-detection-analysis.ipynb | 5× Gold tables, trained models |
| 4: Visualisation | fraud-detection-analysis.ipynb | 4× SQL views for dashboards |
| 5: Real-Time Streaming | fraud-detection-streaming-only.ipynb | live_fraud_alerts + email alerts |
Silver feature engineering is applied identically in both the batch and streaming paths:
- amount_category: zero / small / medium / large / very_large
- Quality filters
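One way to keep a feature such as amount_category identical in batch and stream is to define the bucketing once and reuse it in both notebooks. The sketch below is plain Python for illustration. The bucket boundaries (`small_max`, `medium_max`, `large_max`) are assumptions, not the values used in the repository, and rejecting negative amounts is an assumed stand-in for the quality filters.

```python
def amount_category(amount, small_max=10.0, medium_max=100.0, large_max=1000.0):
    """Bucket a transaction amount; the boundaries are illustrative assumptions."""
    if amount < 0:
        # Assumed behaviour: negative amounts should not pass the quality filters.
        raise ValueError("amount must be non-negative")
    if amount == 0:
        return "zero"
    if amount <= small_max:
        return "small"
    if amount <= medium_max:
        return "medium"
    if amount <= large_max:
        return "large"
    return "very_large"
```

In PySpark the same logic would typically be written as a chained `when(...).otherwise(...)` column expression so that it runs natively on both the batch and the streaming DataFrame.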
| Channel | Behaviour |
|---|---|
| Driver Log | Every batch logs a summary line. Fraud batches print a full 🚨 FRAUD ALERT block with scores |
| Gmail Email | HTML email sent via SMTP on any is_fraud_alert = True batch |
Each email includes the timestamp, batch ID, alert count, and a per-transaction table showing Amount, Category, and which model(s) triggered, together with the raw scores (distance / z-score).
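As a rough illustration of how such an email could be assembled with the Python standard library, the sketch below builds the HTML message separately from sending it. The function names, the dictionary keys (`amount`, `category`, `triggered_by`), and the HTML layout are hypothetical; only the use of smtplib and Gmail SMTP comes from this README.

```python
import smtplib
from datetime import datetime, timezone
from email.mime.text import MIMEText
from html import escape


def build_alert_email(batch_id, alerts, sender, recipients):
    """Build an HTML alert email; `alerts` is a list of dicts (assumed shape)."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(f"{a['amount']:.2f}"),
            escape(a["category"]),
            escape(a["triggered_by"]),
        )
        for a in alerts
    )
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    html = (
        f"<h2>Fraud alert: batch {batch_id}</h2>"
        f"<p>Time: {timestamp}<br>Alerts: {len(alerts)}</p>"
        "<table border='1'><tr><th>Amount</th><th>Category</th>"
        "<th>Triggered by</th></tr>" + rows + "</table>"
    )
    msg = MIMEText(html, "html")
    msg["Subject"] = f"Fraud alert: {len(alerts)} transaction(s) in batch {batch_id}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    return msg


def send_alert_email(msg, sender, password, recipients):
    """Send the message through Gmail's SSL SMTP endpoint."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.sendmail(sender, recipients, msg.as_string())
```

Keeping the builder free of network calls makes it easy to check the email contents without credentials. Gmail generally requires an app password rather than the account password for SMTP logins.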
| Level | Trigger |
|---|---|
| SAFE | is_fraud_alert = False — no models flagged |
| ALERT | is_anomaly_statistical = True only |
| HIGH | is_anomaly_kmeans = True only |
| CRITICAL | Both models flagged simultaneously |
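The severity table above maps directly onto the two model flags. A minimal sketch of that mapping, with a hypothetical function name `severity_level`:

```python
def severity_level(is_anomaly_kmeans, is_anomaly_statistical):
    """Map the two model flags to the severity levels listed in the table."""
    if is_anomaly_kmeans and is_anomaly_statistical:
        return "CRITICAL"  # both models flagged
    if is_anomaly_kmeans:
        return "HIGH"      # K-Means only
    if is_anomaly_statistical:
        return "ALERT"     # statistical model only
    return "SAFE"          # no model flagged
```

Checking the combined case first keeps the remaining branches limited to the "only" cases from the table.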
fraud-detection/
│
├── README.md
├── LICENSE
├── dashboard-query
├── fraud-detection-analysis.ipynb ← ETL pipeline: ingest → medallion → model training
│
└── fraud-detection-streaming-only.ipynb ← Real-time scoring + alerts (auto-triggered via jobs)
| Layer | Technology |
|---|---|
| Platform | Databricks (Serverless) |
| Streaming | PySpark Structured Streaming, Auto Loader |
| Storage | Delta Lake, Unity Catalog Volumes |
| ML | PySpark MLlib (VectorAssembler, StandardScaler, K-Means) |
| Alerting | Gmail SMTP (smtplib), HTML email |
| Dataset | Kaggle Credit Card Fraud |
Prerequisites: Databricks workspace with Unity Catalog enabled, Runtime 16.0+.
1. Run fraud-detection-analysis.ipynb top to bottom — creates all medallion tables and trains both models.
2. In fraud-detection-streaming-only.ipynb Cell 5, fill in SENDER_EMAIL, SENDER_PASSWORD, and RECIPIENT_EMAILS.
3. Run fraud-detection-streaming-only.ipynb Cells 1–6 in a single session.
4. In Databricks Jobs, create a File Arrival trigger pointing to:
/Volumes/workspace/default/fraud_detection/live_transactions/
5. Test the stream: drop a CSV file containing a single fraud row into the path above to trigger the full pipeline.
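For the test step, a small helper can write a one-row CSV with a unique file name so that each drop registers as a new file arrival. This is a sketch under assumptions: the helper name `drop_test_transaction` is hypothetical, and the columns in the usage example are placeholders that must be replaced with the schema the streaming notebook actually expects.

```python
import csv
import uuid
from pathlib import Path


def drop_test_transaction(row, target_dir):
    """Write `row` (a dict) as a one-row CSV with a unique name in `target_dir`."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # A unique name avoids overwriting an earlier test file.
    path = target / f"test_txn_{uuid.uuid4().hex}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)
    return path


# Placeholder columns only; use the real schema of the live_transactions feed.
# drop_test_transaction({"Time": 0, "Amount": 9999.0, "Class": 1},
#                       "/Volumes/workspace/default/fraud_detection/live_transactions/")
```

On Databricks, Unity Catalog volume paths can usually be written with ordinary Python file I/O, which is what this helper relies on.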
| Table | Description |
|---|---|
| bronze_transactions | Raw data + ingestion metadata |
| silver_transactions | Cleaned + feature-engineered |
| gold_kmeans_anomalies | K-Means scores on historical data |
| gold_statistical_anomalies | Z-score results on historical data |
| gold_model_performance | Precision/recall comparison of both models |
| gold_risk_scoring | Fraud probability by amount category |
| gold_temporal_patterns | Fraud rate by hour of day |
| live_fraud_alerts | Real-time fraud transactions (streaming output) |
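The precision/recall comparison in gold_model_performance can be reproduced for either model from its anomaly flags and the dataset's ground-truth labels. The following plain-Python sketch assumes such a label is available for the historical data; the function name `precision_recall` is hypothetical and the notebook itself may compute these figures differently.

```python
def precision_recall(labels, flags):
    """Return (precision, recall) for binary fraud labels and model flags."""
    pairs = list(zip(labels, flags))
    tp = sum(1 for y, f in pairs if y and f)      # fraud, flagged
    fp = sum(1 for y, f in pairs if not y and f)  # legitimate, flagged
    fn = sum(1 for y, f in pairs if y and not f)  # fraud, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running it once with the K-Means flags and once with the statistical flags gives the two rows needed for a side-by-side comparison.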
MIT License — see LICENSE for details.
Built by @itsbk13