An end-to-end production ML pipeline that predicts industrial machine failures before they happen.
Optimizes for total business cost in dollars — not accuracy, not F1.
Try it now — no setup, no API key required:
👉 https://predictive-maintenance-deep-shah.streamlit.app/
The live application allows you to:
- Adjust real-time sensor sliders and watch the failure probability gauge update instantly
- Upload a CSV of machine readings and get a full fleet risk assessment in seconds
- Explore the business dashboard — compare reactive vs preventive vs AI-driven maintenance costs
- Drag the decision threshold slider and watch FP/FN counts and total cost update live
- Inspect SHAP waterfall charts that explain exactly why the model flagged any specific machine
18.9% failure probability — SAFE. H-type machine with 25 minutes of tool wear, 2000 RPM, and 30 Nm torque. The gauge, risk badge, and cost analysis update in real time as sliders are adjusted. Cost if ignored: $1,891. Preventive maintenance: $500. Model recommendation: Save $1,391. All four tabs — Live Prediction, Batch Analysis, Business Dashboard, and Model Explainability — are visible in the tab bar.
Physics-derived features at the bottom — Temp Differential: 9.50 K (above the 8.6 K Heat Dissipation threshold), Mechanical Power: 60,842 W (within safe operating range), Force Ratio: 0.01509 (well below the 0.035 Overstrain threshold). All three engineered features confirm the machine is operating within healthy parameters.
81.5% failure probability — DANGER. L-type machine with Temp_Diff of 7.40 K (below the 8.6 K Heat Dissipation threshold), 1,208 RPM (low), and 65 Nm torque. DANGER badge fires immediately. Multiple simultaneous failure modes detected — the model identifies the specific physical mechanisms, not just a probability score.
Expected cost if ignored: $8,151. Preventive maintenance cost: $500. Model recommendation: Save $7,651 — act immediately. Physics features confirm the failure signal: Temp Differential at 7.40 K (below the 8.6 K threshold). The 20:1 cost asymmetry ($10,000 failure vs $500 inspection) makes the maintenance decision unambiguous.
12 machines analyzed in one CSV upload. Fleet summary: 2 CRITICAL (16.7%), 3 MONITOR, 7 SAFE — $39,965 total cost at risk. The failure probability distribution chart separates the healthy cluster (left) from the at-risk machines (right of the DANGER threshold line). Maintenance teams get an immediate prioritized action list.
Machines sorted by failure probability descending. MACHINE-003 and MACHINE-004 flagged DANGER in red (81.2% and 80.9% — both L-type with tool wear 240+ minutes). Three MONITOR machines follow in orange. Color-coded Risk_Level column and Expected_Cost_$ give maintenance teams an immediate dollar-ranked action list. Full results downloadable as CSV.
1,000-machine fleet simulation: Reactive maintenance costs $340,000/year. Full preventive costs $500,000/year. This model costs $79,000/year — catching 32 of 34 failures (94% recall). Savings vs reactive: $261,000 (76.8%). Savings vs full preventive: $421,000 (84.2%). Fleet size, failure rate, and cost parameters are all adjustable.
LightGBM selected as champion via 5-fold cross-validated F1 mean (0.7857) — not by test-set score. CatBoost ranks second with lower CV std (0.051 vs 0.063), indicating more stable folds. Champion selection by CV score prevents the model selection bias that occurs when the test set is used to pick between models.
SHAP waterfall for a healthy H-type machine (18.9% failure probability). Blue bars dominate — high RPM (2000), low tool wear (25 min), and H-type quality tier all push the prediction strongly away from failure. The baseline probability (~3.4%, the dataset failure rate) is adjusted downward by each safe feature. This tells the operator not just "safe" but exactly which sensor readings are responsible for the clean health signal.
SHAP waterfall for a critical L-type machine (81%+ failure probability). Red bars dominate — Tool Wear at 240 min, high Force Ratio from low RPM + high torque, and collapsed Temp_Diff all push the prediction strongly toward failure. The global feature importance chart below confirms these are the model's top features across all training data, not just this one prediction. An interviewer or business stakeholder can verify exactly which physical signals drove the alarm.
- The Business Problem
- What Makes This Different
- System Architecture
- Technical Decisions & Rationale
- Results
- Business Impact
- Repository Structure
- Quickstart
- Streamlit App
- FastAPI — REST Endpoints
- Docker Deployment
- Drift Detection & Monitoring
- Running Tests
- Dataset
Every hour of unplanned downtime in heavy manufacturing costs between $10,000 and $250,000 depending on the industry. Yet the two standard maintenance strategies are both fundamentally broken:
| Strategy | What Goes Wrong | Hidden Cost |
|---|---|---|
| Reactive | Wait for failure, then fix it | Emergency repair + full production halt |
| Preventive (fixed schedule) | Service everything on a calendar | Replacing healthy components, unnecessary labor |
Predictive maintenance is the only strategy that is neither wasteful nor dangerous. It uses real-time sensor data to generate a maintenance alert only when a specific machine is genuinely showing signs of imminent failure — catching the failure before it happens, touching nothing that doesn't need attention.
This project builds a full production-structured ML pipeline on the AI4I 2020 Predictive Maintenance Dataset (UCI / Kaggle) — a realistic simulation of CNC machine sensor telemetry across 10,000 operating cycles with a 97:3 healthy-to-failure class ratio.
The majority of ML classification projects optimize for accuracy. Accuracy is the wrong metric for this problem. On a factory floor, errors are not symmetric:
- A missed failure (False Negative) = unplanned downtime, possible safety incident → $10,000
- A false alarm (False Positive) = a technician dispatched unnecessarily → $500
That is a 20:1 cost asymmetry. Every decision in this pipeline flows from that single insight.
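To make the asymmetry concrete, here is a minimal sketch that scores two hypothetical classifiers in dollars rather than by error count. The function name `total_cost` and both sets of error counts are illustrative assumptions; only the $500 and $10,000 unit costs come from the text above.

```python
FP_COST = 500      # false alarm: technician dispatched unnecessarily
FN_COST = 10_000   # missed failure: unplanned downtime

def total_cost(fp: int, fn: int) -> int:
    """Dollar cost of a set of predictions: (FP x $500) + (FN x $10,000)."""
    return fp * FP_COST + fn * FN_COST

# Hypothetical model A: few errors overall, but most are missed failures.
cost_a = total_cost(fp=10, fn=20)    # 30 errors in total
# Hypothetical model B: five times as many errors, almost all false alarms.
cost_b = total_cost(fp=150, fn=3)    # 153 errors in total

print(cost_a, cost_b)  # 205000 105000
```

Model B makes far more mistakes yet costs roughly half as much, which is why an accuracy-style metric can rank the two models the wrong way round.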
| What a standard ML project does | What this pipeline does |
|---|---|
| Optimize accuracy or generic F1 | Optimize total dollar cost: (FP × $500) + (FN × $10,000) |
| Single train/test split | 3-way stratified split — train (60%) / val (20%) / test (20%) |
| Decision threshold fixed at 0.5 | Threshold searched on validation set, reported on test set |
| GridSearchCV on F1 | GridSearchCV on a custom business-cost scorer |
| SMOTE applied to the full dataset | SMOTE inside CV folds only — no synthetic leakage |
| Pick champion by test-set F1 | Pick champion by 5-fold cross-validated F1 mean |
| No unit tests | 14 pytest unit tests covering all core functions |
| Black-box predictions only | SHAP waterfall explains every individual prediction |
| Notebook only | Streamlit app + FastAPI + Docker + drift monitoring |
Raw CSV (Google Drive / local cache)
                   │
                   ▼
┌─────────────────────────────────────┐
│ data_ingestion.py │
│ Download → Schema validation │
│ Deduplication → Null audit │
│ Target column sanity check │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ feature_engineering.py │
│ Physics feature creation │
│ Drop leakage columns │
│ 3-way stratified split (60/20/20) │
└───────┬─────────────┬───────────────┘
        │             │
     X_train    X_val, X_test
     y_train    y_val, y_test
        │             │
        ▼             │
┌─────────────────────────────────────┐
│ modeling.py │
│ 9-model zoo benchmarked via │
│ 5-fold StratifiedKFold CV │
│ │
│ Each fold pipeline: │
│ preprocessor (fit on fold only) │
│ → SMOTE (train fold only) │
│ → classifier │
│ │
│ Champion = highest CV_F1_Mean │
│ GridSearchCV on business-cost │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│ evaluation.py │
│ optimize_threshold(X_val, y_val) │ ← val set ONLY
│ Final report on (X_test, y_test) │ ← test set, first touch here
│ Confusion matrix · ROC · Features │
│ Save model → artifacts/models/ │
└─────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ PRODUCTION SYSTEM │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ streamlit_app.py │ │
│ │ │ │
│ │ Tab 1: Live Prediction — gauge + risk + cost │ │
│ │ Tab 2: Batch Analysis — fleet CSV upload │ │
│ │ Tab 3: Business Dashboard — cost comparison │ │
│ │ Tab 4: Model Explainability — SHAP waterfall │ │
│ └──────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┐ │ ┌──────────────────────────┐ │
│ │ api/main.py │ │ │ monitoring.py │ │
│ │ POST /predict │───┘ │ KS drift detection │ │
│ │ POST /predict-batch │ │ → drift_alerts.csv │ │
│ │ GET /health │ └──────────────────────────┘ │
│ └──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ lightgbm_champion.pkl │ │
│ │ + SHAP TreeExplainer │ │
│ └────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Docker / docker-compose (port 8000) │ │
│ └───────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Three features were engineered from first principles of thermodynamics and rotational mechanics rather than feeding raw sensor readings directly into the model.
| Feature | Formula | Physical Interpretation |
|---|---|---|
| Temp_Diff | Process Temp − Air Temp | Thermal gradient: a rising value signals heat retention preceding thermal failure |
| Power | Torque [Nm] × RPM | Mechanical power input to spindle: sustained peaks accelerate tool wear |
| Force_Ratio | Torque / (RPM + ε) | Load per revolution: high ratio at low speed indicates heavy cutting conditions |
The ε = 1e-5 guard in Force_Ratio prevents division by zero. The SHAP global feature importance chart ranks Power 2nd and Temp_Diff 3rd, behind only Tool wear [min] and above every other raw sensor reading. The domain-driven features outperformed most of the raw sensor data, and SHAP makes this empirically verifiable on any individual prediction.
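The three formulas in the table can be sketched in a few lines. The function name `physics_features`, its argument names, and the sample reading are assumptions made for illustration; the project's own implementation in feature_engineering.py may be organised differently.

```python
EPSILON = 1e-5  # guard against division by zero when RPM is 0

def physics_features(air_temp_k, process_temp_k, rpm, torque_nm):
    """Return the three engineered features for one sensor reading."""
    return {
        "Temp_Diff": process_temp_k - air_temp_k,      # thermal gradient in K
        "Power": torque_nm * rpm,                      # Torque [Nm] x RPM, as in the table
        "Force_Ratio": torque_nm / (rpm + EPSILON),    # load per revolution
    }

# Illustrative reading: hot, slow, heavily loaded spindle.
features = physics_features(air_temp_k=302.0, process_temp_k=309.0,
                            rpm=1200, torque_nm=65.0)
print(features)  # Temp_Diff 7.0, Power 78000.0, Force_Ratio about 0.0542

# A stalled spindle (0 RPM) yields a very large ratio instead of raising an error.
stalled = physics_features(air_temp_k=300.0, process_temp_k=310.0, rpm=0, torque_nm=10.0)
```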
The dataset is 96.6% healthy machines and 3.4% failures. Three decisions handle this correctly:
- Stratified splits preserve the 3.4% failure rate across all three subsets.
- SMOTE inside CV folds via imblearn.Pipeline ensures synthetic minority samples are generated from training data only. The common mistake of applying SMOTE before CV inflates CV metrics by leaking synthetic copies of validation samples into training folds.
- Business-cost scorer explicitly encodes the 20:1 class cost asymmetry into hyperparameter search.
If the decision threshold were optimised on the test set and then reported on the same set, the reported cost would be the minimum achievable on that specific sample — overly optimistic and non-generalising. The validation set is used exclusively for threshold search. The test set is touched exactly once — in evaluation.py — for the final unbiased report.
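The separation between threshold search and final reporting can be sketched as follows. The helper names `cost_at_threshold` and `search_threshold`, the 0.01 grid step, and the toy labels and probabilities are all assumptions for illustration, not the project's evaluation.py code.

```python
FP_COST, FN_COST = 500, 10_000

def cost_at_threshold(y_true, y_prob, threshold):
    """Dollar cost when every probability >= threshold is flagged as a failure."""
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    return fp * FP_COST + fn * FN_COST

def search_threshold(y_val, p_val):
    """Grid-search the threshold on validation data; ties keep the lowest threshold."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: cost_at_threshold(y_val, p_val, t))

# Toy validation and test sets: labels and predicted failure probabilities.
y_val, p_val = [0, 0, 0, 0, 1, 1], [0.05, 0.10, 0.40, 0.20, 0.35, 0.90]
y_test, p_test = [0, 0, 1, 0, 1], [0.02, 0.30, 0.45, 0.50, 0.80]

threshold = search_threshold(y_val, p_val)                  # test set not consulted
test_cost = cost_at_threshold(y_test, p_test, threshold)    # single, final evaluation
print(threshold, test_cost)  # 0.21 1000
```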
Selecting the champion model by test-set score introduces model selection bias. Once you use the test set to make a decision, it is no longer a clean estimate of generalisation. All 9 models are ranked by 5-fold cross-validated F1 mean. The test set is only used for the final report after both champion and threshold are locked in.
GridSearchCV minimizes (FP × $500) + (FN × $10,000) via a custom make_scorer with greater_is_better=False. The tuner directly searches for the configuration that saves the most money — not the one that maximises an abstract metric.
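A hedged sketch of such a scorer with scikit-learn. The names `business_cost` and `cost_scorer` and the `DummyClassifier` stand-in are assumptions for illustration; the project's actual scorer may differ in detail.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, make_scorer

FP_COST, FN_COST = 500, 10_000

def business_cost(y_true, y_pred):
    """Total dollar cost of a set of predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp * FP_COST + fn * FN_COST)

# greater_is_better=False makes scikit-learn negate the value, so the
# "higher score is better" convention still holds inside GridSearchCV.
cost_scorer = make_scorer(business_cost, greater_is_better=False)

# Toy check: a model that never raises an alarm misses both failures.
X = np.zeros((6, 1))
y = np.array([0, 0, 0, 0, 1, 1])
never_alarm = DummyClassifier(strategy="constant", constant=0).fit(X, y)
print(cost_scorer(never_alarm, X, y))  # -20000
```

Passing `scoring=cost_scorer` to `GridSearchCV` then selects the hyperparameters with the lowest cost, because the least negative score wins.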
Type encodes a genuine quality tier: L (Low) < M (Medium) < H (High). OrdinalEncoder with categories=[['L', 'M', 'H']] preserves this ordering as integers (0, 1, 2). OneHotEncoder would discard the ordinal structure. The handle_unknown='use_encoded_value', unknown_value=-1 guard ensures the pipeline never crashes on unseen categories at inference time.
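The encoder configuration described above behaves as follows on a toy column. The category `"X"` is a made-up unseen value used only to show the fallback.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[["L", "M", "H"]],           # explicit order: L < M < H
    handle_unknown="use_encoded_value",
    unknown_value=-1,                       # unseen categories map to -1
)
encoder.fit(np.array([["L"], ["M"], ["H"]], dtype=object))

codes = encoder.transform(np.array([["H"], ["L"], ["M"], ["X"]], dtype=object))
print(codes.ravel().tolist())  # [2.0, 0.0, 1.0, -1.0]
```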
Most production ML deployments are black boxes. A maintenance technician who receives a "DANGER" alert needs to know which sensor triggered it — not just the probability. The SHAP TreeExplainer is initialised once per session (cached via st.cache_resource) and computes exact Shapley values for any input in milliseconds.
The implementation uses the LightGBM classifier extracted from the sklearn Pipeline:
classifier = model.named_steps["model"]
explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X_transformed)  # X already preprocessed
The SHAP tab includes four quick-load presets (Safe, Critical Danger, Heat Dissipation Risk, Tool Wear Limit) so any visitor to the live app can see the explainability working within seconds, with no manual slider adjustment required.
| Rank | Model | CV F1 Mean | CV F1 Std | CV AUC | Test F1 | Test AUC |
|---|---|---|---|---|---|---|
| 🥇 | LightGBM | 0.7857 | 0.0626 | 0.9707 | 0.7808 | 0.9847 |
| 🥈 | CatBoost | 0.7758 | 0.0512 | 0.9709 | 0.7200 | 0.9782 |
| 🥉 | XGBoost | 0.7543 | 0.0615 | 0.9638 | 0.7125 | 0.9799 |
| 4 | Random Forest | 0.7346 | 0.0522 | 0.9698 | 0.7355 | 0.9727 |
| 5 | Gradient Boosting | 0.6227 | 0.0217 | 0.9726 | 0.5957 | 0.9794 |
| 6 | Decision Tree | 0.5953 | 0.0370 | 0.8653 | 0.6067 | 0.8826 |
| 7 | SVC | 0.4972 | 0.0263 | 0.9621 | 0.4917 | 0.9731 |
| 8 | Logistic Regression | 0.2857 | 0.0147 | 0.9191 | 0.3021 | 0.9316 |
| 9 | Gaussian NB | 0.2654 | 0.0200 | 0.9075 | 0.2821 | 0.9038 |
LightGBM vs CatBoost: LightGBM wins on CV F1 mean (0.786 vs 0.776). CatBoost has lower CV std (0.051 vs 0.063) — more stable across folds. In production, an ensemble of both would be the natural next step.
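The selection rule can be checked against the leaderboard figures directly. The sketch below copies the top four rows of the table and ranks them both ways; the variable names are illustrative.

```python
# (model, CV F1 mean, test F1) for the top four rows of the leaderboard
leaderboard = [
    ("LightGBM", 0.7857, 0.7808),
    ("CatBoost", 0.7758, 0.7200),
    ("XGBoost", 0.7543, 0.7125),
    ("Random Forest", 0.7346, 0.7355),
]

by_cv = [row[0] for row in sorted(leaderboard, key=lambda r: r[1], reverse=True)]
by_test = [row[0] for row in sorted(leaderboard, key=lambda r: r[2], reverse=True)]

champion = by_cv[0]   # selection uses cross-validated scores only
print(champion)
print(by_cv)
print(by_test)
```

The champion is the same under both orderings here, but the ranks below it are not: Random Forest would move to second place if test-set F1 were used. That sensitivity is the reason the test set is kept out of the selection step.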
Threshold optimized on validation set: 0.32
              precision    recall  f1-score   support

           0     0.9977    0.9063    0.9498      1932
           1     0.2612    0.9412    0.4089        68

    accuracy                         0.9075      2000
   macro avg     0.6295    0.9238    0.6794      2000
weighted avg     0.9652    0.9075    0.9320      2000
The model catches 64 of 68 actual failures (94.1% recall). 4 failures missed. 181 false alarms — a deliberate trade-off given a missed failure costs 20× more than a false alarm.
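The reported failure-class metrics follow from the four confusion-matrix counts. This short check recomputes them from the figures above (64 caught, 4 missed, 181 false alarms, 2,000 test rows); it is a verification sketch, not part of the pipeline.

```python
tp, fn, fp = 64, 4, 181
total = 2000
tn = total - tp - fn - fp          # 1751 healthy machines correctly left alone

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
cost = fp * 500 + fn * 10_000

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4), cost)
# 0.2612 0.9412 0.4089 0.9075 130500
```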
Confusion Matrix
64 failures correctly flagged. 4 missed at $10,000 each ($40,000). 181 false alarms at $500 each ($90,500). Total projected test-set cost: $130,500.
ROC Curve
AUC = 0.9847. The curve immediately reaches ~80% True Positive Rate at near-zero False Positive Rate.
Feature Importance
Tool wear [min] ranks first. Power and Temp_Diff, both engineered features, rank 2nd and 3rd, above every other raw sensor reading. This validates the domain engineering. The SHAP tab in the live app shows these same rankings at the individual-prediction level.
| Outcome | Count | Unit Cost | Total |
|---|---|---|---|
| False Negatives — missed failures | 4 | $10,000 | $40,000 |
| False Positives — unnecessary inspections | 181 | $500 | $90,500 |
| Total projected cost | | | $130,500 |
| Strategy | Failures Caught | Annual Cost | Saving vs Reactive |
|---|---|---|---|
| Reactive — wait for breakdown | 0% | $340,000 | — |
| Preventive — fixed schedule | 100% | $500,000 | −$160,000 |
| This Model — LightGBM, threshold 0.32 | 94% | $79,000 | $261,000 (76.8%) |
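The baseline figures in the table can be reproduced with simple arithmetic. The sketch assumes a 1,000-machine fleet at the dataset's 3.4% failure rate, one $500 service per machine per year under the preventive strategy, and $10,000 per breakdown under the reactive one. The model's $79,000 is taken from the table rather than recomputed, and the dashboard's own calculation may differ.

```python
FLEET_SIZE = 1_000
FAILURE_RATE = 0.034          # dataset failure rate
FAILURE_COST = 10_000         # unplanned breakdown
INSPECTION_COST = 500         # one preventive service
MODEL_COST = 79_000           # taken from the table, not recomputed here

expected_failures = round(FLEET_SIZE * FAILURE_RATE)   # 34
reactive = expected_failures * FAILURE_COST             # every failure becomes a breakdown
preventive = FLEET_SIZE * INSPECTION_COST               # every machine is serviced

def saving(baseline, cost):
    """Absolute and percentage saving of `cost` relative to `baseline`."""
    diff = baseline - cost
    return diff, round(100 * diff / baseline, 1)

print(reactive, preventive)             # 340000 500000
print(saving(reactive, MODEL_COST))     # (261000, 76.8)
print(saving(preventive, MODEL_COST))   # (421000, 84.2)
```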
predictive-maintenance-engine/
│
├── assets/
│ └── screenshots/
│ ├── 01_live_prediction_safe.png
│ ├── 02_cost_analysis_safe.png
│ ├── 03_live_prediction_danger.png
│ ├── 04_cost_analysis_danger.png
│ ├── 05_batch_analysis_summary.png
│ ├── 06_batch_analysis_table.png
│ ├── 07_business_dashboard.png
│ ├── 08_model_leaderboard.png
│ ├── 09_shap_safe_machine.png
│ └── 10_shap_critical_danger.png
│
├── artifacts/ # Auto-generated — gitignored
│ ├── graphs/
│ │ ├── confusion_matrix.png
│ │ ├── roc_curve.png
│ │ └── feature_importance.png
│ └── model_leaderboard.csv
│
├── api/
│ ├── __init__.py
│ └── main.py # /predict, /predict-batch, /health
│
├── src/
│ ├── __init__.py
│ ├── config.py
│ ├── data_ingestion.py
│ ├── feature_engineering.py
│ ├── modeling.py
│ └── evaluation.py
│
├── tests/
│ └── test_pipeline.py # 14 pytest unit tests
│
├── main_execution.ipynb # Training pipeline (Colab)
├── run_pipeline.py # Training pipeline (local)
├── streamlit_app.py # Streamlit dashboard (4 tabs)
├── monitoring.py # KS drift detection
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
Visit https://predictive-maintenance-deep-shah.streamlit.app/ directly in your browser.
1. Upload the project to Google Drive:
MyDrive/
└── predictive-maintenance-engine/
├── src/
├── api/
├── tests/
├── streamlit_app.py
├── monitoring.py
└── requirements.txt
2. Open main_execution.ipynb in Google Colab and run all cells.
The pipeline mounts Drive, downloads the dataset automatically via gdown, trains all 9 models, tunes the champion, and saves every artifact back to Drive.
git clone https://github.com/DeepShah111/predictive-maintenance-engine.git
cd predictive-maintenance-engine
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Mac/Linux
pip install -r requirements.txt
python run_pipeline.py
streamlit run streamlit_app.py
# → http://localhost:8501
| Tab | What it does |
|---|---|
| ⚡ Live Prediction | Sensor sliders → real-time failure probability gauge + risk level + cost impact |
| 📂 Batch Analysis | Upload CSV → ranked fleet risk table + distribution chart + downloadable results |
| 📊 Business Dashboard | Strategy cost comparison + live threshold slider with FP/FN/cost update |
| 🔍 Model Explainability | SHAP waterfall chart + global feature importance — explains any prediction in plain English |
The Model Explainability tab uses shap.TreeExplainer on the LightGBM classifier to compute exact Shapley values for any sensor reading. Four quick-load presets are included so any user can immediately see the explainability working:
| Preset | What it demonstrates |
|---|---|
| Safe Machine (H-type) | Blue bars dominate — high RPM, fresh tool, H-tier push prediction toward safe |
| Critical Danger (L-type) | Red bars dominate — Tool Wear, Force Ratio, Temp_Diff all fire simultaneously |
| Heat Dissipation Risk | Temp_Diff below 8.6 K threshold as the primary red bar |
| Tool Wear Limit | Tool Wear at 253 min (maximum) as the single dominant red bar |
Live deployment: https://predictive-maintenance-deep-shah.streamlit.app/
uvicorn api.main:app --reload --port 8000
# → http://localhost:8000/docs
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Model loaded status, threshold, version |
| POST | /predict | Single reading → probability + risk level + recommended action |
| POST | /predict-batch | List of readings → predictions + fleet summary |
Example — Single Prediction:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"machine_type": "L",
"air_temperature_K": 302.0,
"process_temperature_K": 309.0,
"rotational_speed_rpm": 1200,
"torque_Nm": 65.0,
"tool_wear_min": 240,
"machine_id": "MACHINE-001"
}'
Expected response:
{
"machine_id": "MACHINE-001",
"failure_probability": 0.812,
"failure_probability_pct": 81.2,
"risk_level": "DANGER",
"recommended_action": "IMMEDIATE maintenance required. Take machine offline.",
"expected_cost_if_ignored": 8120.0,
"physics_features": {
"Temp_Diff": 7.0,
"Power": 78000.0,
"Force_Ratio": 0.054167
},
"model_name": "Lightgbm",
"threshold_used": 0.32
}
# Build and run
docker compose up --build
# → API available at http://localhost:8000
# Stop
docker compose down
The artifacts/ directory is mounted as a read-only volume so the container always uses the latest trained model without a rebuild.
The monitoring.py module detects covariate shift between training and production data using the Kolmogorov-Smirnov test (α = 0.05).
from monitoring import DriftMonitor
import pandas as pd

monitor = DriftMonitor()
alerts = monitor.check_drift(pd.read_csv("new_readings.csv"), tag="production_batch_1")
if alerts:
    for a in alerts:
        print(f"DRIFT: {a['feature']} — shift {a['mean_shift_pct']:.1f}%")
CLI usage:
python monitoring.py --csv new_sensor_data.csv --tag production_jan_2025
All alerts are logged to artifacts/drift_alerts.csv with timestamp, KS statistic, p-value, and mean shift percentage.
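For readers unfamiliar with the test itself, here is a minimal sketch of a per-feature KS drift check using SciPy. The helper `feature_drifted` and the simulated torque readings are assumptions for illustration and are not the `DriftMonitor` implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # significance level used by the monitoring module

def feature_drifted(reference, current, alpha=ALPHA):
    """Two-sample KS test: drifted is True when the samples differ significantly."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha), float(statistic), float(p_value)

rng = np.random.default_rng(0)
train_torque = rng.normal(loc=40.0, scale=10.0, size=500)   # hypothetical training readings
prod_torque = rng.normal(loc=55.0, scale=10.0, size=500)    # production mean shifted upward

shifted = feature_drifted(train_torque, prod_torque)    # large shift, drift expected
same = feature_drifted(train_torque, train_torque)      # identical samples, statistic 0
print(shifted[0], same[0])
```

The KS test compares whole empirical distributions, so it reacts to changes in spread or shape as well as to a shifted mean.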
python -m pytest tests/ -v
collected 14 items
tests/test_pipeline.py::test_physics_features_columns_created PASSED
tests/test_pipeline.py::test_physics_features_temp_diff_value PASSED
tests/test_pipeline.py::test_physics_features_power_value PASSED
tests/test_pipeline.py::test_physics_features_no_infinities PASSED
tests/test_pipeline.py::test_leakage_cols_dropped_after_split PASSED
tests/test_pipeline.py::test_get_preprocessor_returns_column_transformer PASSED
tests/test_pipeline.py::test_clean_data_removes_duplicates PASSED
tests/test_pipeline.py::test_clean_data_index_is_contiguous PASSED
tests/test_pipeline.py::test_build_features_and_split_returns_six_objects PASSED
tests/test_pipeline.py::test_build_features_and_split_sizes PASSED
tests/test_pipeline.py::test_build_features_and_split_class_balance PASSED
tests/test_pipeline.py::test_total_cost_metric_correct_value PASSED
tests/test_pipeline.py::test_total_cost_metric_degenerate_returns_inf PASSED
tests/test_pipeline.py::test_schema_validation_raises_on_missing_columns PASSED
14 passed in ~18s
AI4I 2020 Predictive Maintenance Dataset
| Property | Value |
|---|---|
| Source | UCI ML Repository · Kaggle |
| Rows | 10,000 |
| Features used | 9 (5 raw numerical + 1 categorical + 3 physics-derived) |
| Target | Machine failure (binary: 0 = healthy, 1 = failure) |
| Class distribution | 96.6% healthy / 3.4% failure |
| Leakage columns dropped | UDI, Product ID, TWF, HDF, PWF, OSF, RNF |
The leakage columns (TWF through RNF) are individual failure-mode sub-flags set to 1 only when Machine failure is also 1. Keeping them would let the model read the answer directly — they are dropped before any modelling step. The dataset downloads automatically on first run via gdown.
Built as a portfolio project demonstrating production ML engineering practices.
Structured for clarity, correctness, and interview-readiness.
🚀 Live Demo | 📁 GitHub












