An end-to-end MLOps analysis and execution platform that uses multi-agent LLM orchestration to evaluate machine learning projects, execute training pipelines, track experiments, register models, detect drift, and generate retraining recommendations, all from a single GitHub repository URL.
Point the platform at any public ML repository. Within minutes it produces a comprehensive readiness report covering dataset quality, training methodology, evaluation rigour, MLOps maturity, model monitoring health, and a data-driven retraining recommendation, scored 0-10 with actionable remediation steps.
Beyond analysis, the platform can execute the training pipeline inside an isolated Docker container, collect metrics and artifacts, register the resulting model, and track it through a full lifecycle (candidate → approved → production → archived).
| Feature | Description |
|---|---|
| Project Analysis | Scans repo structure, detects ML frameworks, maps training and evaluation entrypoints |
| Dataset Analysis | Inspects CSV/Parquet/JSON datasets: schema, nulls, class balance, PSI/KS/JS drift |
| Training Analysis | Reviews training code quality, reproducibility, hyperparameter management |
| Evaluation Analysis | Assesses evaluation methodology, metric appropriateness, test-set integrity |
| MLOps Readiness | Checks CI/CD, containerisation, dependency pinning, experiment tracking setup |
| Training Execution | Runs training in an isolated Docker sandbox, streams logs, captures metrics |
| Evaluation Execution | Executes evaluation scripts post-training, records final performance metrics |
| Experiment Tracking | Persists metrics and artifacts per run; compare up to 5 runs side-by-side |
| Model Registry | Register, annotate, and promote models through a governed lifecycle |
| Deployment Readiness | Evaluates serving infrastructure, API contracts, rollback procedures |
| Model Monitoring | Analyses monitoring instrumentation, computes drift scores across features |
| Drift Detection | PSI (Population Stability Index), KS test, Jensen-Shannon divergence |
| Retraining Intelligence | Deterministic trigger extraction + LLM rationale: urgency, strategy, plan |

┌─────────────────────────────────────────────────────────────┐
│                     Next.js 15 Frontend                     │
│     Analyze │ Reports │ Experiments │ Models │ Compare      │
└──────────────────────────┬──────────────────────────────────┘
                           │ REST + SSE
┌──────────────────────────▼──────────────────────────────────┐
│                       FastAPI Backend                       │
│      /projects  /runs  /reports  /experiments  /models      │
└──────────────────────────┬──────────────────────────────────┘
                           │ LangGraph StateGraph
┌──────────────────────────▼──────────────────────────────────┐
│                  8-Agent Analysis Pipeline                  │
│  Project → Dataset → Training → Evaluation → MLOps →        │
│  Monitoring → Retraining → Report                           │
└──────┬─────────────────────────────────┬────────────────────┘
       │                                 │
┌──────▼──────┐                ┌─────────▼──────────┐
│ PostgreSQL  │                │   Docker Sandbox   │
│  (Alembic)  │                │ Training/Eval Exec │
└─────────────┘                └────────────────────┘
- Framework: Next.js 15 with React Server Components and client components
- Styling: Tailwind CSS + shadcn/ui design system
- Real-time: Custom SSE hook (`usePipelineStream`) with exponential-backoff reconnection and REST polling fallback
- Pages: Analyze, Reports (full 7-section report), Experiments (table + compare), Models (registry + lifecycle + compare)
- API: FastAPI with async route handlers, Pydantic v2 schemas, SSE streaming via `asyncio.Queue`
- ORM: SQLAlchemy 2.0 async with `Mapped[T]` / `mapped_column` type annotations
- Auth: API key middleware (`X-API-Key` header), which protects the Groq quota
- Migrations: Alembic with 8 versioned migrations (`0001` to `0008`)
- Graph type: `StateGraph` with typed `PipelineState` TypedDict shared across all nodes
- Two topologies: Analysis pipeline (8 nodes) and Execution pipeline (adds training/eval/registration nodes)
- SSE events: Each node publishes `agent_started`, `agent_decision`, `agent_completed` events through an in-memory queue
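The per-node event flow described above can be sketched with a plain `asyncio.Queue`. This is a minimal illustration, not the platform's actual code: `format_sse`, `run_node`, `drain`, and the payload fields are hypothetical names, and a `None` sentinel is assumed as the end-of-stream marker. Only the three event names come from the description above.

```python
import asyncio
import json


def format_sse(event: str, data: dict) -> str:
    """Serialise one message as a Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def run_node(name: str, queue: asyncio.Queue) -> None:
    """Hypothetical node wrapper: publish lifecycle events around the work."""
    await queue.put(format_sse("agent_started", {"agent": name}))
    # ... the node's real work would happen here ...
    await queue.put(format_sse("agent_decision", {"agent": name, "decision": "example"}))
    await queue.put(format_sse("agent_completed", {"agent": name}))
    await queue.put(None)  # assumed sentinel: tells the consumer to stop


async def drain(queue: asyncio.Queue) -> list:
    """Consumer side: collect frames until the sentinel arrives."""
    frames = []
    while (frame := await queue.get()) is not None:
        frames.append(frame)
    return frames


async def demo() -> list:
    queue = asyncio.Queue()
    producer = asyncio.create_task(run_node("project_analysis", queue))
    frames = await drain(queue)
    await producer
    return frames
```

In an HTTP handler, the consumer loop would typically yield each frame to a streaming response instead of collecting them in a list.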
- JSONB columns: All 7 agent outputs stored as JSONB, giving a flexible schema without a migration per agent change
- Core tables: `projects`, `workflow_runs`, `agent_runs`, `reports`, `experiment_metrics`, `experiment_artifacts`, `registered_models`
- Indexes: 11 performance indexes added in migration 0008 covering all hot-path queries
- Sandboxing: `--security-opt no-new-privileges`, `--cap-drop ALL`, `--memory`, `--cpus`, `--network` controls
- Streaming: stdout/stderr merged and streamed line-by-line through the SSE pipeline
- Timeout: Configurable per execution, hard-kill via `docker kill` on timeout
- Model: `llama-3.3-70b-versatile` (128K context window)
- Pattern: Evidence-first. Deterministic extraction runs first; the LLM only interprets pre-computed evidence
- Reliability: Per-agent timeout, JSON parse retry with correction message, rate-limit exponential backoff
GitHub Repository URL
            │
            ▼
┌───────────────────────┐
│ Project Analysis      │  Maps repo structure, detects ML stack
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Dataset Analysis      │  Inspects files, computes drift statistics
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Training Analysis     │  Reviews training code quality
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Evaluation Analysis   │  Assesses evaluation methodology
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ MLOps Readiness       │  CI/CD, containerisation, tracking
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Model Monitoring      │  Drift scores, monitoring health
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Retraining Recommend. │  Triggers → urgency → plan
└───────────┬───────────┘
            ▼
┌───────────────────────┐
│ Report Generation     │  Weighted synthesis → 0-10 score
└───────────┬───────────┘
            ▼
Final Report + Score
The execution pipeline extends this with three additional nodes between Training Analysis and Evaluation Analysis: training_execution → evaluation_execution → model_registration.
This is the platform's strongest technical differentiator.
Most LLM agents ask the model to both discover facts and interpret them. This conflates two tasks and produces unreliable, hallucinated assessments.
The Autonomous MLOps Platform separates these concerns completely:
Step 1: Deterministic Extraction (no LLM)
─────────────────────────────────────────
Read files → compute statistics → extract signals
Examples:
• PSI score for each feature: PSI = Σ (P_ref - P_curr) × ln(P_ref / P_curr)
• KS statistic: max |F_ref(x) - F_curr(x)|
• Retraining triggers: PSI ≥ 0.10 → MEDIUM drift flagged
Step 2: Evidence Collection
───────────────────────────
Assemble a structured evidence block:
• "Feature 'age': PSI=0.18 (MEDIUM drift), KS=0.21 (MEDIUM drift)"
• "3 of 12 features exceeded PSI threshold"
• "Monitoring health: at_risk (score 4.2/10)"
Step 3: LLM Interpretation
──────────────────────────
The LLM receives only the evidence block, never raw data.
It cannot change numerical facts. It provides:
• Root cause analysis ("PSI spike on 'age' likely reflects seasonal shift")
• Urgency classification (immediate / soon / monitor / none)
• Retraining strategy (fine_tune / full_retrain / data_augmentation / etc.)
• Plain-language rationale for non-technical stakeholders
Step 4: Hard Override
─────────────────────
After the LLM response, deterministic overrides apply:
• If PSI ≥ 0.25 on any feature → urgency = "immediate" (non-negotiable)
• If monitoring health = "at_risk" → urgency ≥ "soon"
• LLM cannot soften a numerically-justified urgency call
Why this matters:
- Numerical thresholds are reproducible and auditable
- LLM adds language, reasoning, and actionability, not facts
- Reports are defensible: every score traces to a specific measurement
- Eliminates hallucinated drift scores or false urgency
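The hard-override step can be pictured as taking the stricter of two urgency levels: the one the LLM proposed and the floor the measurements require. The sketch below assumes the four urgency labels listed earlier are ordered none < monitor < soon < immediate; `apply_hard_overrides` and its parameters are hypothetical names, and only the two rules stated above are encoded (the handling of a `critical` health state is not specified in this document, so it is left out).

```python
# Assumed ordering of the documented urgency labels, weakest first.
URGENCY_ORDER = ["none", "monitor", "soon", "immediate"]


def apply_hard_overrides(llm_urgency: str, max_psi: float, health: str) -> str:
    """Return the final urgency after deterministic floors are applied.

    llm_urgency: urgency proposed by the LLM
    max_psi:     highest PSI observed on any feature
    health:      monitoring health label
    """
    floors = ["none"]
    if max_psi >= 0.25:          # documented rule: high PSI forces "immediate"
        floors.append("immediate")
    if health == "at_risk":      # documented rule: at_risk forces at least "soon"
        floors.append("soon")
    # The LLM may raise urgency above the floor but can never lower it.
    return max([llm_urgency, *floors], key=URGENCY_ORDER.index)
```

Because the result is a maximum over an ordered scale, an LLM answer of `immediate` is always kept, while an answer of `none` is lifted to whatever the thresholds demand.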
projects
├── id (UUID, PK)
├── name, repo_url, status
└── created_at, updated_at
workflow_runs
├── id (UUID, PK)
├── project_id (FK → projects)
├── status, run_mode (autonomous | execute)
├── started_at, completed_at, error_message
└── indexes: project_id, status
agent_runs
├── id (UUID, PK)
├── workflow_run_id (FK → workflow_runs)
├── agent_name (enum: 12 values)
├── status, output (JSONB), duration_ms
└── index: workflow_run_id
reports
├── id (UUID, PK)
├── workflow_run_id (FK), project_id (FK)
├── readiness_score (0-10)
├── project_profile (JSONB) ← project_analysis output
├── dataset_health (JSONB) ← dataset_analysis output
├── training_assessment (JSONB)
├── evaluation_assessment (JSONB)
├── mlops_readiness_data (JSONB)
├── monitoring_data (JSONB)
├── retraining_data (JSONB)
└── priority_actions (JSONB)
experiment_metrics
├── id, workflow_run_id, project_id
├── metric_name, value, is_final
└── source (stdout | file | mlflow)
experiment_artifacts
├── id, workflow_run_id, project_id
├── file_path, artifact_type, size_bytes
└── extension
registered_models
├── id (UUID, PK)
├── project_id, source_run_id
├── name, version, status (enum)
├── deployment_readiness_score
└── status_note, framework
When run_mode = "execute", the platform runs training inside a Docker container:
1. Clone repository to isolated workspace directory
2. Detect entrypoints
   └── train.py, src/train.py, main.py, scripts/train.sh
       (ranked by confidence score)
3. Launch Docker container
   docker run --rm \
     --memory 4g --cpus 2.0 \
     --network bridge \
     --security-opt no-new-privileges \
     --cap-drop ALL \
     --volume /workspace:/workspace \
     python:3.11-slim \
     bash -c "pip install -r requirements.txt && python train.py"
4. Stream stdout/stderr → SSE → frontend live log
5. Collect metrics (multi-source):
   • stdout scraping (loss=0.42, val_accuracy=0.91)
   • file scanning (/workspace/metrics.json, results/*.json)
   • MLflow tracking server (if present)
6. Collect artifacts:
   • model files (*.pt, *.pkl, *.h5, *.onnx, *.safetensors)
   • checkpoints (epoch_*/checkpoint_*)
   • MLflow run directories
7. Register model (if readiness score ≥ threshold)
   └── status: "candidate" in registry
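Step 3 above can be expressed as an argument list rather than a shell string, which avoids quoting problems on the host. The following is a sketch under assumptions: `SandboxConfig` and `build_docker_argv` are hypothetical names, and the defaults simply mirror the image and resource limits shown in the command above. The flags themselves are the ones listed in that command.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical settings object; defaults mirror the command shown above."""
    image: str = "python:3.11-slim"
    memory: str = "4g"
    cpus: str = "2.0"
    network: str = "bridge"


def build_docker_argv(workspace: str, command: str,
                      cfg: SandboxConfig = SandboxConfig()) -> list:
    """Assemble the `docker run` argv for one sandboxed training run."""
    return [
        "docker", "run", "--rm",
        "--memory", cfg.memory,
        "--cpus", cfg.cpus,
        "--network", cfg.network,
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--volume", f"{workspace}:/workspace",  # mount the cloned repository
        cfg.image,
        "bash", "-c", command,                  # command string runs inside the container
    ]
```

The resulting list can be handed to `subprocess.Popen` or `asyncio.create_subprocess_exec`, so that only the final `command` string is ever interpreted by a shell, and only inside the container.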
Every run produces an ExperimentRunSummary with:
- All metrics from all sources (stdout, files, MLflow)
- `is_final` flag marking the final epoch vs intermediate
- Artifact inventory with file paths, types, sizes
- Linked report with readiness score
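As an illustration of the stdout source, the sketch below pulls `name=value` pairs out of log lines and marks the last occurrence of each metric as final. The regular expression, the `scrape_metrics` name, and the "last occurrence is final" rule are assumptions made for this example; the record keys follow the `experiment_metrics` columns listed earlier.

```python
import re

# Matches pairs such as "loss=0.42" or "val_accuracy: 0.91" (assumed log format).
METRIC_RE = re.compile(
    r"\b([A-Za-z_][A-Za-z0-9_]*)\s*[=:]\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def scrape_metrics(lines):
    """Extract metric records from training log lines."""
    records = []
    for line in lines:
        for name, raw in METRIC_RE.findall(line):
            records.append({
                "metric_name": name,
                "value": float(raw),
                "is_final": False,
                "source": "stdout",
            })
    # Assumption: the last value logged for a metric is its final value.
    last_index = {}
    for i, record in enumerate(records):
        last_index[record["metric_name"]] = i
    for i in last_index.values():
        records[i]["is_final"] = True
    return records


log = [
    "epoch 1 loss=0.80, val_accuracy=0.85",
    "epoch 2 loss=0.42, val_accuracy=0.91",
]
final = {r["metric_name"]: r["value"] for r in scrape_metrics(log) if r["is_final"]}
```

A real scraper would need to cope with progress bars, timestamps, and framework-specific log formats, which this pattern does not attempt.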
Compare view: Select 2-5 runs to see a side-by-side metric table with delta columns and improvement direction indicators.
Models progress through a governed lifecycle:
candidate ──► approved ──► production
    │             │             │
    └─► rejected  └─► rejected  │
                                └─► archived
Rules:
• archived is terminal (no re-activation)
• production → candidate is blocked (must re-register)
• Invalid transitions return HTTP 422
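The lifecycle rules lend themselves to a small transition table. The sketch below encodes only the edges visible in the diagram above; treating `rejected` as having no outgoing transitions is an assumption, and `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical names. In an API layer, the exception would be the natural place to produce the HTTP 422 response.

```python
# Edges taken from the lifecycle diagram; "rejected" as a dead end is assumed.
ALLOWED_TRANSITIONS = {
    "candidate": {"approved", "rejected"},
    "approved": {"production", "rejected"},
    "production": {"archived"},
    "rejected": set(),
    "archived": set(),  # terminal: no re-activation
}


class InvalidTransition(ValueError):
    """Raised when a requested status change is not permitted."""


def transition(current: str, target: str) -> str:
    """Validate a status change and return the new status."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Keeping the rules in one dictionary makes the governed path easy to audit and to extend if further states are introduced.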
The Model Compare view loads two models, queries their source run reports for metrics, computes per-metric deltas (direction-aware: for loss, lower is better), and classifies the challenger as promote_b / reject_b / investigate.
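A direction-aware comparison might look like the sketch below. The decision rule is an assumption made for illustration (promote when nothing regresses and something improves, reject in the mirrored case, otherwise investigate), as is the name-based test for lower-is-better metrics; the document does not spell out either. `compare_models` and `is_lower_better` are hypothetical names.

```python
# Assumed heuristic: metric names containing these tokens are lower-is-better.
LOWER_IS_BETTER_TOKENS = ("loss", "error")


def is_lower_better(metric_name: str) -> bool:
    name = metric_name.lower()
    return any(token in name for token in LOWER_IS_BETTER_TOKENS)


def compare_models(metrics_a: dict, metrics_b: dict):
    """Compare challenger B against baseline A on their shared metrics."""
    deltas = {}
    for name in sorted(metrics_a.keys() & metrics_b.keys()):
        delta = metrics_b[name] - metrics_a[name]
        # Flip the sign so that a positive gain always means "B is better".
        gain = -delta if is_lower_better(name) else delta
        direction = "improved" if gain > 0 else "regressed" if gain < 0 else "unchanged"
        deltas[name] = {"delta": delta, "direction": direction}

    directions = {d["direction"] for d in deltas.values()}
    if "improved" in directions and "regressed" not in directions:
        verdict = "promote_b"
    elif "regressed" in directions and "improved" not in directions:
        verdict = "reject_b"
    else:
        verdict = "investigate"  # mixed signals, or nothing to compare
    return deltas, verdict
```

Mixed results deliberately fall through to `investigate`, so a lower loss paired with a lower accuracy is surfaced for human review rather than decided automatically.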
The model_monitoring agent computes drift across three statistical tests:
| Test | Formula | Thresholds |
|---|---|---|
| PSI (Population Stability Index) | Σ (P_ref - P_curr) × ln(P_ref / P_curr) | < 0.10 stable · ≥ 0.10 medium · ≥ 0.25 high |
| KS (Kolmogorov-Smirnov) | max \|F_ref(x) - F_curr(x)\| | < 0.15 stable · ≥ 0.15 medium · ≥ 0.30 high |
| Jensen-Shannon | (KL(P‖M) + KL(Q‖M)) / 2 | < 0.10 stable · ≥ 0.10 medium · ≥ 0.20 high |
Results are aggregated into:
- `overall_drift_score`: weighted composite (0-1)
- `health`: `healthy` / `at_risk` / `critical` classification
- `features_drifted`: count of features exceeding any threshold
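The three statistics in the table can be written in a few lines of standard-library Python. This is a sketch rather than the agent's implementation: it assumes PSI and Jensen-Shannon operate on already-binned proportions, uses natural logarithms throughout with M = (P + Q) / 2, and floors empty bins at a small `EPS` to keep the logarithm finite. The function names are hypothetical; the thresholds are the ones in the table.

```python
import math
from bisect import bisect_right

EPS = 1e-6  # assumed floor that keeps log() finite for empty bins


def psi(p_ref, p_curr):
    """Population Stability Index over matching bins of proportions."""
    total = 0.0
    for r, c in zip(p_ref, p_curr):
        r, c = max(r, EPS), max(c, EPS)
        total += (r - c) * math.log(r / c)
    return total


def ks_statistic(ref, curr):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    ref_sorted, curr_sorted = sorted(ref), sorted(curr)

    def ecdf(sample, x):
        return bisect_right(sample, x) / len(sample)

    points = ref_sorted + curr_sorted
    return max(abs(ecdf(ref_sorted, x) - ecdf(curr_sorted, x)) for x in points)


def js_divergence(p, q):
    """Jensen-Shannon divergence with M as the midpoint distribution."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2


def classify(value, medium, high):
    """Map a statistic to the stable / medium / high bands from the table."""
    if value >= high:
        return "high"
    if value >= medium:
        return "medium"
    return "stable"
```

For example, `classify(psi(ref_bins, curr_bins), 0.10, 0.25)` applies the PSI bands, while the KS and Jensen-Shannon bands use `(0.15, 0.30)` and `(0.10, 0.20)` respectively.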
Three trigger categories feed the recommendation:
- Drift triggers: PSI/KS thresholds exceeded (deterministic)
- Monitoring triggers: health `at_risk` / `critical`, monitoring score < 5 (deterministic)
- Performance triggers: metric decline vs prior runs: ≥ 5% = medium, ≥ 10% = high (deterministic)
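The performance trigger in the list above reduces to a relative-change check. The sketch below assumes the decline is measured relative to the previous run's value and that the caller knows whether the metric is lower-is-better; `performance_trigger` and the returned dictionary keys are hypothetical.

```python
def performance_trigger(metric_name, previous, current, lower_is_better=False):
    """Return a trigger dict if the metric declined by 5% or more, else None."""
    if previous == 0:
        return None  # relative change is undefined; skipped in this sketch
    change = (current - previous) / abs(previous)
    # A decline is a drop for higher-is-better metrics and a rise for losses.
    decline = change if lower_is_better else -change
    if decline >= 0.10:
        severity = "high"
    elif decline >= 0.05:
        severity = "medium"
    else:
        return None
    return {
        "category": "performance",
        "metric": metric_name,
        "decline_pct": round(decline * 100, 1),
        "severity": severity,
    }
```

Under these assumptions, accuracy falling from 0.90 to 0.85 is a decline of about 5.6% and yields a medium trigger, while a fall to 0.80 (about 11.1%) yields a high one.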
Triggers are collected, formatted as evidence, and passed to the LLM which outputs:
- `urgency`: `immediate` / `soon` / `monitor` / `none`
- `confidence`: 0-100%
- `strategy`: `fine_tune` / `full_retrain` / `data_augmentation` / `architecture_change` / `ensemble` / `none`
- `retraining_plan`: dataset recommendation, feature review list, evaluation requirements, promotion criteria
The LLM cannot reduce urgency below what the deterministic triggers mandate.
Screenshots will be added after first deployment. The UI includes:
| Page | Description |
|---|---|
| `/analyze` | Repository URL form → live agent progress stream |
| `/reports/[runId]` | 7-section report: Project / Dataset / Training / Evaluation / MLOps / Monitoring / Retraining |
| `/experiments` | Tabular run history with metrics chips; multi-select compare |
| `/experiments/[runId]` | Full metric table + artifact list for a single run |
| `/experiments/compare` | Side-by-side metric delta table across up to 5 runs |
| `/models` | Registry table with lifecycle status; multi-select compare |
| `/models/[modelId]` | Model detail with deployment assessment |
| `/models/compare` | Metric delta comparison with promote/reject recommendation |
- Python 3.11+
- Node.js 18+
- PostgreSQL 15+ (or Docker)
- Docker (for training execution)
- Groq API key (from console.groq.com)
git clone https://github.com/your-username/mlops-platform.git
cd mlops-platform
# With Docker:
docker run -d --name mlops-pg \
-e POSTGRES_USER=mlops \
-e POSTGRES_PASSWORD=mlops \
-e POSTGRES_DB=mlops \
-p 5432:5432 postgres:15-alpine
cd backend
# Install dependencies
pip install -e ".[dev]"
# Configure environment
cp .env.example .env
# Edit .env and set GROQ_API_KEY at minimum
# Run migrations
alembic upgrade head
# Start the API server
uvicorn src.api.app:app --reload --port 8000
cd frontend
# Install dependencies
npm install
# Configure environment
cp .env.local.example .env.local
# Set NEXT_PUBLIC_API_URL=http://localhost:8000
# Start development server
npm run dev
Navigate to http://localhost:3000
curl -X POST http://localhost:8000/api/v1/projects \
-H "X-API-Key: dev-key" \
-H "Content-Type: application/json" \
-d '{"repo_url": "https://github.com/owner/ml-project"}'
# Returns: {"id": "<project-id>"}
curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
-H "X-API-Key: dev-key" \
-H "Content-Type: application/json" \
-d '{"run_mode": "autonomous"}'
# Returns: {"id": "<run-id>"}
curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
-H "X-API-Key: dev-key" \
-H "Content-Type: application/json" \
-d '{"run_mode": "execute"}'
curl -N -H "X-API-Key: dev-key" \
http://localhost:8000/api/v1/runs/<run-id>/stream
cd backend
pytest tests/ -v

| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | (required) | Groq API key for LLM calls |
| `LLM_PROVIDER` | `groq` | LLM provider (currently only groq) |
| `LLM_MODEL_ID` | `llama-3.3-70b-versatile` | Model to use for all agent calls |
| `API_KEY` | `dev-key` | API key required in X-API-Key header |
| `DATABASE_URL` | `postgresql+asyncpg://mlops:mlops@localhost:5432/mlops` | PostgreSQL connection string |
| `DEBUG` | `false` | Enable debug logging |
| `WORKSPACE_BASE_DIR` | `/tmp/mlops-workspaces` | Where cloned repositories are stored |
| `TOKEN_BUDGET` | `100000` | Max tokens of file content per agent (128K context) |
| `AGENT_TIMEOUT_SECONDS` | `120` | Per-agent LLM call timeout |
| `ALLOWED_ORIGINS` | `http://localhost:3000` | Comma-separated CORS allowed origins |
| `EXECUTION_DOCKER_IMAGE` | `python:3.11-slim` | Default Docker image for training runs |
| `EXECUTION_TIMEOUT_SECONDS` | `3600` | Max duration for a training run (seconds) |
| `EXECUTION_MEMORY_LIMIT` | `4g` | Docker memory limit per training container |
| `EXECUTION_CPU_LIMIT` | `2.0` | Docker CPU limit per training container |
| `EXECUTION_NETWORK` | `bridge` | Docker network mode for training containers |
| Layer | Technology |
|---|---|
| Frontend | Next.js 15, React 19, Tailwind CSS, shadcn/ui |
| Backend | FastAPI, SQLAlchemy 2.0 async, Pydantic v2 |
| Orchestration | LangGraph (StateGraph) |
| LLM | Groq (llama-3.3-70b-versatile, 128K context) |
| Database | PostgreSQL 15 with Alembic migrations |
| Execution | Docker with security sandboxing |
| Data Analysis | pandas, numpy, scipy |
| Streaming | Server-Sent Events (SSE) |
- Multi-provider LLM: OpenAI, Anthropic, Gemini via provider abstraction (factory already in place)
- Scheduled monitoring: Cron-triggered drift analysis on deployed models
- Webhook notifications: Slack/email alerts on `retraining_recommendation.urgency = immediate`
- ONNX export validation: Automated model export and inference smoke test
- Multi-project dashboard: Portfolio view across all tracked repositories
- GitHub App integration: Auto-trigger analysis on PR merge to `main`
MIT