Autonomous MLOps Platform

An end-to-end MLOps analysis and execution platform that uses multi-agent LLM orchestration to evaluate machine learning projects, execute training pipelines, track experiments, register models, detect drift, and generate retraining recommendations — all from a single GitHub repository URL.

Overview

Point the platform at any public ML repository. Within minutes it produces a comprehensive readiness report covering dataset quality, training methodology, evaluation rigour, MLOps maturity, model monitoring health, and a data-driven retraining recommendation — scored 0–10 with actionable remediation steps.

Beyond analysis, the platform can execute the training pipeline inside an isolated Docker container, collect metrics and artifacts, register the resulting model, and track it through a full lifecycle (candidate → approved → production → archived).

Key Features

Feature	Description
Project Analysis	Scans repo structure, detects ML frameworks, maps training and evaluation entrypoints
Dataset Analysis	Inspects CSV/Parquet/JSON datasets — schema, nulls, class balance, PSI/KS/JS drift
Training Analysis	Reviews training code quality, reproducibility, hyperparameter management
Evaluation Analysis	Assesses evaluation methodology, metric appropriateness, test-set integrity
MLOps Readiness	Checks CI/CD, containerisation, dependency pinning, experiment tracking setup
Training Execution	Runs training in an isolated Docker sandbox, streams logs, captures metrics
Evaluation Execution	Executes evaluation scripts post-training, records final performance metrics
Experiment Tracking	Persists metrics and artifacts per run; compare up to 5 runs side-by-side
Model Registry	Register, annotate, and promote models through a governed lifecycle
Deployment Readiness	Evaluates serving infrastructure, API contracts, rollback procedures
Model Monitoring	Analyses monitoring instrumentation, computes drift scores across features
Drift Detection	PSI (Population Stability Index), KS test, Jensen-Shannon divergence
Retraining Intelligence	Deterministic trigger extraction + LLM rationale → urgency, strategy, plan

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Next.js 15 Frontend                  │
│  Analyze │ Reports │ Experiments │ Models │ Compare         │
└──────────────────────────┬──────────────────────────────────┘
                           │ REST + SSE
┌──────────────────────────▼──────────────────────────────────┐
│                     FastAPI Backend                          │
│  /projects  /runs  /reports  /experiments  /models          │
└──────────────────────────┬──────────────────────────────────┘
                           │ LangGraph StateGraph
┌──────────────────────────▼──────────────────────────────────┐
│                   8-Agent Analysis Pipeline                  │
│  Project → Dataset → Training → Evaluation → MLOps →       │
│  Monitoring → Retraining → Report                           │
└──────┬──────────────────────────────────────────────────────┘
       │                                         │
┌──────▼──────┐                        ┌─────────▼──────────┐
│  PostgreSQL  │                        │   Docker Sandbox   │
│  (Alembic)  │                        │  Training/Eval Exec│
└─────────────┘                        └────────────────────┘

System Design

Frontend — Next.js 15 (App Router)

Framework: Next.js 15 with React Server Components and client components
Styling: Tailwind CSS + shadcn/ui design system
Real-time: Custom SSE hook (usePipelineStream) with exponential-backoff reconnection and REST polling fallback
Pages: Analyze, Reports (full 7-section report), Experiments (table + compare), Models (registry + lifecycle + compare)

Backend — FastAPI + SQLAlchemy 2.0

API: FastAPI with async route handlers, Pydantic v2 schemas, SSE streaming via asyncio.Queue
ORM: SQLAlchemy 2.0 async with Mapped[T] / mapped_column type annotations
Auth: API key middleware (X-API-Key header) — protects Groq quota
Migrations: Alembic with 8 versioned migrations (0001–0008)

LangGraph — Agent Orchestration

Graph type: StateGraph with typed PipelineState TypedDict shared across all nodes
Two topologies: Analysis pipeline (8 nodes) and Execution pipeline (adds training/eval/registration nodes)
SSE events: Each node publishes agent_started, agent_decision, agent_completed events through an in-memory queue

PostgreSQL — Data Persistence

JSONB columns: All 7 agent outputs stored as JSONB — flexible schema without migrations per agent change
Core tables: projects, workflow_runs, agent_runs, reports, experiment_metrics, experiment_artifacts, registered_models
Indexes: 11 performance indexes added in migration 0008 covering all hot-path queries

Docker — Execution Isolation

Sandboxing: --security-opt no-new-privileges, --cap-drop ALL, --memory, --cpus, --network controls
Streaming: stdout/stderr merged and streamed line-by-line through the SSE pipeline
Timeout: Configurable per execution, hard-kill via docker kill on timeout

Groq — LLM Provider

Model: llama-3.3-70b-versatile (128K context window)
Pattern: Evidence-first — deterministic extraction runs first, LLM only interprets pre-computed evidence
Reliability: Per-agent timeout, JSON parse retry with correction message, rate-limit exponential backoff

Agent Workflow

GitHub Repository URL
        │
        ▼
┌───────────────────┐
│  Project Analysis  │  Maps repo structure, detects ML stack
└────────┬──────────┘
         ▼
┌───────────────────┐
│ Dataset Analysis  │  Inspects files, computes drift statistics
└────────┬──────────┘
         ▼
┌───────────────────┐
│ Training Analysis │  Reviews training code quality
└────────┬──────────┘
         ▼
┌───────────────────┐
│Evaluation Analysis│  Assesses evaluation methodology
└────────┬──────────┘
         ▼
┌───────────────────┐
│  MLOps Readiness  │  CI/CD, containerisation, tracking
└────────┬──────────┘
         ▼
┌───────────────────┐
│ Model Monitoring  │  Drift scores, monitoring health
└────────┬──────────┘
         ▼
┌────────────────────────┐
│ Retraining Recommend.  │  Triggers → urgency → plan
└────────┬───────────────┘
         ▼
┌───────────────────┐
│ Report Generation │  Weighted synthesis → 0–10 score
└────────┬──────────┘
         ▼
   Final Report + Score

Execution pipeline extends this with three additional nodes between Training Analysis and Evaluation Analysis: training_execution → evaluation_execution → model_registration.

Evidence-First Agent Architecture

This is the platform's strongest technical differentiator.

Most LLM agents ask the model to both discover facts and interpret them. This conflates two tasks and produces unreliable, hallucinated assessments.

The Autonomous MLOps Platform separates these concerns completely:

Step 1: Deterministic Extraction (no LLM)
──────────────────────────────────────────
  Read files → compute statistics → extract signals
  Examples:
  • PSI score for each feature: PSI = Σ (P_ref - P_curr) × ln(P_ref / P_curr)
  • KS statistic: max |F_ref(x) - F_curr(x)|
  • Retraining triggers: PSI ≥ 0.10 → MEDIUM drift flagged

Step 2: Evidence Collection
────────────────────────────
  Assemble a structured evidence block:
  • "Feature 'age': PSI=0.18 (MEDIUM drift), KS=0.21 (MEDIUM drift)"
  • "3 of 12 features exceeded PSI threshold"
  • "Monitoring health: at_risk (score 4.2/10)"

Step 3: LLM Interpretation
───────────────────────────
  The LLM receives only the evidence block — never raw data.
  It cannot change numerical facts. It provides:
  • Root cause analysis ("PSI spike on 'age' likely reflects seasonal shift")
  • Urgency classification (immediate / soon / monitor / none)
  • Retraining strategy (fine_tune / full_retrain / data_augmentation / etc.)
  • Plain-language rationale for non-technical stakeholders

Step 4: Hard Override
──────────────────────
  After LLM response, deterministic overrides apply:
  • If PSI ≥ 0.25 on any feature → urgency = "immediate" (non-negotiable)
  • If monitoring health = "at_risk" → urgency ≥ "soon"
  • LLM cannot soften a numerically-justified urgency call

Why this matters:

Numerical thresholds are reproducible and auditable
LLM adds language, reasoning, and actionability — not facts
Reports are defensible: every score traces to a specific measurement
Eliminates hallucinated drift scores or false urgency

Database Design

projects
├── id (UUID, PK)
├── name, repo_url, status
└── created_at, updated_at

workflow_runs
├── id (UUID, PK)
├── project_id (FK → projects)
├── status, run_mode (autonomous | execute)
├── started_at, completed_at, error_message
└── ── indexes: project_id, status

agent_runs
├── id (UUID, PK)
├── workflow_run_id (FK → workflow_runs)
├── agent_name (enum: 12 values)
├── status, output (JSONB), duration_ms
└── ── index: workflow_run_id

reports
├── id (UUID, PK)
├── workflow_run_id (FK), project_id (FK)
├── readiness_score (0–10)
├── project_profile (JSONB)    ← project_analysis output
├── dataset_health (JSONB)     ← dataset_analysis output
├── training_assessment (JSONB)
├── evaluation_assessment (JSONB)
├── mlops_readiness_data (JSONB)
├── monitoring_data (JSONB)
├── retraining_data (JSONB)
└── priority_actions (JSONB)

experiment_metrics
├── id, workflow_run_id, project_id
├── metric_name, value, is_final
└── source (stdout | file | mlflow)

experiment_artifacts
├── id, workflow_run_id, project_id
├── file_path, artifact_type, size_bytes
└── extension

registered_models
├── id (UUID, PK)
├── project_id, source_run_id
├── name, version, status (enum)
├── deployment_readiness_score
└── status_note, framework

Execution Pipeline

When run_mode = "execute", the platform runs training inside a Docker container:

1. Clone repository to isolated workspace directory

2. Detect entrypoints
   └── train.py, src/train.py, main.py, scripts/train.sh
       (ranked by confidence score)

3. Launch Docker container
   docker run --rm \
     --memory 4g --cpus 2.0 \
     --network bridge \
     --security-opt no-new-privileges \
     --cap-drop ALL \
     --volume /workspace:/workspace \
     python:3.11-slim \
     bash -c "pip install -r requirements.txt && python train.py"

4. Stream stdout/stderr → SSE → frontend live log

5. Collect metrics (multi-source):
   • stdout scraping (loss=0.42, val_accuracy=0.91)
   • file scanning (/workspace/metrics.json, results/*.json)
   • MLflow tracking server (if present)

6. Collect artifacts:
   • model files (*.pt, *.pkl, *.h5, *.onnx, *.safetensors)
   • checkpoints (epoch_*/checkpoint_*)
   • MLflow run directories

7. Register model (if readiness score ≥ threshold)
   └── status: "candidate" in registry

Experiment Tracking

Every run produces an ExperimentRunSummary with:

All metrics from all sources (stdout, files, MLflow)
is_final flag marking the final epoch vs intermediate
Artifact inventory with file paths, types, sizes
Linked report with readiness score

Compare view: Select 2–5 runs → side-by-side metric table with delta columns and improvement direction indicators.

Model Registry

Models progress through a governed lifecycle:

candidate ──→ approved ──→ production
    │              │             │
    └──→ rejected  └──→ rejected │
                                 └──→ archived

Rules:
• archived is terminal (no re-activation)
• production → candidate is blocked (must re-register)
• Invalid transitions return HTTP 422

The Model Compare view loads two models, queries their source run reports for metrics, computes per-metric deltas (with direction awareness — loss is lower-better), and classifies the challenger as promote_b / reject_b / investigate.

Monitoring & Drift Detection

The model_monitoring agent computes drift across three statistical tests:

Test	Formula	Thresholds
PSI (Population Stability Index)	`Σ (P_ref - P_curr) × ln(P_ref / P_curr)`	< 0.10 stable · ≥ 0.10 medium · ≥ 0.25 high
KS (Kolmogorov-Smirnov)	`max │F_ref(x) - F_curr(x)│`	< 0.15 stable · ≥ 0.15 medium · ≥ 0.30 high
Jensen-Shannon	`(KL(P‖M) + KL(Q‖M)) / 2`	< 0.10 stable · ≥ 0.10 medium · ≥ 0.20 high

Results are aggregated into:

overall_drift_score — weighted composite (0–1)
health — healthy / at_risk / critical classification
features_drifted — count of features exceeding any threshold

Retraining Intelligence

Three trigger categories feed the recommendation:

Drift triggers — PSI/KS thresholds exceeded (deterministic)
Monitoring triggers — health at_risk/critical, monitoring score < 5 (deterministic)
Performance triggers — metric decline vs prior runs: ≥ 5% = medium, ≥ 10% = high (deterministic)

Triggers are collected, formatted as evidence, and passed to the LLM which outputs:

urgency: immediate / soon / monitor / none
confidence: 0–100%
strategy: fine_tune / full_retrain / data_augmentation / architecture_change / ensemble / none
retraining_plan: dataset recommendation, feature review list, evaluation requirements, promotion criteria

The LLM cannot reduce urgency below what the deterministic triggers mandate.

Screenshots

Screenshots will be added after first deployment. The UI includes:

Page	Description
`/analyze`	Repository URL form → live agent progress stream
`/reports/[runId]`	7-section report: Project / Dataset / Training / Evaluation / MLOps / Monitoring / Retraining
`/experiments`	Tabular run history with metrics chips; multi-select compare
`/experiments/[runId]`	Full metric table + artifact list for a single run
`/experiments/compare`	Side-by-side metric delta table across up to 5 runs
`/models`	Registry table with lifecycle status; multi-select compare
`/models/[modelId]`	Model detail with deployment assessment
`/models/compare`	Metric delta comparison with promote/reject recommendation

Local Setup

Prerequisites

Python 3.11+
Node.js 18+
PostgreSQL 15+ (or Docker)
Docker (for training execution)
Groq API key — console.groq.com

1. Clone the repository

git clone https://github.com/your-username/mlops-platform.git
cd mlops-platform

2. Start PostgreSQL

# With Docker:
docker run -d --name mlops-pg \
  -e POSTGRES_USER=mlops \
  -e POSTGRES_PASSWORD=mlops \
  -e POSTGRES_DB=mlops \
  -p 5432:5432 postgres:15-alpine

3. Backend setup

cd backend

# Install dependencies
pip install -e ".[dev]"

# Configure environment
cp .env.example .env
# Edit .env — set GROQ_API_KEY at minimum

# Run migrations
alembic upgrade head

# Start the API server
uvicorn src.api.app:app --reload --port 8000

4. Frontend setup

cd frontend

# Install dependencies
npm install

# Configure environment
cp .env.local.example .env.local
# Set NEXT_PUBLIC_API_URL=http://localhost:8000

# Start development server
npm run dev

5. Open the platform

Navigate to http://localhost:3000

Running the Platform

Analyze a repository (autonomous mode)

curl -X POST http://localhost:8000/api/v1/projects \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"repo_url": "https://github.com/owner/ml-project"}'
# Returns: {"id": "<project-id>"}

curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"run_mode": "autonomous"}'
# Returns: {"id": "<run-id>"}

Execute training pipeline

curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"run_mode": "execute"}'

Stream live progress

curl -N -H "X-API-Key: dev-key" \
  http://localhost:8000/api/v1/runs/<run-id>/stream

Run tests

cd backend
pytest tests/ -v

Environment Variables

Variable	Default	Description
`GROQ_API_KEY`	(required)	Groq API key for LLM calls
`LLM_PROVIDER`	`groq`	LLM provider (currently only `groq`)
`LLM_MODEL_ID`	`llama-3.3-70b-versatile`	Model to use for all agent calls
`API_KEY`	`dev-key`	API key required in `X-API-Key` header
`DATABASE_URL`	`postgresql+asyncpg://mlops:mlops@localhost:5432/mlops`	PostgreSQL connection string
`DEBUG`	`false`	Enable debug logging
`WORKSPACE_BASE_DIR`	`/tmp/mlops-workspaces`	Where cloned repositories are stored
`TOKEN_BUDGET`	`100000`	Max tokens of file content per agent (128K context)
`AGENT_TIMEOUT_SECONDS`	`120`	Per-agent LLM call timeout
`ALLOWED_ORIGINS`	`http://localhost:3000`	Comma-separated CORS allowed origins
`EXECUTION_DOCKER_IMAGE`	`python:3.11-slim`	Default Docker image for training runs
`EXECUTION_TIMEOUT_SECONDS`	`3600`	Max duration for a training run (seconds)
`EXECUTION_MEMORY_LIMIT`	`4g`	Docker memory limit per training container
`EXECUTION_CPU_LIMIT`	`2.0`	Docker CPU limit per training container
`EXECUTION_NETWORK`	`bridge`	Docker network mode for training containers

Tech Stack

Layer	Technology
Frontend	Next.js 15, React 19, Tailwind CSS, shadcn/ui
Backend	FastAPI, SQLAlchemy 2.0 async, Pydantic v2
Orchestration	LangGraph (StateGraph)
LLM	Groq (llama-3.3-70b-versatile, 128K context)
Database	PostgreSQL 15 with Alembic migrations
Execution	Docker with security sandboxing
Data Analysis	pandas, numpy, scipy
Streaming	Server-Sent Events (SSE)

Future Roadmap

Multi-provider LLM — OpenAI, Anthropic, Gemini via provider abstraction (factory already in place)
Scheduled monitoring — Cron-triggered drift analysis on deployed models
Webhook notifications — Slack/email alerts on retraining_recommendation.urgency = immediate
ONNX export validation — Automated model export and inference smoke test
Multi-project dashboard — Portfolio view across all tracked repositories
GitHub App integration — Auto-trigger analysis on PR merge to main

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
backend		backend
docs		docs
frontend		frontend
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Autonomous MLOps Platform

Overview

Key Features

Architecture

System Design

Frontend — Next.js 15 (App Router)

Backend — FastAPI + SQLAlchemy 2.0

LangGraph — Agent Orchestration

PostgreSQL — Data Persistence

Docker — Execution Isolation

Groq — LLM Provider

Agent Workflow

Evidence-First Agent Architecture

Database Design

Execution Pipeline

Experiment Tracking

Model Registry

Monitoring & Drift Detection

Retraining Intelligence

Screenshots

Local Setup

Prerequisites

1. Clone the repository

2. Start PostgreSQL

3. Backend setup

4. Frontend setup

5. Open the platform

Running the Platform

Analyze a repository (autonomous mode)

Execute training pipeline

Stream live progress

Run tests

Environment Variables

Tech Stack

Future Roadmap

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages