Skip to content

ssid18/AXIOM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Autonomous MLOps Platform

An end-to-end MLOps analysis and execution platform that uses multi-agent LLM orchestration to evaluate machine learning projects, execute training pipelines, track experiments, register models, detect drift, and generate retraining recommendations β€” all from a single GitHub repository URL.


Overview

Point the platform at any public ML repository. Within minutes it produces a comprehensive readiness report covering dataset quality, training methodology, evaluation rigour, MLOps maturity, model monitoring health, and a data-driven retraining recommendation β€” scored 0–10 with actionable remediation steps.

Beyond analysis, the platform can execute the training pipeline inside an isolated Docker container, collect metrics and artifacts, register the resulting model, and track it through a full lifecycle (candidate β†’ approved β†’ production β†’ archived).


Key Features

Feature Description
Project Analysis Scans repo structure, detects ML frameworks, maps training and evaluation entrypoints
Dataset Analysis Inspects CSV/Parquet/JSON datasets β€” schema, nulls, class balance, PSI/KS/JS drift
Training Analysis Reviews training code quality, reproducibility, hyperparameter management
Evaluation Analysis Assesses evaluation methodology, metric appropriateness, test-set integrity
MLOps Readiness Checks CI/CD, containerisation, dependency pinning, experiment tracking setup
Training Execution Runs training in an isolated Docker sandbox, streams logs, captures metrics
Evaluation Execution Executes evaluation scripts post-training, records final performance metrics
Experiment Tracking Persists metrics and artifacts per run; compare up to 5 runs side-by-side
Model Registry Register, annotate, and promote models through a governed lifecycle
Deployment Readiness Evaluates serving infrastructure, API contracts, rollback procedures
Model Monitoring Analyses monitoring instrumentation, computes drift scores across features
Drift Detection PSI (Population Stability Index), KS test, Jensen-Shannon divergence
Retraining Intelligence Deterministic trigger extraction + LLM rationale β†’ urgency, strategy, plan

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Next.js 15 Frontend                  β”‚
β”‚  Analyze β”‚ Reports β”‚ Experiments β”‚ Models β”‚ Compare         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚ REST + SSE
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     FastAPI Backend                          β”‚
β”‚  /projects  /runs  /reports  /experiments  /models          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚ LangGraph StateGraph
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   8-Agent Analysis Pipeline                  β”‚
β”‚  Project β†’ Dataset β†’ Training β†’ Evaluation β†’ MLOps β†’       β”‚
β”‚  Monitoring β†’ Retraining β†’ Report                           β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  PostgreSQL  β”‚                        β”‚   Docker Sandbox   β”‚
β”‚  (Alembic)  β”‚                        β”‚  Training/Eval Execβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

System Design

Frontend β€” Next.js 15 (App Router)

  • Framework: Next.js 15 with React Server Components and client components
  • Styling: Tailwind CSS + shadcn/ui design system
  • Real-time: Custom SSE hook (usePipelineStream) with exponential-backoff reconnection and REST polling fallback
  • Pages: Analyze, Reports (full 7-section report), Experiments (table + compare), Models (registry + lifecycle + compare)

Backend β€” FastAPI + SQLAlchemy 2.0

  • API: FastAPI with async route handlers, Pydantic v2 schemas, SSE streaming via asyncio.Queue
  • ORM: SQLAlchemy 2.0 async with Mapped[T] / mapped_column type annotations
  • Auth: API key middleware (X-API-Key header) β€” protects Groq quota
  • Migrations: Alembic with 8 versioned migrations (0001–0008)

LangGraph β€” Agent Orchestration

  • Graph type: StateGraph with typed PipelineState TypedDict shared across all nodes
  • Two topologies: Analysis pipeline (8 nodes) and Execution pipeline (adds training/eval/registration nodes)
  • SSE events: Each node publishes agent_started, agent_decision, agent_completed events through an in-memory queue

PostgreSQL β€” Data Persistence

  • JSONB columns: All 7 agent outputs stored as JSONB β€” flexible schema without migrations per agent change
  • Core tables: projects, workflow_runs, agent_runs, reports, experiment_metrics, experiment_artifacts, registered_models
  • Indexes: 11 performance indexes added in migration 0008 covering all hot-path queries

Docker β€” Execution Isolation

  • Sandboxing: --security-opt no-new-privileges, --cap-drop ALL, --memory, --cpus, --network controls
  • Streaming: stdout/stderr merged and streamed line-by-line through the SSE pipeline
  • Timeout: Configurable per execution, hard-kill via docker kill on timeout

Groq β€” LLM Provider

  • Model: llama-3.3-70b-versatile (128K context window)
  • Pattern: Evidence-first β€” deterministic extraction runs first, LLM only interprets pre-computed evidence
  • Reliability: Per-agent timeout, JSON parse retry with correction message, rate-limit exponential backoff

Agent Workflow

GitHub Repository URL
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Project Analysis  β”‚  Maps repo structure, detects ML stack
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Dataset Analysis  β”‚  Inspects files, computes drift statistics
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Training Analysis β”‚  Reviews training code quality
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Evaluation Analysisβ”‚  Assesses evaluation methodology
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  MLOps Readiness  β”‚  CI/CD, containerisation, tracking
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Model Monitoring  β”‚  Drift scores, monitoring health
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Retraining Recommend.  β”‚  Triggers β†’ urgency β†’ plan
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Report Generation β”‚  Weighted synthesis β†’ 0–10 score
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
   Final Report + Score

Execution pipeline extends this with three additional nodes between Training Analysis and Evaluation Analysis: training_execution β†’ evaluation_execution β†’ model_registration.


Evidence-First Agent Architecture

This is the platform's strongest technical differentiator.

Most LLM agents ask the model to both discover facts and interpret them. This conflates two tasks and produces unreliable, hallucinated assessments.

The Autonomous MLOps Platform separates these concerns completely:

Step 1: Deterministic Extraction (no LLM)
──────────────────────────────────────────
  Read files β†’ compute statistics β†’ extract signals
  Examples:
  β€’ PSI score for each feature: PSI = Ξ£ (P_ref - P_curr) Γ— ln(P_ref / P_curr)
  β€’ KS statistic: max |F_ref(x) - F_curr(x)|
  β€’ Retraining triggers: PSI β‰₯ 0.10 β†’ MEDIUM drift flagged

Step 2: Evidence Collection
────────────────────────────
  Assemble a structured evidence block:
  β€’ "Feature 'age': PSI=0.18 (MEDIUM drift), KS=0.21 (MEDIUM drift)"
  β€’ "3 of 12 features exceeded PSI threshold"
  β€’ "Monitoring health: at_risk (score 4.2/10)"

Step 3: LLM Interpretation
───────────────────────────
  The LLM receives only the evidence block β€” never raw data.
  It cannot change numerical facts. It provides:
  β€’ Root cause analysis ("PSI spike on 'age' likely reflects seasonal shift")
  β€’ Urgency classification (immediate / soon / monitor / none)
  β€’ Retraining strategy (fine_tune / full_retrain / data_augmentation / etc.)
  β€’ Plain-language rationale for non-technical stakeholders

Step 4: Hard Override
──────────────────────
  After LLM response, deterministic overrides apply:
  β€’ If PSI β‰₯ 0.25 on any feature β†’ urgency = "immediate" (non-negotiable)
  β€’ If monitoring health = "at_risk" β†’ urgency β‰₯ "soon"
  β€’ LLM cannot soften a numerically-justified urgency call

Why this matters:

  • Numerical thresholds are reproducible and auditable
  • LLM adds language, reasoning, and actionability β€” not facts
  • Reports are defensible: every score traces to a specific measurement
  • Eliminates hallucinated drift scores or false urgency

Database Design

projects
β”œβ”€β”€ id (UUID, PK)
β”œβ”€β”€ name, repo_url, status
└── created_at, updated_at

workflow_runs
β”œβ”€β”€ id (UUID, PK)
β”œβ”€β”€ project_id (FK β†’ projects)
β”œβ”€β”€ status, run_mode (autonomous | execute)
β”œβ”€β”€ started_at, completed_at, error_message
└── ── indexes: project_id, status

agent_runs
β”œβ”€β”€ id (UUID, PK)
β”œβ”€β”€ workflow_run_id (FK β†’ workflow_runs)
β”œβ”€β”€ agent_name (enum: 12 values)
β”œβ”€β”€ status, output (JSONB), duration_ms
└── ── index: workflow_run_id

reports
β”œβ”€β”€ id (UUID, PK)
β”œβ”€β”€ workflow_run_id (FK), project_id (FK)
β”œβ”€β”€ readiness_score (0–10)
β”œβ”€β”€ project_profile (JSONB)    ← project_analysis output
β”œβ”€β”€ dataset_health (JSONB)     ← dataset_analysis output
β”œβ”€β”€ training_assessment (JSONB)
β”œβ”€β”€ evaluation_assessment (JSONB)
β”œβ”€β”€ mlops_readiness_data (JSONB)
β”œβ”€β”€ monitoring_data (JSONB)
β”œβ”€β”€ retraining_data (JSONB)
└── priority_actions (JSONB)

experiment_metrics
β”œβ”€β”€ id, workflow_run_id, project_id
β”œβ”€β”€ metric_name, value, is_final
└── source (stdout | file | mlflow)

experiment_artifacts
β”œβ”€β”€ id, workflow_run_id, project_id
β”œβ”€β”€ file_path, artifact_type, size_bytes
└── extension

registered_models
β”œβ”€β”€ id (UUID, PK)
β”œβ”€β”€ project_id, source_run_id
β”œβ”€β”€ name, version, status (enum)
β”œβ”€β”€ deployment_readiness_score
└── status_note, framework

Execution Pipeline

When run_mode = "execute", the platform runs training inside a Docker container:

1. Clone repository to isolated workspace directory

2. Detect entrypoints
   └── train.py, src/train.py, main.py, scripts/train.sh
       (ranked by confidence score)

3. Launch Docker container
   docker run --rm \
     --memory 4g --cpus 2.0 \
     --network bridge \
     --security-opt no-new-privileges \
     --cap-drop ALL \
     --volume /workspace:/workspace \
     python:3.11-slim \
     bash -c "pip install -r requirements.txt && python train.py"

4. Stream stdout/stderr β†’ SSE β†’ frontend live log

5. Collect metrics (multi-source):
   β€’ stdout scraping (loss=0.42, val_accuracy=0.91)
   β€’ file scanning (/workspace/metrics.json, results/*.json)
   β€’ MLflow tracking server (if present)

6. Collect artifacts:
   β€’ model files (*.pt, *.pkl, *.h5, *.onnx, *.safetensors)
   β€’ checkpoints (epoch_*/checkpoint_*)
   β€’ MLflow run directories

7. Register model (if readiness score β‰₯ threshold)
   └── status: "candidate" in registry

Experiment Tracking

Every run produces an ExperimentRunSummary with:

  • All metrics from all sources (stdout, files, MLflow)
  • is_final flag marking the final epoch vs intermediate
  • Artifact inventory with file paths, types, sizes
  • Linked report with readiness score

Compare view: Select 2–5 runs β†’ side-by-side metric table with delta columns and improvement direction indicators.


Model Registry

Models progress through a governed lifecycle:

candidate ──→ approved ──→ production
    β”‚              β”‚             β”‚
    └──→ rejected  └──→ rejected β”‚
                                 └──→ archived

Rules:
β€’ archived is terminal (no re-activation)
β€’ production β†’ candidate is blocked (must re-register)
β€’ Invalid transitions return HTTP 422

The Model Compare view loads two models, queries their source run reports for metrics, computes per-metric deltas (with direction awareness β€” loss is lower-better), and classifies the challenger as promote_b / reject_b / investigate.


Monitoring & Drift Detection

The model_monitoring agent computes drift across three statistical tests:

Test Formula Thresholds
PSI (Population Stability Index) Ξ£ (P_ref - P_curr) Γ— ln(P_ref / P_curr) < 0.10 stable Β· β‰₯ 0.10 medium Β· β‰₯ 0.25 high
KS (Kolmogorov-Smirnov) max β”‚F_ref(x) - F_curr(x)β”‚ < 0.15 stable Β· β‰₯ 0.15 medium Β· β‰₯ 0.30 high
Jensen-Shannon (KL(Pβ€–M) + KL(Qβ€–M)) / 2 < 0.10 stable Β· β‰₯ 0.10 medium Β· β‰₯ 0.20 high

Results are aggregated into:

  • overall_drift_score β€” weighted composite (0–1)
  • health β€” healthy / at_risk / critical classification
  • features_drifted β€” count of features exceeding any threshold

Retraining Intelligence

Three trigger categories feed the recommendation:

  1. Drift triggers β€” PSI/KS thresholds exceeded (deterministic)
  2. Monitoring triggers β€” health at_risk/critical, monitoring score < 5 (deterministic)
  3. Performance triggers β€” metric decline vs prior runs: β‰₯ 5% = medium, β‰₯ 10% = high (deterministic)

Triggers are collected, formatted as evidence, and passed to the LLM which outputs:

  • urgency: immediate / soon / monitor / none
  • confidence: 0–100%
  • strategy: fine_tune / full_retrain / data_augmentation / architecture_change / ensemble / none
  • retraining_plan: dataset recommendation, feature review list, evaluation requirements, promotion criteria

The LLM cannot reduce urgency below what the deterministic triggers mandate.


Screenshots

Screenshots will be added after first deployment. The UI includes:

Page Description
/analyze Repository URL form β†’ live agent progress stream
/reports/[runId] 7-section report: Project / Dataset / Training / Evaluation / MLOps / Monitoring / Retraining
/experiments Tabular run history with metrics chips; multi-select compare
/experiments/[runId] Full metric table + artifact list for a single run
/experiments/compare Side-by-side metric delta table across up to 5 runs
/models Registry table with lifecycle status; multi-select compare
/models/[modelId] Model detail with deployment assessment
/models/compare Metric delta comparison with promote/reject recommendation

Local Setup

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • PostgreSQL 15+ (or Docker)
  • Docker (for training execution)
  • Groq API key β€” console.groq.com

1. Clone the repository

git clone https://github.com/your-username/mlops-platform.git
cd mlops-platform

2. Start PostgreSQL

# With Docker:
docker run -d --name mlops-pg \
  -e POSTGRES_USER=mlops \
  -e POSTGRES_PASSWORD=mlops \
  -e POSTGRES_DB=mlops \
  -p 5432:5432 postgres:15-alpine

3. Backend setup

cd backend

# Install dependencies
pip install -e ".[dev]"

# Configure environment
cp .env.example .env
# Edit .env β€” set GROQ_API_KEY at minimum

# Run migrations
alembic upgrade head

# Start the API server
uvicorn src.api.app:app --reload --port 8000

4. Frontend setup

cd frontend

# Install dependencies
npm install

# Configure environment
cp .env.local.example .env.local
# Set NEXT_PUBLIC_API_URL=http://localhost:8000

# Start development server
npm run dev

5. Open the platform

Navigate to http://localhost:3000


Running the Platform

Analyze a repository (autonomous mode)

curl -X POST http://localhost:8000/api/v1/projects \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"repo_url": "https://github.com/owner/ml-project"}'
# Returns: {"id": "<project-id>"}

curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"run_mode": "autonomous"}'
# Returns: {"id": "<run-id>"}

Execute training pipeline

curl -X POST http://localhost:8000/api/v1/projects/<project-id>/runs \
  -H "X-API-Key: dev-key" \
  -H "Content-Type: application/json" \
  -d '{"run_mode": "execute"}'

Stream live progress

curl -N -H "X-API-Key: dev-key" \
  http://localhost:8000/api/v1/runs/<run-id>/stream

Run tests

cd backend
pytest tests/ -v

Environment Variables

Variable Default Description
GROQ_API_KEY (required) Groq API key for LLM calls
LLM_PROVIDER groq LLM provider (currently only groq)
LLM_MODEL_ID llama-3.3-70b-versatile Model to use for all agent calls
API_KEY dev-key API key required in X-API-Key header
DATABASE_URL postgresql+asyncpg://mlops:mlops@localhost:5432/mlops PostgreSQL connection string
DEBUG false Enable debug logging
WORKSPACE_BASE_DIR /tmp/mlops-workspaces Where cloned repositories are stored
TOKEN_BUDGET 100000 Max tokens of file content per agent (128K context)
AGENT_TIMEOUT_SECONDS 120 Per-agent LLM call timeout
ALLOWED_ORIGINS http://localhost:3000 Comma-separated CORS allowed origins
EXECUTION_DOCKER_IMAGE python:3.11-slim Default Docker image for training runs
EXECUTION_TIMEOUT_SECONDS 3600 Max duration for a training run (seconds)
EXECUTION_MEMORY_LIMIT 4g Docker memory limit per training container
EXECUTION_CPU_LIMIT 2.0 Docker CPU limit per training container
EXECUTION_NETWORK bridge Docker network mode for training containers

Tech Stack

Layer Technology
Frontend Next.js 15, React 19, Tailwind CSS, shadcn/ui
Backend FastAPI, SQLAlchemy 2.0 async, Pydantic v2
Orchestration LangGraph (StateGraph)
LLM Groq (llama-3.3-70b-versatile, 128K context)
Database PostgreSQL 15 with Alembic migrations
Execution Docker with security sandboxing
Data Analysis pandas, numpy, scipy
Streaming Server-Sent Events (SSE)

Future Roadmap

  • Multi-provider LLM β€” OpenAI, Anthropic, Gemini via provider abstraction (factory already in place)
  • Scheduled monitoring β€” Cron-triggered drift analysis on deployed models
  • Webhook notifications β€” Slack/email alerts on retraining_recommendation.urgency = immediate
  • ONNX export validation β€” Automated model export and inference smoke test
  • Multi-project dashboard β€” Portfolio view across all tracked repositories
  • GitHub App integration β€” Auto-trigger analysis on PR merge to main

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors