Duque Ortega Mutis — MLOps Engineer (Production-Focused)

I don't just deploy ML models. I diagnose why they break at 2am.

Portfolio LinkedIn YouTube Email


MLOps engineer with a production multi-cloud platform deployed from scratch (GKE + EKS, 6 K8s services, 395+ tests). 14 years running 5 businesses — managing teams, P&L, and vendor operations — now applied to building reliable ML systems.


Three production incidents — diagnosed from first principles, not guesswork

```text
 81% error rate under load  →  uvicorn --workers is an anti-pattern under K8s
                                (shared CPU budget = thrashing, not parallelism)
                                Fixed: asyncio + ThreadPoolExecutor, GIL analysis
                                Result: 81% errors → 0%, 2000m CPU → 1000m

 SHAP returning all zeros   →  TreeExplainer incompatible with StackingClassifier
                                Fixed: KernelExplainer in original feature space
                                Evaluated 4 alternatives before deciding

 HPA never scales down      →  Memory-based HPA + fixed ML footprint
                                = mathematically impossible to scale down
                                Fixed: CPU-only HPA, 3→1 pods in 8 minutes
```
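The scale-down impossibility follows directly from the HPA replica formula. A minimal sketch, with illustrative utilization numbers (the percentages and pod counts here are assumptions, not the portfolio's actual metrics):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Kubernetes HPA formula: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Memory-based HPA with a fixed ML footprint: each pod holds the model in
# memory regardless of traffic, so per-pod memory utilization never drops
# below the target even at zero load -- desired never falls below current.
fixed_model_memory_pct = 75   # hypothetical: model keeps every pod at ~75% memory
print(hpa_desired_replicas(3, fixed_model_memory_pct, 70))  # ceil(3*75/70) = 4, never scales down

# CPU-based HPA: CPU utilization tracks traffic, so it falls when load drops.
idle_cpu_pct = 10             # hypothetical idle CPU utilization
print(hpa_desired_replicas(3, idle_cpu_pct, 50))            # ceil(3*10/50) = 1, scales 3 -> 1
```

With a metric that can never fall below target, the formula can only hold or grow the replica count, which is what made the memory-based configuration unfixable by tuning.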

Flagship Open-Source — ML-MLOps-Production-Template

The patterns that cost my portfolio $200/mo and 22 ADRs to learn — packaged so other teams don't have to.

Release Anti-Patterns Agentic

Most templates give you files.
This one gives you a behavioral protocol.
 
AUTO / CONSULT / STOP — 20 operations mapped to agent modes.
STOP on production deploys cannot be bypassed by human insistence.
If env=production and audit.passed=False → DeploymentRequest refuses to construct.
 
The invariants aren't in the README. They're in the code.
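The "refuses to construct" invariant can be sketched as a constructor-time check. This is an illustrative stand-in, not the template's actual class; `AuditResult` and `InvariantViolation` are hypothetical names:

```python
from dataclasses import dataclass

class InvariantViolation(Exception):
    """Raised when an object would violate a STOP-level invariant."""

@dataclass(frozen=True)
class AuditResult:
    passed: bool

@dataclass(frozen=True)
class DeploymentRequest:
    env: str
    audit: AuditResult

    def __post_init__(self):
        # STOP invariant: a production deploy without a passing audit
        # cannot exist as an object -- there is no flag to bypass this.
        if self.env == "production" and not self.audit.passed:
            raise InvariantViolation("production deploy requires audit.passed=True")

# Constructs fine:
DeploymentRequest(env="staging", audit=AuditResult(passed=False))
DeploymentRequest(env="production", audit=AuditResult(passed=True))

# Refuses to construct:
try:
    DeploymentRequest(env="production", audit=AuditResult(passed=False))
except InvariantViolation as e:
    print("blocked:", e)
```

Because the check lives in `__post_init__`, no downstream code path can ever receive an unaudited production request, which is the sense in which the invariant is in the code rather than the README.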
| Layer | What's encoded |
| --- | --- |
| 32 anti-patterns (D-01→D-32) | Runtime · Training · Infrastructure · EDA · Security · Closed-loop monitoring |
| SLSA L2 supply chain | Gitleaks → Trivy → Syft SBOM → Cosign keyless (OIDC) → Kyverno admission |
| Closed-loop monitoring | Ground truth ingestion · Sliced performance · Champion/Challenger (McNemar + bootstrap ΔAUC) |
| Quad-IDE native | Windsurf · Claude Code · Cursor · Codex — same invariants, native config for each |
| 24 ADRs | Each decision documented with alternatives rejected and revisit triggers |
```bash
# Zero to working fraud detection service in one command
git clone https://github.com/DuqueOM/ML-MLOps-Production-Template.git
cd ML-MLOps-Production-Template && make bootstrap
```

Template repo  |  QUICK_START.md  |  24 ADRs


Production Portfolio — ML-MLOps-Portfolio

3 ML services on GKE + EKS · 18 ADRs · 395+ tests · Multi-cloud Terraform

CI codecov K8s Terraform

Portfolio Demo

Most ML portfolios show models that score well. This one shows what happens after you deploy — the incidents, the wrong decisions corrected, and 18 Architectural Decision Records documenting every trade-off with measured data.

| Project | Metric | Key Engineering Decision |
| --- | --- | --- |
| BankChurn Predictor | AUC 0.87 · 90% cov | Async inference via ThreadPoolExecutor · threshold 0.35 (30:1 cost ratio, quantified) |
| NLPInsight Analyzer | Acc 80.6% · 98% cov | Upgraded from curated dataset to 11.9K real noisy tweets — honest over impressive |
| ChicagoTaxi Pipeline | R² 0.96 · 6.3M rows | Found & fixed data leakage · temporal split · R² 0.905→0.965 |

Selected "Don't Build" decisions (often harder than building):

  • Removed CarVision: MAPE 32.9% is not defensible — ADR-009
  • Deferred Feature Store: full Feast architecture designed for when it's needed — ADR-007
  • Rejected Airflow: CronJob + GitHub Actions is sufficient for 3 models — ADR-006
  • Documented $24/mo GCP vs $145/mo AWS gap — both meet SLA, chose FinOps — ADR-016

📐 18 ADRs →  |  📋 Engineering Highlights →  |  📺 3min Demo →


Agentic Development Configuration

The portfolio includes a production-grade agentic setup (AGENTS.md + .windsurf/) that encodes 18 ADRs and 3 production incidents into the development environment itself.

```text
.windsurf/
├── rules/       7 context-aware rules (glob-triggered per file type)
├── skills/      6 operational procedures (debug, deploy-gke, deploy-aws,
│                drift-detection, model-retrain, release-checklist)
└── workflows/   6 structured prompt workflows (/incident, /retrain,
                 /release, /load-test, /new-adr, /drift-check)
```

The agent knows: never use uvicorn --workers N under K8s (ADR-014), always use KernelExplainer for SHAP with StackingClassifier (ADR-010), CPU targets are 50%/60%/60% — not 70% (ADR-001). Operational knowledge encoded, not just referenced.
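The ADR-014 fix — a single uvicorn worker with blocking inference pushed onto a thread pool — looks roughly like this sketch; the `predict` body and pool size are illustrative assumptions, not the portfolio's actual code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One process per pod; parallelism comes from the pool, not uvicorn workers,
# so the container's CPU request is shared deliberately instead of thrashed.
# max_workers is a hypothetical value, tuned against the pod's CPU limit.
POOL = ThreadPoolExecutor(max_workers=4)

def predict(features: list[float]) -> float:
    # Stand-in for a CPU-bound, GIL-releasing model call (e.g. LightGBM).
    return sum(features) / len(features)

async def handle_request(features: list[float]) -> float:
    # Offload the blocking call so the event loop keeps serving other requests.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(POOL, predict, features)

print(asyncio.run(handle_request([1.0, 2.0, 3.0])))  # 2.0
```

In a FastAPI handler the same `run_in_executor` call sits inside an `async def` endpoint; the event loop stays responsive while inference runs on a pool thread.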

AGENTS.md  |  .windsurf/


Stack

Kubernetes (GKE · EKS) Terraform GitHub Actions FastAPI MLflow Prometheus Grafana Argo Rollouts Docker PySpark LightGBM XGBoost SHAP Evidently DVC Pandera GCP AWS SageMaker Vertex AI Cosign Kyverno OpenTelemetry Python 3.11+

AWS Certified Machine Learning Engineer – Associate (MLA-C01) · TripleTen Data Science · 14 years ops → MLOps


AI Transparency

These projects use AI-assisted tools (Windsurf Cascade, Claude Code) for code generation and boilerplate. All architectural decisions, system design, trade-off analysis, and incident resolution are the author's. AI tools accelerate throughput — they don't replace engineering judgment.

The .windsurf/ and .claude/ configurations are themselves a demonstration of this philosophy: the agent is governed by documented decisions, not given free rein. That governance design is original and independently authored.


Open to MLOps · ML Platform · ML Infrastructure roles — Remote preferred — Mexico City (CST)
