README.md
+12 -12 lines changed: 12 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -95,7 +95,7 @@ The core pillars of MLOps on Azure are:
95
95
| Best Practice | Consideration |
96
96
|---|---|
97
97
|**Project structure**| Use the [Team Data Science Process (TDSP)](https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview) as a lightweight template to organize work across teams. |
98
-
|**Responsible AI documentation**| Establish a **Model Card** or **AI Use Case Description** early — this feeds directly into Responsible AI documentation later. |
98
+
|**Responsible AI documentation**| Establish a **Model Card** or **AI Use Case Description** early; this feeds directly into Responsible AI documentation later. |
99
99
|**Service Level Agreements**| Define **SLAs** for model latency, availability, and retraining frequency before any architecture decisions are made. |
100
100
101
101
## Phase 2: Data Management & Preparation
@@ -125,7 +125,7 @@ The core pillars of MLOps on Azure are:
125
125
|---|---|
126
126
|**Dataset versioning**| Register all datasets as **Azure ML Data Assets** so every training run references a versioned, traceable data snapshot. |
127
127
|**Data validation**| Apply data validation checks (schema, row counts, value ranges) as the first step of every pipeline run. Fail fast on bad data. |
128
-
|**Secure storage**| Store sensitive data in **Azure Data Lake Storage Gen2** with hierarchical namespace enabled, access controlled via Azure RBAC and ACLs — never embed credentials in code. |
128
+
|**Secure storage**| Store sensitive data in **Azure Data Lake Storage Gen2** with hierarchical namespace enabled and access controlled via Azure RBAC and ACLs. Never embed credentials in code. |
129
129
|**Data protection**| Enable **soft delete** and **versioning** on Azure Blob/ADLS to protect against accidental deletion. |
130
130
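A minimal sketch of the fail-fast validation step described in the **Data validation** row, using only the standard library. The column names, value ranges, and `MIN_ROWS` are illustrative assumptions, not taken from the README; a real pipeline would likely validate a DataFrame or a registered data asset instead of a list of dicts.

```python
from typing import Any

# Hypothetical schema: column name -> (expected type, min value, max value).
# These names and ranges are assumptions made for illustration only.
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}
MIN_ROWS = 1  # assumed minimum row count


def validate_rows(rows: list[dict[str, Any]]) -> None:
    """Raise ValueError on the first problem found (fail fast)."""
    # Row-count check
    if len(rows) < MIN_ROWS:
        raise ValueError(f"expected at least {MIN_ROWS} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Schema check: every expected column must be present
        missing = SCHEMA.keys() - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        for col, (typ, lo, hi) in SCHEMA.items():
            value = row[col]
            # Type check
            if not isinstance(value, typ):
                raise ValueError(f"row {i}: {col} has type {type(value).__name__}")
            # Value-range check
            if not lo <= value <= hi:
                raise ValueError(f"row {i}: {col}={value} outside [{lo}, {hi}]")
```

Calling `validate_rows(...)` as the first pipeline step means a bad batch stops the run before any compute is spent on training.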
131
131
## Phase 3: Model Development & Experimentation
@@ -138,7 +138,7 @@ The core pillars of MLOps on Azure are:
138
138
|**Feature selection & engineering**| Iterate on which features move the needle. |
139
139
|**Algorithm selection**| Try multiple algorithms; avoid premature commitment to a complex model. |
140
140
|**Hyperparameter tuning**| Systematic search over the parameter space. |
141
-
|**Experiment tracking**| Log every run — parameters, metrics, artifacts, and environment. |
141
+
|**Experiment tracking**| Log every run: parameters, metrics, artifacts, and environment. |
142
142
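As one possible reading of "systematic search over the parameter space", the sketch below runs an exhaustive grid search and keeps a record of every run's parameters and score. The `toy_objective` function and the `SPACE` values are hypothetical stand-ins for "train a model and return a validation metric"; they are not part of the README.

```python
import itertools


def grid_search(objective, space):
    """Evaluate every combination in `space`.

    Returns (best_params, best_score, runs), where `runs` records the
    parameters and score of every evaluation.
    """
    names = list(space)
    runs = []
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        runs.append({"params": params, "score": objective(params)})
    best = max(runs, key=lambda r: r["score"])
    return best["params"], best["score"], runs


# Hypothetical objective: higher is better, peaks at learning_rate=0.1, max_depth=4.
def toy_objective(params):
    return -(params["learning_rate"] - 0.1) ** 2 - (params["max_depth"] - 4) ** 2


# Hypothetical search space.
SPACE = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}

best_params, best_score, runs = grid_search(toy_objective, SPACE)
```

Grid search is only one option; random or Bayesian search follows the same pattern of recording every run so results stay comparable.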
143
143
| Need | Azure Offering Services & Tools|
144
144
|---|---|
@@ -166,7 +166,7 @@ The core pillars of MLOps on Azure are:
166
166
|---|---|
167
167
|**Refactor training code**| Move from notebooks into reusable Python modules/scripts ready for pipeline execution. |
168
168
|**Define an Azure ML Pipeline**| Structure the workflow into discrete, versioned steps: data prep → feature engineering → training → evaluation. |
169
-
|**Parameterize everything**| Externalize data version, hyperparameters, and compute target — nothing hardcoded in scripts. |
169
+
|**Parameterize everything**| Externalize data version, hyperparameters, and compute target; nothing should be hardcoded in scripts. |
170
170
|**Compute selection**| Choose the right cluster type for the workload: CPU for classical ML, GPU for deep learning. |
171
171
|**Distributed training**| For large models or datasets, configure multi-node training with frameworks like PyTorch DDP or Horovod. |
172
172
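A minimal sketch of the **Parameterize everything** row, assuming the training script receives its settings as command-line arguments. The flag names and default values (including the `cpu-cluster` compute name) are hypothetical, chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse training settings from the command line instead of hardcoding them."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    # Version of the registered data asset to train on (assumed to be a string label).
    parser.add_argument("--data-version", required=True)
    # Hypothetical compute cluster name.
    parser.add_argument("--compute-target", default="cpu-cluster")
    # Hyperparameters, with illustrative defaults.
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


# Example: the pipeline definition supplies the values, not the script.
args = parse_args(["--data-version", "3", "--learning-rate", "0.1"])
```

Because every value arrives from outside, the same script can serve both ad hoc experiments and scheduled retraining runs.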
@@ -237,19 +237,19 @@ The core pillars of MLOps on Azure are:
237
237
|**Package the model**| Create a scoring script (`score.py`) and register the serving environment in Azure ML. |
238
238
|**Deploy to staging**| Validate the endpoint behavior against integration tests before routing any production traffic. |
239
239
|**Blue/green or canary deployment**| Gradually shift traffic to the new model version (e.g., 10% → 50% → 100%) to minimize blast radius. |
240
-
|**Rollback plan**| Document and test the rollback procedure before going live — know the exact steps to revert traffic to the previous deployment. |
240
+
|**Rollback plan**| Document and test the rollback procedure before going live; know the exact steps to revert traffic to the previous deployment. |
241
241
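The staged traffic shift and the rollback plan from the table above can be sketched as plain control logic. This is not an Azure ML API call: `is_healthy` is a hypothetical health check (for example, error rate below a threshold), and the `blue`/`green` keys simply name the previous and new deployments.

```python
def canary_rollout(stages, is_healthy):
    """Walk through traffic stages and return the final traffic split.

    stages: increasing percentages for the new ("green") deployment,
            e.g. [10, 50, 100].
    is_healthy: hypothetical callable taking the current percentage and
                returning True if the new deployment looks healthy.
    """
    split = {"green": 0, "blue": 100}  # start with all traffic on the old deployment
    for pct in stages:
        split = {"green": pct, "blue": 100 - pct}
        if not is_healthy(pct):
            # Roll back: revert all traffic to the previous deployment.
            return {"green": 0, "blue": 100}
    return split
```

For example, `canary_rollout([10, 50, 100], lambda pct: True)` ends with all traffic on `green`, while a failed check at any stage returns the split that sends everything back to `blue`.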
242
242
> [!TIP]
243
243
> For batch workloads, **Batch Endpoints** are significantly more cost-efficient than keeping an online endpoint scaled up. They spin compute up on demand and scale to zero after the job completes.
244
244
245
245
| Best Practice | Consideration |
246
246
|---|---|
247
-
|**Managed Online Endpoints**| Use Managed Online Endpoints for real-time serving — Microsoft handles provisioning, autoscaling, certificates, and blue/green traffic splitting natively. |
247
+
|**Managed Online Endpoints**| Use Managed Online Endpoints for real-time serving; Microsoft handles provisioning, autoscaling, certificates, and blue/green traffic splitting natively. |
248
248
|**Traffic splitting**| Configure canary deployments at the endpoint level (e.g., 10% new / 90% current) before committing to full promotion. |
249
249
|**Autoscaling**| Scale based on request queue depth and CPU/GPU utilization. Set appropriate min/max instance counts to balance cost and availability. |
250
-
|**Authentication**| Protect all endpoints with **Azure AD authentication** — never expose unauthenticated endpoints in production. |
250
+
|**Authentication**| Protect all endpoints with **Azure AD authentication**; never expose unauthenticated endpoints in production. |
251
251
|**Smoke & integration tests**| Run automated tests against the staging deployment in the CD pipeline before promoting to production. |
252
-
|**Registry-based deployments**| Reference model artifacts from the **Azure ML Model Registry** by name and version — never copy files manually. |
252
+
|**Registry-based deployments**| Reference model artifacts from the **Azure ML Model Registry** by name and version; never copy files manually. |
253
253
254
254
## Phase 7: Monitoring & Observability
255
255
@@ -271,7 +271,7 @@ The core pillars of MLOps on Azure are:
271
271
|**Model Monitors**| Create Azure ML Model Monitors for every production model, scheduled daily or weekly, with alerts when drift exceeds thresholds. |
272
272
|**Application Insights instrumentation**| Instrument the scoring script to log prediction inputs, outputs, and latency for every inference request (subject to data privacy requirements). |
273
273
|**Operational alerts**| Set up Azure Monitor Alerts for P99 latency spikes, error rate increases, and endpoint availability drops. |
274
-
|**Baseline dataset**| Store a baseline dataset (training data or representative sample) at deployment time — Azure ML uses this as the reference distribution for drift calculations. |
274
+
|**Baseline dataset**| Store a baseline dataset (training data or representative sample) at deployment time; Azure ML uses this as the reference distribution for drift calculations. |
275
275
|**Ground truth collection**| Collect and store ground truth labels wherever possible to compute actual model performance metrics in production. |
276
276
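To illustrate why a baseline dataset is needed for drift calculations, the sketch below computes the Population Stability Index (PSI) between a baseline sample and a current sample of one numeric feature. PSI is used here only as a familiar example of a drift metric; the README does not say which metrics Azure ML computes, and the bin count is an assumption.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index of `current` relative to `baseline`.

    Bins are equal-width over the baseline's range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Identical distributions give a PSI of zero, and the value grows as the current data moves away from the baseline; an alert threshold would be chosen per feature.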
277
277
## Phase 8: Retraining & Continuous Improvement
@@ -291,7 +291,7 @@ The core pillars of MLOps on Azure are:
291
291
| Key Activity | Description |
292
292
|---|---|
293
293
|**Automated retraining pipeline**| The same parameterized training pipeline from Phase 4 should be fully triggerable via an event or schedule without manual intervention. |
294
-
|**Automated evaluation gate**| The retrained model must pass all evaluation thresholds from Phase 5 before being registered — fail the pipeline otherwise. |
294
+
|**Automated evaluation gate**| The retrained model must pass all evaluation thresholds from Phase 5 before being registered; fail the pipeline otherwise. |
295
295
|**Automated deployment**| A passing model version automatically updates the production endpoint with a canary rollout. |
296
296
|**Human-in-the-loop**| For high-stakes models, include a mandatory human approval step in the CD pipeline before promoting to production. |
297
297
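A minimal sketch of the automated evaluation gate, assuming the evaluation step produces a dict of metric names to values. The metric names and threshold values are hypothetical; the real thresholds would be those defined in Phase 5.

```python
import sys

# Hypothetical minimum values; real thresholds would come from Phase 5.
THRESHOLDS = {"accuracy": 0.90, "auc": 0.85}


def failed_checks(metrics, thresholds=THRESHOLDS):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(metrics):
    """Exit with a non-zero code when any threshold is not met."""
    failures = failed_checks(metrics)
    if failures:
        print("evaluation gate failed:", "; ".join(failures))
        sys.exit(1)  # a non-zero exit code fails the pipeline step
    print("evaluation gate passed")
```

Running `gate(...)` as its own step before model registration means a model that regresses never reaches the registry, so the deployment stage has nothing new to promote.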
@@ -309,7 +309,7 @@ The core pillars of MLOps on Azure are:
309
309
| Security & Access Control Practice | Consideration |
310
310
|---|---|
311
311
|**Least-privilege RBAC**| Apply minimal permissions at every layer: Azure ML Workspace, Storage Account, Key Vault, and compute. |
312
-
|**Secret management**| Store all secrets in **Azure Key Vault** — never in code, baked-in environment variables, or `terraform.tfvars` committed to source control. |
312
+
|**Secret management**| Store all secrets in **Azure Key Vault**, never in code, baked-in environment variables, or `terraform.tfvars` committed to source control. |
313
313
|**Managed Identity**| Use System-Assigned or User-Assigned Managed Identity for all Azure ML resources to eliminate credential management entirely. |
314
314
|**Private endpoints**| Enable private endpoints for the Azure ML Workspace, Storage, Key Vault, and Container Registry in production to eliminate public internet exposure. |
315
315
@@ -323,7 +323,7 @@ The core pillars of MLOps on Azure are:
323
323
| Cost Management Practice | Consideration |
324
324
|---|---|
325
325
|**Budget alerts**| Configure budget alerts in Azure Cost Management for the ML resource group to catch unexpected spend early. |
326
-
|**Scale-to-zero training**| Use compute clusters that scale to zero nodes when idle — never leave clusters running between jobs. |
326
+
|**Scale-to-zero training**| Use compute clusters that scale to zero nodes when idle; never leave clusters running between jobs. |
327
327
|**Dev instance shutdown**| Schedule automatic shutdown for compute instances used for development (e.g., nightly shutdown policy). |
328
328
|**Workspace hygiene**| Regularly review and delete unused model versions, stale datasets, and old pipeline run logs that accumulate over time. |
329
329
|**Reserved Instances**| Use Reserved Instances for stable, predictable production endpoint compute to reduce costs by up to 40%. |