
Commit 35ac648

Add overview of AI Foundry evaluation features
Added an overview of evaluation features in AI Foundry, including RFT Observability, Quick Evaluations, and Python Grader. Included details on evaluation targets and metric categories.
1 parent 941b797 commit 35ac648

2 files changed

Lines changed: 113 additions & 31 deletions

0_Azure/3_AzureAI/AIFoundry/demos/6_AI-Foundry_Evaluations.md

Lines changed: 0 additions & 31 deletions
This file was deleted.
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
1+
# AI Foundry LLM Evaluations <br/> How it works - Overview
2+
3+
Costa Rica
4+
5+
[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com)
6+
[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/)
7+
[brown9804](https://github.com/brown9804)
8+
9+
Last updated: 2025-11-27
10+
11+
----------
12+
13+
<details>
14+
<summary><b>List of References</b> (Click to expand)</summary>
15+
16+
17+
</details>
18+
19+
<details>
20+
<summary><b>Table of Contents</b> (Click to expand)</summary>
21+
22+
23+
</details>
24+
25+
## Overview
26+
27+
> The evaluation features in AI Foundry improve how developers assess and monitor fine-tuned models.
28+
29+
1. **RFT Observability ("Auto-Evals")**: Offers real-time visibility into Reinforcement Fine-Tuning (RFT) jobs.
30+
* **How it works**: Automatically launches a linked evaluation job when an RFT job starts. This job tracks intermediate results (prompts, responses, and grader scores) at each checkpoint.
31+
* **Benefits** (see [What’s New in Azure AI Foundry Finetuning: July 2025](https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/what%E2%80%99s-new-in-azure-ai-foundry-finetuning-july-2025/4438850)):
32+
* Enables live monitoring and debugging.
33+
* Reduces wasted compute and budget due to misconfigured graders or reward hacking.
34+
* Accessible via the “Evaluation” section on the Fine-tuning page when using RFT.
35+
2. **Quick Evaluations (Quick Evals)**: Rapidly assess model outputs from Stored Completions.
36+
* One-click evaluation without setting up a full evaluation job.
37+
* Compare outputs across multiple models instantly.
38+
* Ideal for fast iteration and spotting issues quickly.
39+
3. **Python Grader**: Custom evaluation logic using Python code.
40+
* Users write Python functions to score model outputs based on structure, content, or tool usage.
41+
* Returns a numeric score (typically 0–1).
42+
* Can be combined with other graders for holistic evaluation.
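
As an illustration of the Python Grader idea, here is a minimal sketch of a grader that scores a single-label classification output. It assumes the grader entry point is a `grade(sample, item)` function that returns a float, that the model output is available as `sample["output_text"]`, and that the dataset row carries a `reference_label` column. The column name, the label set, and the partial-credit value are hypothetical, so check the current grader documentation before relying on this shape.

```python
# Hypothetical label set, copied from the classification prompt used later in this document.
ALLOWED_LABELS = {
    "misspelling", "brand name", "heritage month", "holiday",
    "industry", "customer service information", "product use", "event",
}


def grade(sample: dict, item: dict) -> float:
    """Return a score in [0, 1] for one model output.

    Assumptions: sample["output_text"] holds the model output and
    item["reference_label"] (a hypothetical column name) holds the expected label.
    """
    predicted = str(sample.get("output_text", "")).strip().lower()
    expected = str(item.get("reference_label", "")).strip().lower()

    if predicted not in ALLOWED_LABELS:
        return 0.0  # structure check failed: output is not one of the allowed labels
    if predicted == expected:
        return 1.0  # content check passed: exact label match
    return 0.25  # well-formed but wrong label; the partial-credit value is arbitrary
```

Separating the structure check from the content check makes it easier to tell a formatting failure (the model added an explanation) apart from a genuine misclassification.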
43+
44+
45+
> [!NOTE]
46+
> You can create evaluation runs using:
47+
> - **Built-in metrics**: Includes AI-assisted quality, NLP-based metrics (e.g., ROUGE, BLEU), and safety checks (e.g., self-harm, hate speech).
48+
> - **Custom flows**: Upload datasets (CSV or JSONL), configure evaluation targets (fine-tuned model or dataset), and map data columns to metric inputs. Read more about it here: [Evaluate generative AI models and applications by using Microsoft Foundry](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-generative-ai-app?view=foundry-classic)
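
To make the dataset and column-mapping step concrete, the sketch below writes a small JSONL file and checks that every mapped column exists in each row before upload. The column names (`query`, `response`, `ground_truth`), the sample rows, and the `COLUMN_MAPPING` dictionary are assumptions for illustration only; the evaluation wizard defines which inputs each metric actually needs.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: the column names are illustrative, not required by the service.
ROWS = [
    {"query": "gutar pick", "response": "misspelling", "ground_truth": "misspelling"},
    {"query": "store hours", "response": "customer service information",
     "ground_truth": "customer service information"},
]

# Hypothetical mapping from metric input name to dataset column name.
COLUMN_MAPPING = {"response": "response", "ground_truth": "ground_truth"}


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSONL format)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def missing_columns(rows, mapping):
    """Return mapped columns that are absent from at least one row."""
    return sorted({col for col in mapping.values() for row in rows if col not in row})


dataset_path = write_jsonl(ROWS, Path(tempfile.mkdtemp()) / "eval_dataset.jsonl")
loaded = read_jsonl(dataset_path)
print(len(loaded), "rows; missing columns:", missing_columns(loaded, COLUMN_MAPPING))
```

Running a check like this locally catches missing or misspelled columns before the evaluation run is created.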
49+
50+
> Evaluation Targets:
51+
52+
| Target | Description |
53+
|--------|-------------|
54+
| Fine-tuned model | Evaluates outputs generated during testing |
55+
| Dataset | Evaluates pre-generated outputs stored in a dataset |
56+
57+
> Metric Categories:
58+
59+
| Category | Description | Key Details |
60+
|----------|-------------|-------------|
61+
| AI Quality (AI-assisted) | Evaluates output quality using AI models | Requires a model deployment |
62+
| AI Quality (NLP) | Evaluates output quality using mathematical text-comparison metrics | Uses F1, ROUGE, and BLEU scores |
63+
| Risk & Safety | Detects harmful or inappropriate content | Content safety evaluation |
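
For intuition about the NLP metrics in the table, here is a minimal sketch of token-level precision, recall, and F1 between a response and a ground-truth string. It uses simple whitespace tokenization and is an assumption-laden illustration of the general technique, not the implementation used by the built-in evaluators.

```python
from collections import Counter


def token_prf(response: str, ground_truth: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) based on shared whitespace-separated tokens."""
    pred = response.lower().split()
    ref = ground_truth.lower().split()
    if not pred or not ref:
        return (0.0, 0.0, 0.0)

    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return (0.0, 0.0, 0.0)

    precision = overlap / len(pred)  # share of response tokens that are correct
    recall = overlap / len(ref)      # share of ground-truth tokens that were produced
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return (precision, recall, f1)


precision, recall, f1 = token_prf("the cat sat", "the cat")
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this example the response contains both ground-truth tokens plus one extra, so recall is perfect while precision is penalized for the extra token.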
64+
65+
<img width="909" height="484" alt="image" src="https://github.com/user-attachments/assets/4c432506-dc16-4baa-966c-c8de17f57852" />
66+
67+
From [Observability in generative AI](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability?view=foundry-classic)
68+
69+
## How does it work?
70+
71+
> Here’s a handy reference with the details on the parameters for textual similarity: [Textual similarity evaluators](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/textual-similarit…). The approach is to experiment with different configurations and keep the one that gives the best results for the intended use case.
72+
73+
For example:
74+
75+
<img width="1581" height="828" alt="image" src="https://github.com/user-attachments/assets/2fa49964-fe97-40ed-ad2e-31ad4b7ba275" />
76+
77+
> System Prompt used:
78+
79+
~~~text
80+
You are a search-query string checker. For each user query, output exactly one label from this list: misspelling, brand name, heritage month, holiday, industry, customer service information, product use, event.
81+
misspelling: common word or brand spelled incorrectly.
82+
brand name: references to specific companies, product brands, or branded product phrases.
83+
heritage month: terms tied to cultural observances (e.g., Pride Month).
84+
holiday: religious or cultural holidays.
85+
industry: sectors of business or commerce (e.g., plumbing, technology).
86+
customer service information: phrases seeking store policies, coupons, hours, contacts, or similar help.
87+
product use: phrases describing what an item is for (e.g., “guitar pick”).
88+
event: occasions or gatherings such as weddings or concerts.
89+
If none apply, choose the closest match. Do not provide explanations, return only the label.
90+
~~~
91+
92+
> Please make any adjustments as you see fit:
93+
94+
<img width="1570" height="835" alt="image" src="https://github.com/user-attachments/assets/60bb1e75-370d-4bef-849e-ec42642b28b2" />
95+
96+
> Add the test criteria, for example:
97+
98+
<img width="1587" height="837" alt="image" src="https://github.com/user-attachments/assets/ba86935a-b4a8-459f-bd91-c7170b39ebb4" />
99+
100+
> Example of values used:
101+
> - F1 = 0.8
102+
> - Precision = 0.85
103+
> - Recall = 0.85
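
The sketch below shows one way such values could be applied, assuming the example values above are pass thresholds for each criterion. That interpretation, the metric names, and the sample scores are assumptions for illustration. It also shows that F1 is the harmonic mean of precision and recall, so a run that only just meets both 0.85 thresholds will score about 0.85 on F1 and clear the looser 0.8 threshold.

```python
# Thresholds copied from the example values above; treating them as
# pass thresholds is an assumption.
THRESHOLDS = {"f1": 0.8, "precision": 0.85, "recall": 0.85}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_criteria(scores: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Map each metric name to True when its score meets or exceeds the threshold."""
    return {name: scores.get(name, 0.0) >= limit for name, limit in thresholds.items()}


# Hypothetical aggregate scores from an evaluation run.
scores = {"precision": 0.9, "recall": 0.8}
scores["f1"] = f1_from(scores["precision"], scores["recall"])
print(check_criteria(scores))
```

Here precision passes, recall falls short of its threshold, and the resulting F1 still clears 0.8, which shows why the criteria are worth checking individually.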
104+
105+
<img width="1583" height="826" alt="image" src="https://github.com/user-attachments/assets/d649259c-23d9-4452-a719-5431724f757c" />
106+
107+
108+
<!-- START BADGE -->
109+
<div align="center">
110+
<img src="https://img.shields.io/badge/Total%20views-1532-limegreen" alt="Total views">
111+
<p>Refresh Date: 2025-10-23</p>
112+
</div>
113+
<!-- END BADGE -->
