Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
In today's AI-driven world, Generative AI Operations (GenAIOps) is revolutionizing how organizations build and deploy intelligent systems. As companies increasingly use AI agents and applications to transform decision-making, enhance customer experiences, and fuel innovation, one element stands paramount: robust evaluation frameworks. Evaluation isn't just a checkpoint. It's the foundation of quality and trust in AI applications. Without rigorous assessment and monitoring, AI systems can produce content that's:
- Fabricated or ungrounded in reality
- Irrelevant or incoherent
- Harmful in perpetuating content risks and stereotypes
- Dangerous in spreading misinformation
- Vulnerable to security exploits
This is where observability becomes essential. Observability capabilities measure both the frequency and severity of risks in AI outputs, enabling teams to systematically address quality, safety, and security concerns throughout the entire AI development journey, from selecting the right model to monitoring production performance, quality, and safety.
What is observability?
AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.
What are evaluators?
Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users. The following supported evaluators provide comprehensive assessment capabilities across different AI application types and concerns:
General purpose
| Evaluator | Purpose | Inputs |
|---|---|---|
| Coherence | Measures logical consistency and flow of responses. | Query, response |
| Fluency | Measures natural language quality and readability. | Response |
| QA | Provides a comprehensive measurement of various quality aspects in question answering. | Query, context, response, ground truth |
To learn more, see General purpose evaluators.
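The following minimal sketch shows one way to run two of these evaluators locally with the azure-ai-evaluation Python package. The environment variables, the gpt-4o deployment name, and the sample query and response are placeholder assumptions, not values prescribed by this article.

```python
import os

from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

# Configuration for the AI-assisted judge model (an Azure OpenAI deployment).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical deployment name
}

coherence = CoherenceEvaluator(model_config=model_config)
fluency = FluencyEvaluator(model_config=model_config)

query = "What is the capital of France?"
response = "Paris is the capital of France."

# Each evaluator returns a dict containing its score and reasoning.
print(coherence(query=query, response=response))
print(fluency(response=response))
```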
Textual similarity
| Evaluator | Purpose | Inputs |
|---|---|---|
| Similarity | AI-assisted textual similarity measurement. | Query, context, ground truth |
| F1 Score | Harmonic mean of precision and recall in token overlaps between response and ground truth. | Response, ground truth |
| BLEU | Bilingual Evaluation Understudy score for translation quality measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| GLEU | Google-BLEU variant for sentence-level assessment measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering measures overlaps in n-grams between response and ground truth. | Response, ground truth |
To learn more, see Textual similarity evaluators.
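Because these metrics are math-based, they don't need a judge model. The following sketch assumes the azure-ai-evaluation package and uses illustrative sample strings.

```python
from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator, MeteorScoreEvaluator

response = "The capital of Japan is Tokyo."
ground_truth = "Tokyo is Japan's capital city."

# These evaluators are purely math-based, so no model configuration is needed.
f1 = F1ScoreEvaluator()
bleu = BleuScoreEvaluator()
meteor = MeteorScoreEvaluator()

print(f1(response=response, ground_truth=ground_truth))
print(bleu(response=response, ground_truth=ground_truth))
print(meteor(response=response, ground_truth=ground_truth))
```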
RAG (retrieval augmented generation)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Retrieval | Measures how effectively the system retrieves relevant information. | Query, context |
| Document Retrieval (preview) | Measures accuracy in retrieval results given ground truth. | Ground truth, retrieved documents |
| Groundedness | Measures how consistent the response is with respect to the retrieved context. | Query (optional), context, response |
| Groundedness Pro (preview) | Measures whether the response is consistent with respect to the retrieved context. | Query, context, response |
| Relevance | Measures how relevant the response is with respect to the query. | Query, response |
| Response Completeness (preview) | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. | Response, ground truth |
To learn more, see Retrieval-augmented Generation (RAG) evaluators.
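The following sketch shows a possible local run of three RAG evaluators with the azure-ai-evaluation package; the judge model configuration and the sample query, context, and response are assumptions for illustration.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, RetrievalEvaluator

# Judge model configuration; the endpoint, key, and deployment are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

query = "Does the Standard plan cover emergency services?"
context = "The Standard plan covers emergency services both in network and out of network."
response = "Yes, emergency services are covered in and out of network."

groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)
retrieval = RetrievalEvaluator(model_config=model_config)

print(groundedness(query=query, context=context, response=response))
print(relevance(query=query, response=response))
print(retrieval(query=query, context=context))
```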
Safety and security (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. | Query, response |
| Sexual | Identifies inappropriate sexual content. | Query, response |
| Violence | Detects violent content or incitement. | Query, response |
| Self-Harm | Detects content promoting or describing self-harm. | Query, response |
| Content Safety | Comprehensive assessment of various safety concerns. | Query, response |
| Protected Materials | Detects unauthorized use of copyrighted or protected content. | Query, response |
| Code Vulnerability | Identifies security issues in generated code. | Query, response |
| Ungrounded Attributes | Detects fabricated or hallucinated information inferred from user interactions. | Query, context, response |
To learn more, see Risk and safety evaluators.
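Risk and safety evaluators are service-based rather than judge-model-based: they call into your Foundry project instead of an Azure OpenAI deployment. The following sketch assumes the azure-ai-evaluation and azure-identity packages and a placeholder project endpoint environment variable.

```python
import os

from azure.ai.evaluation import ContentSafetyEvaluator, ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Depending on the SDK version, azure_ai_project is either the Foundry project
# endpoint URL (shown here) or a dict with subscription_id, resource_group_name,
# and project_name.
azure_ai_project = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
credential = DefaultAzureCredential()

violence = ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=credential)
content_safety = ContentSafetyEvaluator(azure_ai_project=azure_ai_project, credential=credential)

query = "Summarize the plot of the novel."
response = "The novel follows a detective investigating a series of art thefts."

# Results include a severity label, a numeric score, and the reasoning behind them.
print(violence(query=query, response=response))
print(content_safety(query=query, response=response))
```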
Agents (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions. | Query, response, tool definitions (optional) |
| Task Adherence | Measures whether the agent follows through on identified tasks according to system instructions. | Query, response, tool definitions (optional) |
| Task Completion | Measures whether the agent successfully completed the requested task end-to-end. | Query, response, tool definitions (optional) |
| Task Navigation Efficiency | Determines whether the agent's sequence of steps matches an optimal or expected path to measure efficiency. | Response, ground truth |
| Tool Call Accuracy | Measures the overall quality of tool calls, including selection, parameter correctness, and efficiency. | Query, tool definitions, tool calls (optional), response |
| Tool Selection | Measures whether the agent selected the most appropriate and efficient tools for a task. | Query, tool definitions, tool calls (optional), response |
| Tool Input Accuracy | Validates that all tool call parameters are correct against strict criteria, including grounding, type, format, completeness, and appropriateness. | Query, response, tool definitions |
| Tool Output Utilization | Measures whether the agent correctly interprets and uses tool outputs contextually in responses and subsequent calls. | Query, response, tool definitions (optional) |
| Tool Call Success | Evaluates whether all tool calls executed successfully without technical failures. | Response, tool definitions (optional) |
To learn more, see Agent evaluators.
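As a rough illustration, the following sketch runs two agent evaluators (currently in preview) on a simple query and response pair; the judge model configuration and sample strings are assumptions, and in practice you would typically evaluate converted agent thread data that includes tool calls.

```python
import os

from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator

# Judge model configuration; endpoint, key, and deployment name are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)
task_adherence = TaskAdherenceEvaluator(model_config=model_config)

query = "Book a table for two at an Italian restaurant tomorrow at 7 PM."
response = "I've booked a table for two at an Italian restaurant for tomorrow at 7 PM."

print(intent_resolution(query=query, response=response))
print(task_adherence(query=query, response=response))
```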
Azure OpenAI graders (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Model Labeler | Classifies content using custom guidelines and labels. | Query, response, ground truth |
| String Checker | Performs flexible text validations and pattern matching. | Response |
| Text Similarity | Evaluates the quality of text or determines semantic closeness. | Response, ground truth |
| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. | Query, response, ground truth |
To learn more, see Azure OpenAI Graders.
Evaluators in the development lifecycle
By using these evaluators strategically throughout the development lifecycle, teams can build more reliable, safe, and effective AI applications that meet user needs while minimizing potential risks.
The three stages of GenAIOps evaluation
GenAIOps uses the following three stages.
Base model selection
Before building your application, you need to select the right foundation. This initial evaluation helps you compare different models based on:
- Quality and accuracy: How relevant and coherent are the model's responses?
- Task performance: Does the model handle your specific use cases efficiently?
- Ethical considerations: Is the model free from harmful biases?
- Safety profile: What is the risk of generating unsafe content?
Tools available: Microsoft Foundry benchmark for comparing models on public datasets or your own data, and the Azure AI Evaluation SDK for testing specific model endpoints.
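As a rough sketch of endpoint testing with the SDK, the following example scores answers from two hypothetical Azure OpenAI deployments with the same relevance evaluator. The deployment names, API version, and single sample query are illustrative assumptions; a real comparison would run a full evaluation dataset rather than one prompt.

```python
import os

from azure.ai.evaluation import RelevanceEvaluator
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# The judge model that scores each candidate's answers.
judge_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}
relevance = RelevanceEvaluator(model_config=judge_config)

query = "Summarize the benefits of solar energy in one sentence."

# Hypothetical candidate deployments to compare.
for deployment in ["gpt-4o-mini", "gpt-4o"]:
    completion = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": query}],
    )
    answer = completion.choices[0].message.content
    print(deployment, relevance(query=query, response=answer))
```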
Preproduction evaluation
After you select a base model, the next step is to develop an AI agent or application. Before you deploy to a production environment, thorough testing is essential to ensure that the AI agent or application is ready for real-world use.
Preproduction evaluation involves:
- Testing with evaluation datasets: These datasets simulate realistic user interactions to ensure the AI agent performs as expected.
- Identifying edge cases: Finding scenarios where the AI agent's response quality might degrade or produce undesirable outputs.
- Assessing robustness: Ensuring that the AI agent can handle a range of input variations without significant drops in quality or safety.
- Measuring key metrics: Metrics such as task adherence, response groundedness, relevance, and safety are evaluated to confirm readiness for production.
The preproduction stage acts as a final quality check, reducing the risk of deploying an AI agent or application that doesn't meet the desired performance or safety standards.
Evaluation Tools and Approaches:
Bring your own data: You can evaluate your AI agents and applications in preproduction by using your own evaluation data with the supported evaluators, including generation quality, safety, or custom evaluators. Run evaluations through Foundry's evaluation wizard or the Azure AI Evaluation SDK, and view the results in the Foundry portal, as outlined in the sketch below.
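The sketch below outlines such a local run with the Azure AI Evaluation SDK; the data.jsonl file name, its column names, and the judge deployment are placeholder assumptions.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment
}

# data.jsonl is a placeholder file in which each line has query, context, and
# response fields that the evaluators map to by name.
result = evaluate(
    data="data.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
    },
    # Optionally log the run to your Foundry project so it appears in the portal:
    # azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    output_path="./evaluation_results.json",
)

print(result["metrics"])  # aggregate scores across all rows
```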
Simulators and AI red teaming agent: If you don't have evaluation data (test data), Azure AI Evaluation SDK's simulators can help by generating topic-related or adversarial queries. These simulators test the model's response to situation-appropriate or attack-like queries (edge cases).
- AI red teaming agent simulates complex adversarial attacks against your AI system by drawing on a broad range of safety and security attacks from Microsoft's open-source framework, the Python Risk Identification Tool (PyRIT).
- Adversarial simulators inject static queries that mimic potential safety risks or security attacks, such as attempted jailbreaks, to help identify limitations and prepare the model for unexpected conditions (see the sketch after this list).
- Context-appropriate simulators generate typical, relevant conversations you'd expect from users to test quality of responses. With context-appropriate simulators you can assess metrics such as groundedness, relevance, coherence, and fluency of generated responses.
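The following sketch, referenced from the adversarial simulator item above, shows one possible way to generate adversarial test queries when you have no test data yet. The project endpoint variable, the scenario choice, and the stub callback are assumptions; a real callback would call your application or agent, and parameter names can vary across preview SDK versions.

```python
import asyncio
import os

from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential


async def target_callback(messages, stream=False, session_state=None, context=None):
    # Replace this stub with a call into your own model, application, or agent.
    reply = {"role": "assistant", "content": "I'm sorry, I can't help with that request."}
    messages["messages"].append(reply)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


async def main():
    simulator = AdversarialSimulator(
        # Depending on the SDK version, this is the project endpoint URL or a
        # dict with subscription_id, resource_group_name, and project_name.
        azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
        credential=DefaultAzureCredential(),
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=target_callback,
        max_simulation_results=5,
    )
    # Each output is a simulated conversation that can be written to JSONL and
    # scored with the risk and safety evaluators.
    for conversation in outputs:
        print(conversation)


asyncio.run(main())
```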
Automated scans using the AI red teaming agent enhance preproduction risk assessment by systematically testing AI applications for risks. This process uses simulated attack scenarios to identify weaknesses in model responses before real-world deployment. By running AI red teaming scans, you can detect and mitigate potential safety issues before deployment. We recommend pairing this tool with human-in-the-loop processes, such as conventional AI red teaming probing, to accelerate risk identification and support assessment by a human expert. A scan can be started from the SDK, as sketched below.
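The following sketch outlines what starting such a scan might look like, using the preview red teaming module of azure-ai-evaluation. The project endpoint, risk categories, attack strategies, scan name, and stub target are assumptions, and parameter names can differ across preview versions.

```python
import asyncio
import os

from azure.ai.evaluation.red_team import AttackStrategy, RedTeam, RiskCategory
from azure.identity import DefaultAzureCredential


def simple_target(query: str) -> str:
    # Replace with a call into your deployed model, application, or agent.
    return "I'm sorry, I can't help with that."


async def main():
    red_team = RedTeam(
        azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],  # placeholder
        credential=DefaultAzureCredential(),
        risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
        num_objectives=5,  # number of attack objectives per risk category
    )
    result = await red_team.scan(
        target=simple_target,
        scan_name="preproduction-scan",
        attack_strategies=[AttackStrategy.Base64, AttackStrategy.Flip],
    )
    print(result)


asyncio.run(main())
```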
Alternatively, you can also use the Foundry portal for testing your generative AI applications.
After you get satisfactory results, you can deploy the AI application to production.
Post-production monitoring
After deployment, continuous monitoring ensures your AI application maintains quality in real-world conditions.
- Operational metrics: Regular measurement of key AI agent operational metrics.
- Continuous evaluation: Enables quality and safety evaluation of production traffic at a sampled rate.
- Scheduled evaluation: Enables scheduled quality and safety evaluation using a test dataset to detect drift in the underlying systems.
- Scheduled red teaming: Provides scheduled adversarial testing capabilities to probe for safety and security vulnerabilities.
- Azure Monitor alerts: Swift action when harmful or inappropriate outputs occur. Set up alerts for continuous evaluation to be notified when evaluation results drop below the pass rate threshold in production.
Effective monitoring helps maintain user trust and allows for rapid issue resolution.
Observability provides comprehensive monitoring capabilities essential for today's complex and rapidly evolving AI landscape. Seamlessly integrated with Azure Monitor Application Insights, this solution enables continuous monitoring of deployed AI applications to ensure optimal performance, safety, and quality in production environments.
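For teams instrumenting their application code directly, the following sketch shows one way to send OpenTelemetry traces to Application Insights with the azure-monitor-opentelemetry package; the connection string environment variable, span name, and attribute value are illustrative assumptions.

```python
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Route OpenTelemetry traces, logs, and metrics to Application Insights.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("generate_answer") as span:
    # Attach attributes that make production issues easier to diagnose.
    span.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative attribute
    # ... call your model or agent here ...
    pass
```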
The Foundry Observability dashboard delivers real-time insights into critical metrics. It allows teams to quickly identify and address performance issues, safety concerns, or quality degradation.
For Agent-based applications, Foundry offers enhanced continuous evaluation capabilities. These capabilities can provide deeper visibility into quality and safety metrics. They can create a robust monitoring ecosystem that adapts to the dynamic nature of AI applications while maintaining high standards of performance and reliability.
By continuously monitoring the AI application's behavior in production, you can maintain high-quality user experiences and swiftly address any issues that surface.
Building trust through systematic evaluation
GenAIOps establishes a reliable process for managing AI applications throughout their lifecycle. By implementing thorough evaluation at each stage—from model selection through deployment and beyond—teams can create AI solutions that aren't just powerful but trustworthy and safe.
Evaluation cheat sheet
| Purpose | Process | Parameters |
|---|---|---|
| What are you evaluating for? | Identify or build relevant evaluators | - Quality and performance (Quality and performance sample notebook) - RAG Quality - Agents Response Quality - Safety and Security (Safety and Security sample notebook) - Custom (Custom sample notebook) |
| What data should you use? | Upload or generate relevant dataset | - Generic simulator for measuring Quality and Performance (Generic simulator sample notebook) - Adversarial simulator for measuring Safety and Security (Adversarial simulator sample notebook) - Synthetic dataset generation - AI red teaming agent for running automated scans to assess safety and security vulnerabilities (AI red teaming agent sample notebook) |
| How to run evaluations on a dataset? | Run evaluation | - Agent evaluation runs - Remote cloud run - Local run |
| How did my model/app perform? | Analyze results | - View aggregate scores, view details, score details, compare evaluation runs |
| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results didn't align to human feedback, adjust your evaluator. - If evaluation results aligned to human feedback but didn't meet quality/safety thresholds, apply targeted mitigations. Example of mitigations to apply: Azure AI Content Safety |
Bring your own virtual network for evaluation
For network isolation purposes, you can bring your own virtual network for evaluation. To learn more, see How to configure a private link.
Note
Evaluation data is sent to Application Insights if Application Insights is connected. Virtual Network support for Application Insights and tracing isn't available.
Virtual network region support
| Geography | Supported Azure region |
|---|---|
| US | westus, westus3, eastus, eastus2 |
| Australia | australiaeast |
| France | francecentral |
| India | southindia |
| Japan | japaneast |
| Norway | norwayeast |
| Sweden | swedencentral |
| Switzerland | switzerlandnorth |
| UAE | uaenorth |
| UK | uksouth |
Region support
Currently, certain AI-assisted evaluators are available only in the following regions:
| Region | Hate and unfairness, Sexual, Violent, Self-harm, Indirect attack, Code vulnerabilities, Ungrounded attributes | Groundedness Pro | Protected material |
|---|---|---|---|
| East US 2 | Supported | Supported | Supported |
| Sweden Central | Supported | Supported | N/A |
| US North Central | Supported | N/A | N/A |
| France Central | Supported | N/A | N/A |
| Switzerland West | Supported | N/A | N/A |
Agent playground evaluation region support
| Region | Status |
|---|---|
| East US | Supported |
| East US 2 | Supported |
| West US | Supported |
| West US 2 | Supported |
| West US 3 | Supported |
| France Central | Supported |
| Norway East | Supported |
| Sweden Central | Supported |
Pricing
Observability features such as Risk and Safety Evaluations and Continuous Evaluations are billed based on consumption, as listed on the Azure pricing page.