Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
In today's AI-driven world, Generative AI Operations (GenAIOps) is revolutionizing how organizations build and deploy intelligent systems. As companies increasingly use AI agents and applications to transform decision-making, enhance customer experiences, and fuel innovation, one element stands paramount: robust evaluation frameworks. Evaluation isn't just a checkpoint. It's the foundation of quality and trust in AI applications. Without rigorous assessment and monitoring, AI systems can produce content that's:
- Fabricated or ungrounded in reality
- Irrelevant or incoherent
- Harmful in perpetuating content risks and stereotypes
- Dangerous in spreading misinformation
- Vulnerable to security exploits
This is where observability becomes essential. Observability capabilities measure both the frequency and severity of risks in AI outputs, enabling teams to systematically address quality, safety, and security concerns throughout the entire AI development journey, from selecting the right model to monitoring production performance, quality, and safety.
What is observability?
AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.
What are evaluators?
Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users. The following supported evaluators provide comprehensive assessment capabilities across different AI application types and concerns:
General purpose
| Evaluator | Purpose | Inputs |
|---|---|---|
| Coherence | Measures logical consistency and flow of responses. | Query, response |
| Fluency | Measures natural language quality and readability. | Response |
| QA | Provides a comprehensive measurement of various quality aspects in question answering. | Query, context, response, ground truth |
To learn more, see General purpose evaluators.
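The following minimal sketch shows one way to run two of these evaluators locally with the azure-ai-evaluation Python package. The environment variables, the gpt-4o deployment name, and the sample query and response are placeholder assumptions, not values prescribed by this article.

```python
import os

from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

# Configuration for the AI-assisted judge model (an Azure OpenAI deployment).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical deployment name
}

coherence = CoherenceEvaluator(model_config=model_config)
fluency = FluencyEvaluator(model_config=model_config)

query = "What is the capital of France?"
response = "Paris is the capital of France."

# Each evaluator returns a dict containing its score and reasoning.
print(coherence(query=query, response=response))
print(fluency(response=response))
```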
Textual similarity
| Evaluator | Purpose | Inputs |
|---|---|---|
| Similarity | AI-assisted textual similarity measurement. | Query, context, ground truth |
| F1 Score | Harmonic mean of precision and recall in token overlaps between response and ground truth. | Response, ground truth |
| BLEU | Bilingual Evaluation Understudy score for translation quality measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| GLEU | Google-BLEU variant for sentence-level assessment measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation measures overlaps in n-grams between response and ground truth. | Response, ground truth |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering measures overlaps in n-grams between response and ground truth. | Response, ground truth |
To learn more, see Textual similarity evaluators.
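Because these metrics are math-based, they don't need a judge model. The following sketch assumes the azure-ai-evaluation package and uses illustrative sample strings.

```python
from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator, MeteorScoreEvaluator

response = "The capital of Japan is Tokyo."
ground_truth = "Tokyo is Japan's capital city."

# These evaluators are purely math-based, so no model configuration is needed.
f1 = F1ScoreEvaluator()
bleu = BleuScoreEvaluator()
meteor = MeteorScoreEvaluator()

print(f1(response=response, ground_truth=ground_truth))
print(bleu(response=response, ground_truth=ground_truth))
print(meteor(response=response, ground_truth=ground_truth))
```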
RAG (retrieval augmented generation)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Retrieval | Measures how effectively the system retrieves relevant information. | Query, context |
| Document Retrieval (preview) | Measures accuracy in retrieval results given ground truth. | Ground truth, retrieved documents |
| Groundedness | Measures how consistent the response is with respect to the retrieved context. | Query (optional), context, response |
| Groundedness Pro (preview) | Measures whether the response is consistent with respect to the retrieved context. | Query, context, response |
| Relevance | Measures how relevant the response is with respect to the query. | Query, response |
| Response Completeness (preview) | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. | Response, ground truth |
To learn more, see Retrieval-augmented Generation (RAG) evaluators.
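The following sketch shows a possible local run of three RAG evaluators with the azure-ai-evaluation package; the judge model configuration and the sample query, context, and response are assumptions for illustration.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, RetrievalEvaluator

# Judge model configuration; the endpoint, key, and deployment are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

query = "Does the Standard plan cover emergency services?"
context = "The Standard plan covers emergency services both in network and out of network."
response = "Yes, emergency services are covered in and out of network."

groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)
retrieval = RetrievalEvaluator(model_config=model_config)

print(groundedness(query=query, context=context, response=response))
print(relevance(query=query, response=response))
print(retrieval(query=query, context=context))
```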
Safety and security (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. | Query, response |
| Sexual | Identifies inappropriate sexual content. | Query, response |
| Violence | Detects violent content or incitement. | Query, response |
| Self-Harm | Detects content promoting or describing self-harm. | Query, response |
| Content Safety | Comprehensive assessment of various safety concerns. | Query, response |
| Protected Materials | Detects unauthorized use of copyrighted or protected content. | Query, response |
| Code Vulnerability | Identifies security issues in generated code. | Query, response |
| Ungrounded Attributes | Detects fabricated or hallucinated information inferred from user interactions. | Query, context, response |
To learn more, see Risk and safety evaluators.
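Risk and safety evaluators are service-based rather than judge-model-based: they call into your Foundry project instead of an Azure OpenAI deployment. The following sketch assumes the azure-ai-evaluation and azure-identity packages and a placeholder project endpoint environment variable.

```python
import os

from azure.ai.evaluation import ContentSafetyEvaluator, ViolenceEvaluator
from azure.identity import DefaultAzureCredential

# Depending on the SDK version, azure_ai_project is either the Foundry project
# endpoint URL (shown here) or a dict with subscription_id, resource_group_name,
# and project_name.
azure_ai_project = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
credential = DefaultAzureCredential()

violence = ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=credential)
content_safety = ContentSafetyEvaluator(azure_ai_project=azure_ai_project, credential=credential)

query = "Summarize the plot of the novel."
response = "The novel follows a detective investigating a series of art thefts."

# Results include a severity label, a numeric score, and the reasoning behind them.
print(violence(query=query, response=response))
print(content_safety(query=query, response=response))
```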
Agents (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions. | Query, response, tool definitions (optional) |
| Task Adherence | Measures whether the agent follows through on identified tasks according to system instructions. | Query, response, tool definitions (optional) |
| Task Completion | Measures whether the agent successfully completed the requested task end-to-end. | Query, response, tool definitions (optional) |
| Task Navigation Efficiency | Determines whether the agent's sequence of steps matches an optimal or expected path to measure efficiency. | Response, ground truth |
| Tool Call Accuracy | Measures the overall quality of tool calls, including selection, parameter correctness, and efficiency. | Query, tool definitions, tool calls (optional), response |
| Tool Selection | Measures whether the agent selected the most appropriate and efficient tools for a task. | Query, tool definitions, tool calls (optional), response |
| Tool Input Accuracy | Validates that all tool call parameters are correct against strict criteria, including grounding, type, format, completeness, and appropriateness. | Query, response, tool definitions |
| Tool Output Utilization | Measures whether the agent correctly interprets and uses tool outputs contextually in responses and subsequent calls. | Query, response, tool definitions (optional) |
| Tool Call Success | Evaluates whether all tool calls executed successfully without technical failures. | Response, tool definitions (optional) |
To learn more, see Agent evaluators.
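As a rough illustration, the following sketch runs two agent evaluators (currently in preview) on a simple query and response pair; the judge model configuration and sample strings are assumptions, and in practice you would typically evaluate converted agent thread data that includes tool calls.

```python
import os

from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator

# Judge model configuration; endpoint, key, and deployment name are placeholders.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)
task_adherence = TaskAdherenceEvaluator(model_config=model_config)

query = "Book a table for two at an Italian restaurant tomorrow at 7 PM."
response = "I've booked a table for two at an Italian restaurant for tomorrow at 7 PM."

print(intent_resolution(query=query, response=response))
print(task_adherence(query=query, response=response))
```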
Azure OpenAI graders (preview)
| Evaluator | Purpose | Inputs |
|---|---|---|
| Model Labeler | Classifies content using custom guidelines and labels. | Query, response, ground truth |
| String Checker | Performs flexible text validations and pattern matching. | Response |
| Text Similarity | Evaluates the quality of text or determines semantic closeness. | Response, ground truth |
| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. | Query, response, ground truth |
To learn more, see Azure OpenAI Graders.
Evaluators in the development lifecycle
By using these evaluators strategically throughout the development lifecycle, teams can build more reliable, safe, and effective AI applications that meet user needs while minimizing potential risks.
The three stages of GenAIOps evaluation
GenAIOps uses the following three stages.
Base model selection
Before building your application, you need to select the right foundation. This initial evaluation helps you compare different models based on:
- Quality and accuracy: How relevant and coherent are the model's responses?
- Task performance: Does the model handle your specific use cases efficiently?
- Ethical considerations: Is the model free from harmful biases?
- Safety profile: What is the risk of generating unsafe content?
Tools available: Microsoft Foundry benchmark for comparing models on public datasets or your own data, and the Azure AI Evaluation SDK for testing specific model endpoints.
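As a rough sketch of endpoint testing with the SDK, the following example scores answers from two hypothetical Azure OpenAI deployments with the same relevance evaluator. The deployment names, API version, and single sample query are illustrative assumptions; a real comparison would run a full evaluation dataset rather than one prompt.

```python
import os

from azure.ai.evaluation import RelevanceEvaluator
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# The judge model that scores each candidate's answers.
judge_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}
relevance = RelevanceEvaluator(model_config=judge_config)

query = "Summarize the benefits of solar energy in one sentence."

# Hypothetical candidate deployments to compare.
for deployment in ["gpt-4o-mini", "gpt-4o"]:
    completion = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": query}],
    )
    answer = completion.choices[0].message.content
    print(deployment, relevance(query=query, response=answer))
```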
Preproduction evaluation
After you select a base model, the next step is to develop an AI agent or application. Before you deploy to a production environment, thorough testing is essential to ensure that the AI agent or application is ready for real-world use.
Preproduction evaluation involves:
- Testing with evaluation datasets: These datasets simulate realistic user interactions to ensure the AI agent performs as expected.
- Identifying edge cases: Finding scenarios where the AI agent's response quality might degrade or produce undesirable outputs.
- Assessing robustness: Ensuring that the AI agent can handle a range of input variations without significant drops in quality or safety.
- Measuring key metrics: Metrics such as task adherence, response groundedness, relevance, and safety are evaluated to confirm readiness for production.
The preproduction stage acts as a final quality check, reducing the risk of deploying an AI agent or application that doesn't meet the desired performance or safety standards.
Evaluation Tools and Approaches:
Bring your own data: You can evaluate your AI agents and applications in preproduction by using your own evaluation data with the supported evaluators, including generation quality, safety, or custom evaluators. Run evaluations through Foundry's evaluation wizard or the Azure AI Evaluation SDK, and view the results in the Foundry portal, as outlined in the sketch below.
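The sketch below outlines such a local run with the Azure AI Evaluation SDK; the data.jsonl file name, its column names, and the judge deployment are placeholder assumptions.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment
}

# data.jsonl is a placeholder file in which each line has query, context, and
# response fields that the evaluators map to by name.
result = evaluate(
    data="data.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
    },
    # Optionally log the run to your Foundry project so it appears in the portal:
    # azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
    output_path="./evaluation_results.json",
)

print(result["metrics"])  # aggregate scores across all rows
```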
Simulators and AI red teaming agent: If you don't have evaluation data (test data), Azure AI Evaluation SDK's simulators can help by generating topic-related or adversarial queries. These simulators test the model's response to situation-appropriate or attack-like queries (edge cases).
- AI red teaming agent simulates complex adversarial attacks against your AI system by drawing on a broad range of safety and security attacks from Microsoft's open-source framework, the Python Risk Identification Tool (PyRIT).
- Adversarial simulators inject static queries that mimic potential safety risks or security attacks, such as attempted jailbreaks, to help identify limitations and prepare the model for unexpected conditions (see the sketch after this list).
- Context-appropriate simulators generate typical, relevant conversations you'd expect from users to test quality of responses. With context-appropriate simulators you can assess metrics such as groundedness, relevance, coherence, and fluency of generated responses.
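The following sketch, referenced from the adversarial simulator item above, shows one possible way to generate adversarial test queries when you have no test data yet. The project endpoint variable, the scenario choice, and the stub callback are assumptions; a real callback would call your application or agent, and parameter names can vary across preview SDK versions.

```python
import asyncio
import os

from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential


async def target_callback(messages, stream=False, session_state=None, context=None):
    # Replace this stub with a call into your own model, application, or agent.
    reply = {"role": "assistant", "content": "I'm sorry, I can't help with that request."}
    messages["messages"].append(reply)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


async def main():
    simulator = AdversarialSimulator(
        # Depending on the SDK version, this is the project endpoint URL or a
        # dict with subscription_id, resource_group_name, and project_name.
        azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
        credential=DefaultAzureCredential(),
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=target_callback,
        max_simulation_results=5,
    )
    # Each output is a simulated conversation that can be written to JSONL and
    # scored with the risk and safety evaluators.
    for conversation in outputs:
        print(conversation)


asyncio.run(main())
```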
Automated scans using the AI red teaming agent enhance preproduction risk assessment by systematically testing AI applications for risks. This process uses simulated attack scenarios to identify weaknesses in model responses before real-world deployment. By running AI red teaming scans, you can detect and mitigate potential safety issues before deployment. We recommend pairing this tool with human-in-the-loop processes, such as conventional AI red teaming probing, to accelerate risk identification and support assessment by a human expert. A scan can be started from the SDK, as sketched below.
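The following sketch outlines what starting such a scan might look like, using the preview red teaming module of azure-ai-evaluation. The project endpoint, risk categories, attack strategies, scan name, and stub target are assumptions, and parameter names can differ across preview versions.

```python
import asyncio
import os

from azure.ai.evaluation.red_team import AttackStrategy, RedTeam, RiskCategory
from azure.identity import DefaultAzureCredential


def simple_target(query: str) -> str:
    # Replace with a call into your deployed model, application, or agent.
    return "I'm sorry, I can't help with that."


async def main():
    red_team = RedTeam(
        azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],  # placeholder
        credential=DefaultAzureCredential(),
        risk_categories=[RiskCategory.Violence, RiskCategory.HateUnfairness],
        num_objectives=5,  # number of attack objectives per risk category
    )
    result = await red_team.scan(
        target=simple_target,
        scan_name="preproduction-scan",
        attack_strategies=[AttackStrategy.Base64, AttackStrategy.Flip],
    )
    print(result)


asyncio.run(main())
```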
Alternatively, you can also use the Foundry portal for testing your generative AI applications.
After you get satisfactory results, you can deploy the AI application to production.
Post-production monitoring
After deployment, continuous monitoring ensures your AI application maintains quality in real-world conditions.
- Operational metrics: Regular measurement of key AI agent operational metrics.
- Continuous evaluation: Enables quality and safety evaluation of production traffic at a sampled rate.
- Scheduled evaluation: Enables scheduled quality and safety evaluation using a test dataset to detect drift in the underlying systems.
- Scheduled red teaming: Provides scheduled adversarial testing capabilities to probe for safety and security vulnerabilities.
- Azure Monitor alerts: Swift action when harmful or inappropriate outputs occur. Set up alerts for continuous evaluation to be notified when evaluation results drop below the pass rate threshold in production.
Effective monitoring helps maintain user trust and allows for rapid issue resolution.
Observability provides comprehensive monitoring capabilities essential for today's complex and rapidly evolving AI landscape. Seamlessly integrated with Azure Monitor Application Insights, this solution enables continuous monitoring of deployed AI applications to ensure optimal performance, safety, and quality in production environments.
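For teams instrumenting their application code directly, the following sketch shows one way to send OpenTelemetry traces to Application Insights with the azure-monitor-opentelemetry package; the connection string environment variable, span name, and attribute value are illustrative assumptions.

```python
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Route OpenTelemetry traces, logs, and metrics to Application Insights.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("generate_answer") as span:
    # Attach attributes that make production issues easier to diagnose.
    span.set_attribute("gen_ai.request.model", "gpt-4o")  # illustrative attribute
    # ... call your model or agent here ...
    pass
```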
The Foundry Observability dashboard delivers real-time insights into critical metrics. It allows teams to quickly identify and address performance issues, safety concerns, or quality degradation.
For Agent-based applications, Foundry offers enhanced continuous evaluation capabilities. These capabilities can provide deeper visibility into quality and safety metrics. They can create a robust monitoring ecosystem that adapts to the dynamic nature of AI applications while maintaining high standards of performance and reliability.
By continuously monitoring the AI application's behavior in production, you can maintain high-quality user experiences and swiftly address any issues that surface.
Building trust through systematic evaluation
GenAIOps establishes a reliable process for managing AI applications throughout their lifecycle. By implementing thorough evaluation at each stage—from model selection through deployment and beyond—teams can create AI solutions that aren't just powerful but trustworthy and safe.
Evaluation cheat sheet
| Purpose | Process | Parameters |
|---|---|---|
| What are you evaluating for? | Identify or build relevant evaluators | - Quality and performance (Quality and performance sample notebook) - RAG Quality - Agents Response Quality - Safety and Security (Safety and Security sample notebook) - Custom (Custom sample notebook) |
| What data should you use? | Upload or generate relevant dataset | - Generic simulator for measuring Quality and Performance (Generic simulator sample notebook) - Adversarial simulator for measuring Safety and Security (Adversarial simulator sample notebook) - Synthetic dataset generation - AI red teaming agent for running automated scans to assess safety and security vulnerabilities (AI red teaming agent sample notebook) |
| How to run evaluations on a dataset? | Run evaluation | - Agent evaluation runs - Remote cloud run - Local run |
| How did my model/app perform? | Analyze results | - View aggregate scores, view details, score details, compare evaluation runs |
| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results didn't align to human feedback, adjust your evaluator. - If evaluation results aligned to human feedback but didn't meet quality/safety thresholds, apply targeted mitigations. Example of mitigations to apply: Azure AI Content Safety |
Bring your own virtual network for evaluation
For network isolation purposes, you can bring your own virtual network for evaluation. To learn more, see How to configure a private link.
Note
Evaluation data is sent to Application Insights if Application Insights is connected. Virtual Network support for Application Insights and tracing isn't available.
Virtual network region support
| Geography | Supported Azure region |
|---|---|
| US | westus, westus3, eastus, eastus2 |
| Australia | australiaeast |
| France | francecentral |
| India | southindia |
| Japan | japaneast |
| Norway | norwayeast |
| Sweden | swedencentral |
| Switzerland | switzerlandnorth |
| UAE | uaenorth |
| UK | uksouth |
Region support
Currently, certain AI-assisted evaluators are available only in the following regions:
| Region | Hate and unfairness, Sexual, Violent, Self-harm, Indirect attack, Code vulnerabilities, Ungrounded attributes | Groundedness Pro | Protected material |
|---|---|---|---|
| East US 2 | Supported | Supported | Supported |
| Sweden Central | Supported | Supported | N/A |
| US North Central | Supported | N/A | N/A |
| France Central | Supported | N/A | N/A |
| Switzerland West | Supported | N/A | N/A |
Agent playground evaluation region support
| Region | Status |
|---|---|
| East US | Supported |
| East US 2 | Supported |
| West US | Supported |
| West US 2 | Supported |
| West US 3 | Supported |
| France Central | Supported |
| Norway East | Supported |
| Sweden Central | Supported |
Pricing
Observability features such as Risk and Safety Evaluations and Continuous Evaluations are billed based on consumption, as listed on the Azure pricing page.