Agentic Evaluation Metrics#
Evaluate agent-based and multi-step reasoning models using RAGAS-powered system metrics and a template-based tool-calling metric. These metrics assess tool-calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.
Overview#
Agentic metrics evaluate different aspects of agent behavior:
| Metric | Use Case | Requires Judge | Evaluation Mode |
|---|---|---|---|
| Tool Call Accuracy | Evaluate tool/function call correctness | No | Job only |
| Topic Adherence | Measure topic focus in multi-turn conversations | Yes | Job only |
| Agent Goal Accuracy | Assess whether the agent completed its goal | Yes | Job only |
| Answer Accuracy | Check factual correctness of answers | Yes | Job only |
| Tool Calling (template) | Evaluate tool calls with custom templates | No | Live or Job |
Note
The RAGAS-based metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) are system metrics that only support job-based evaluation. The Tool Calling metric is a template-based metric that supports both live and job evaluation.
Prerequisites#
Before running agentic evaluations:
Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge. Required for all metrics except Tool Call Accuracy and Tool Calling.
API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key.
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices

# The base URL comes from the NMP_BASE_URL environment variable; all resources
# created in this guide are scoped to the "default" workspace.
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Tool Call Accuracy#
Evaluates whether the agent invoked the correct tools with the correct arguments. This is the only RAGAS agentic metric that does not require a judge LLM.
Data Format#
{
"user_input": [
{"content": "What's the weather like in New York?", "type": "human"},
{"content": "Let me check that for you.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
{"content": "It's 75°F and partly cloudy.", "type": "tool"},
{"content": "The weather in New York is 75°F and partly cloudy.", "type": "ai"}
],
"reference_tool_calls": [
{"name": "weather_check", "args": {"location": "New York"}}
]
}
Example#
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/tool-call-accuracy",
"dataset": {
"rows": [
{
"user_input": [
{"content": "What's the weather like in New York?", "type": "human"},
{"content": "Let me check.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
{"content": "75°F, partly cloudy.", "type": "tool"},
{"content": "It's 75°F and partly cloudy in New York.", "type": "ai"}
],
"reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]
}
]
},
"metric_params": {},
"params": {
"parallelism": 16
}
}
)
print(f"Job created: {job.name}")
For job monitoring and results retrieval, see Job Management.
Result#
{
"aggregate_scores": [
{
"name": "tool_call_accuracy",
"count": 1,
"mean": 1.0,
"min": 1.0,
"max": 1.0
}
],
"row_scores": [
{
"index": 0,
"scores": {"tool_call_accuracy": 1.0}
}
]
}
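For intuition, the sketch below shows a naive version of the comparison behind this score: a reference tool call counts as matched only when an identical name-and-arguments call appears in the agent's output. This is an illustration only, not the RAGAS implementation.

```python
# Naive illustration of tool-call matching (NOT the RAGAS implementation):
# a reference call matches only if an identical name-and-args call was made.
def naive_tool_call_score(predicted: list[dict], reference: list[dict]) -> float:
    if not reference:
        return 0.0
    matched = sum(1 for ref_call in reference if ref_call in predicted)
    return matched / len(reference)

predicted = [{"name": "weather_check", "args": {"location": "New York"}}]
reference = [{"name": "weather_check", "args": {"location": "New York"}}]
print(naive_tool_call_score(predicted, reference))  # 1.0
```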
Topic Adherence#
Measures how well the agent maintained focus on assigned topics throughout a conversation. Uses F1, precision, or recall modes.
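For intuition about the scoring modes, the snippet below computes standard set-overlap precision, recall, and F1 over topics. This is a simplified illustration only; in the actual metric, the judge LLM decides which topics the conversation adhered to.

```python
# Simplified illustration of the scoring modes (not the RAGAS implementation):
# precision/recall/F1 over the topics actually discussed vs. the reference topics.
def topic_adherence_modes(discussed: set[str], reference: set[str]) -> dict[str, float]:
    overlap = discussed & reference
    precision = len(overlap) / len(discussed) if discussed else 0.0
    recall = len(overlap) / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = topic_adherence_modes({"health", "nutrition", "travel"}, {"health", "nutrition", "fitness"})
print(scores)  # precision, recall, and f1 are each ~0.67 in this example
```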
Data Format#
{
"user_input": [
{"content": "How do I stay healthy?", "type": "human"},
{"content": "Eat more fruits and vegetables, and exercise regularly.", "type": "ai"}
],
"reference_topics": ["health", "nutrition", "fitness"]
}
Example#
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/topic-adherence",
"metric_params": {
"metric_mode": "f1", # Options: "f1", "precision", "recall"
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct"
}
}
},
"dataset": "my-workspace/my-topic-dataset", # Fileset reference
"params": {
"parallelism": 16
}
}
)
print(f"Job created: {job.name}")
Result#
{
"aggregate_scores": [
{
"name": "topic_adherence(mode=f1)",
"count": 1,
"mean": 0.85,
"min": 0.85,
"max": 0.85
}
],
"row_scores": [
{
"index": 0,
"scores": {"topic_adherence(mode=f1)": 0.85}
}
]
}
Configuration Options#
| Parameter | Type | Default | Description |
|---|---|---|---|
| metric_mode | string |  | Scoring mode: f1, precision, or recall |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |
Agent Goal Accuracy#
Evaluates whether the agent successfully completed the requested task. Supports evaluation with or without a reference outcome.
With Reference#
Compare the agent’s outcome against a known reference:
Data Format#
{
"user_input": [
{"content": "Book a table at a Chinese restaurant for 8pm", "type": "user"},
{"content": "I'll find options for you.", "type": "assistant", "tool_calls": [{"name": "restaurant_search", "args": {"cuisine": "Chinese"}}]},
{"content": "Found: Golden Dragon, Jade Palace", "type": "tool"},
{"content": "I found Golden Dragon and Jade Palace. Which do you prefer?", "type": "assistant"},
{"content": "Golden Dragon please", "type": "user"},
{"content": "Booking now.", "type": "assistant", "tool_calls": [{"name": "restaurant_book", "args": {"name": "Golden Dragon", "time": "8:00pm"}}]},
{"content": "Table booked at Golden Dragon for 8pm.", "type": "tool"},
{"content": "Your table at Golden Dragon is booked for 8pm!", "type": "assistant"}
],
"reference": "Table booked at a Chinese restaurant for 8pm"
}
Example#
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/agent-goal-accuracy",
"metric_params": {
"use_reference": True,
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct"
}
}
},
"dataset": "my-workspace/goal-accuracy-dataset",
"params": {
"parallelism": 16
}
}
)
Result#
{
"aggregate_scores": [
{
"name": "agent_goal_accuracy",
"count": 1,
"mean": 1.0,
"min": 1.0,
"max": 1.0
}
],
"row_scores": [
{
"index": 0,
"scores": {"agent_goal_accuracy": 1.0}
}
]
}
Without Reference#
The judge LLM infers the goal from the conversation context:
Data Format#
{
"user_input": [
{"content": "Set a reminder for my dentist appointment tomorrow at 2pm", "type": "user"},
{"content": "I'll set that reminder for you.", "type": "assistant", "tool_calls": [{"name": "set_reminder", "args": {"title": "Dentist appointment", "date": "tomorrow", "time": "2pm"}}]},
{"content": "Reminder set successfully.", "type": "tool"},
{"content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.", "type": "assistant"}
]
}
Example#
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/agent-goal-accuracy",
"metric_params": {
"use_reference": False,
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct"
}
}
},
"dataset": "my-workspace/goal-accuracy-dataset",
"params": {
"parallelism": 16
}
}
)
Configuration Options#
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_reference | boolean |  | Whether to compare against a reference outcome |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |
Answer Accuracy#
Evaluates the factual correctness of an agent’s answer by comparing it against a reference answer.
Data Format#
{"user_input": "What is the capital of France?", "response": "Paris", "reference": "Paris"}
Example#
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/answer-accuracy",
"metric_params": {
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct"
}
}
},
"dataset": "my-workspace/qa-dataset",
"params": {
"parallelism": 16
}
}
)
Result#
{
"aggregate_scores": [
{
"name": "answer_accuracy",
"count": 2,
"mean": 1.0,
"min": 1.0,
"max": 1.0
}
],
"row_scores": [
{"index": 0, "scores": {"answer_accuracy": 1.0}},
{"index": 1, "scores": {"answer_accuracy": 1.0}}
]
}
Tool Calling (Template)#
A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, it uses configurable templates and produces multiple scores.
Scores Produced#
function_name_accuracy: Accuracy of function names only
function_name_and_args_accuracy: Accuracy of both function names and arguments
Data Format#
Data must use OpenAI-compliant tool calling format:
{
"messages": [
{"role": "user", "content": "Book a table for 2 at 7pm."},
{"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
],
"tool_calls": [
{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
]
}
Note
Function names with dots (.) must be replaced with underscores (_)
Comparison is case-sensitive
Order of tool calls is ignored (supports parallel tool calling)
Tip
To return structured data (a list or dict) like tool_calls, use a single expression such as {{tool_calls}}. The evaluator preserves the original Python type.
Use | tojson only when you need the result as JSON text (a string), for example when embedding in a larger string template.
result = client.evaluation.metrics.evaluate(
dataset={
"rows": [
{
"messages": [
{"role": "user", "content": "Book a table for 2 at 7pm."},
{"role": "assistant", "content": "Booking...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
],
"tool_calls": [
{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
]
}
]
},
metric={
"type": "tool-calling",
"reference": "{{tool_calls}}"
}
)
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean:.2f}")
For job-based evaluation, first create a metric entity, then reference it by URN:
# Step 1: Create the metric entity
client.evaluation.metrics.create(
name="my-tool-calling-metric",
type="tool-calling",
reference="{{tool_calls}}"
)
# Step 2: Create the evaluation job
job = client.evaluation.metric_jobs.create(
spec={
"metric": "my-workspace/my-tool-calling-metric",
"dataset": "my-workspace/tool-calling-dataset",
"params": {
"parallelism": 16
}
}
)
Result#
{
"aggregate_scores": [
{
"name": "function_name_accuracy",
"count": 1,
"mean": 1.0,
"min": 1.0,
"max": 1.0
},
{
"name": "function_name_and_args_accuracy",
"count": 1,
"mean": 1.0,
"min": 1.0,
"max": 1.0
}
]
}
Judge Configuration#
Most agentic metrics require a judge LLM. Configure the judge model in the metric_params of your evaluation job:
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "meta/llama-3.1-70b-instruct",
"api_key_secret": "my-judge-key" # Optional: secret name for API key
},
"inference_params": {
"temperature": 0.1,
"max_tokens": 1024
}
}
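In a job spec, this judge block goes under metric_params, mirroring the examples earlier in this section; for instance, with the Answer Accuracy metric:

```python
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/answer-accuracy",
        "metric_params": {
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct",
                },
                "inference_params": {"temperature": 0.1, "max_tokens": 1024},
            }
        },
        "dataset": "my-workspace/qa-dataset",
        "params": {"parallelism": 16},
    }
)
```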
Important
Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.
Using Reasoning Models#
For models that support extended reasoning (like nvidia/llama-3.3-nemotron-super-49b-v1):
"judge": {
"model": {
"url": "<judge-nim-url>/v1/chat/completions",
"name": "nvidia/llama-3.3-nemotron-super-49b-v1"
},
"system_prompt": "'detailed thinking on'",
"reasoning_params": {
"end_token": "</think>"
}
}
Managing Secrets for Authenticated Endpoints#
If your judge endpoint requires an API key, store it as a secret:
# Create the secret
client.secrets.create(
name="judge-api-key",
data="your-api-key-here"
)
# Reference in your metric_params
"judge": {
"model": {
"url": "https://api.example.com/v1/chat/completions",
"name": "gpt-4",
"format": "openai", # Required for OpenAI-hosted models
"api_key_secret": "judge-api-key"
}
}
For more details on secret management, see Managing Secrets.
Job Management#
After creating a job, see Metrics Job Management to monitor its progress and retrieve results.
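As a rough sketch, a polling loop might look like the following. The retrieval method name and status strings used here (metric_jobs.retrieve, "completed", "failed") are assumptions for illustration; confirm the actual calls in the Metrics Job Management documentation.

```python
import time

# Hypothetical polling loop: the retrieve() method name and the status strings
# below are assumptions, not confirmed SDK behavior.
status = client.evaluation.metric_jobs.retrieve(job.name).status
while status not in ("completed", "failed"):
    time.sleep(10)
    status = client.evaluation.metric_jobs.retrieve(job.name).status
print(f"Job {job.name} finished with status: {status}")
```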
Using Filesets for Large Datasets#
For datasets larger than 10 rows, upload your data to a fileset:
# Create a fileset
client.filesets.create(
name="agentic-eval-dataset",
description="Dataset for agentic evaluation"
)
# Upload data file (JSONL format)
with open("evaluation_data.jsonl", "rb") as f:
client.filesets.upload_file(
name="agentic-eval-dataset",
path="data.jsonl",
body=f.read()
)
# Reference in job by URN (workspace/fileset-name)
job = client.evaluation.metric_jobs.create(
spec={
"metric": "system/tool-call-accuracy",
"dataset": "my-workspace/agentic-eval-dataset",
"metric_params": {},
"params": {"parallelism": 16}
}
)
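If you assemble rows in Python, as in the inline examples earlier in this guide, here is a minimal sketch (standard library only) for writing the JSONL file that is uploaded above:

```python
import json

# Rows follow the same schema as the inline Tool Call Accuracy example above;
# JSONL means one JSON object per line.
rows = [
    {
        "user_input": [
            {"content": "What's the weather like in New York?", "type": "human"},
            {"content": "Let me check.", "type": "ai",
             "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
            {"content": "75°F, partly cloudy.", "type": "tool"},
            {"content": "It's 75°F and partly cloudy in New York.", "type": "ai"},
        ],
        "reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}],
    },
]

with open("evaluation_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```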
For more details on dataset management, see Managing Dataset Files.
Limitations#
Job-Only for System Metrics: RAGAS-based system metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) only support job-based evaluation. Only the Tool Calling template metric supports live evaluation.
Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) produce more consistent results.
RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.
Data Format Requirements: Each metric requires specific data fields. Ensure your dataset matches the expected schema; a minimal pre-flight check is sketched below.
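A minimal pre-flight check along these lines can catch missing fields before a job is submitted; the required-field mapping is assembled from the data formats in this guide and is not an official schema:

```python
# Illustrative pre-flight check; the required-field mapping below is assembled
# from the data formats in this guide and is not an official schema.
REQUIRED_FIELDS = {
    "system/tool-call-accuracy": {"user_input", "reference_tool_calls"},
    "system/topic-adherence": {"user_input", "reference_topics"},
    "system/agent-goal-accuracy": {"user_input"},  # plus "reference" when use_reference=True
    "system/answer-accuracy": {"user_input", "response", "reference"},
}

def check_rows(metric: str, rows: list[dict]) -> None:
    required = REQUIRED_FIELDS[metric]
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"Row {i} is missing fields for {metric}: {sorted(missing)}")

check_rows("system/answer-accuracy", [
    {"user_input": "What is the capital of France?", "response": "Paris", "reference": "Paris"},
])
```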
See also
Agentic Benchmarks - Pre-configured benchmarks with fixed datasets
LLM-as-a-Judge - Custom judge-based evaluation
Evaluation Results - Understanding and downloading results