Agentic Evaluation Metrics#

Evaluate agent-based and multi-step reasoning models using metrics powered by RAGAS. These metrics assess tool calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.

Overview#

Agentic metrics evaluate different aspects of agent behavior:

| Metric | Use Case | Requires Judge | Evaluation Mode |
|---|---|---|---|
| Tool Call Accuracy | Evaluate tool/function call correctness | No | Job only |
| Topic Adherence | Measure topic focus in multi-turn conversations | Yes | Job only |
| Agent Goal Accuracy | Assess whether the agent completed its goal | Yes | Job only |
| Answer Accuracy | Check factual correctness of answers | Yes | Job only |
| Tool Calling (template) | Evaluate tool calls with custom templates | No | Live or Job |

Note

The RAGAS-based metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) are system metrics that only support job-based evaluation. The Tool Calling metric is a template-based metric that supports both live and job evaluation.

Prerequisites#

Before running agentic evaluations:

  1. Workspace: Create a workspace if you don't already have one. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge. Required for all metrics except Tool Call Accuracy and Tool Calling.

  3. API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key.

  4. Initialize the SDK:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

Tool Call Accuracy#

Evaluates whether the agent invoked the correct tools with the correct arguments. This is the only RAGAS agentic metric that does not require a judge LLM.

Data Format#

{
  "user_input": [
    {"content": "What's the weather like in New York?", "type": "human"},
    {"content": "Let me check that for you.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
    {"content": "It's 75°F and partly cloudy.", "type": "tool"},
    {"content": "The weather in New York is 75°F and partly cloudy.", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "weather_check", "args": {"location": "New York"}}
  ]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/tool-call-accuracy",
        "dataset": {
            "rows": [
                {
                    "user_input": [
                        {"content": "What's the weather like in New York?", "type": "human"},
                        {"content": "Let me check.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
                        {"content": "75°F, partly cloudy.", "type": "tool"},
                        {"content": "It's 75°F and partly cloudy in New York.", "type": "ai"}
                    ],
                    "reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]
                }
            ]
        },
        "metric_params": {},
        "params": {
            "parallelism": 16
        }
    }
)

print(f"Job created: {job.name}")

For job monitoring and results retrieval, see Job Management.

Result#

{
  "aggregate_scores": [
    {
      "name": "tool_call_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"tool_call_accuracy": 1.0}
    }
  ]
}

Topic Adherence#

Measures how well the agent maintained focus on assigned topics throughout a conversation. Uses F1, precision, or recall modes.
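
Precision and recall come from how RAGAS classifies the conversation turns against the reference topics (the classification itself is internal to the metric), and the f1 mode is their harmonic mean. A minimal illustration of that relationship, not the RAGAS implementation:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.9))  # ~0.847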

Data Format#

{
  "user_input": [
    {"content": "How do I stay healthy?", "type": "human"},
    {"content": "Eat more fruits and vegetables, and exercise regularly.", "type": "ai"}
  ],
  "reference_topics": ["health", "nutrition", "fitness"]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/topic-adherence",
        "metric_params": {
            "metric_mode": "f1",  # Options: "f1", "precision", "recall"
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/my-topic-dataset",  # Fileset reference
        "params": {
            "parallelism": 16
        }
    }
)

print(f"Job created: {job.name}")

Result#

{
  "aggregate_scores": [
    {
      "name": "topic_adherence(mode=f1)",
      "count": 1,
      "mean": 0.85,
      "min": 0.85,
      "max": 0.85
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"topic_adherence(mode=f1)": 0.85}
    }
  ]
}

Configuration Options#

| Parameter | Type | Default | Description |
|---|---|---|---|
| metric_mode | string | "f1" | Scoring mode: "f1", "precision", or "recall" |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |


Agent Goal Accuracy#

Evaluates whether the agent successfully completed the requested task. Supports evaluation with or without a reference outcome.

With Reference#

Compare the agent’s outcome against a known reference:

Data Format#

{
  "user_input": [
    {"content": "Book a table at a Chinese restaurant for 8pm", "type": "user"},
    {"content": "I'll find options for you.", "type": "assistant", "tool_calls": [{"name": "restaurant_search", "args": {"cuisine": "Chinese"}}]},
    {"content": "Found: Golden Dragon, Jade Palace", "type": "tool"},
    {"content": "I found Golden Dragon and Jade Palace. Which do you prefer?", "type": "assistant"},
    {"content": "Golden Dragon please", "type": "user"},
    {"content": "Booking now.", "type": "assistant", "tool_calls": [{"name": "restaurant_book", "args": {"name": "Golden Dragon", "time": "8:00pm"}}]},
    {"content": "Table booked at Golden Dragon for 8pm.", "type": "tool"},
    {"content": "Your table at Golden Dragon is booked for 8pm!", "type": "assistant"}
  ],
  "reference": "Table booked at a Chinese restaurant for 8pm"
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/agent-goal-accuracy",
        "metric_params": {
            "use_reference": True,
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/goal-accuracy-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "agent_goal_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"agent_goal_accuracy": 1.0}
    }
  ]
}

Without Reference#

The judge LLM infers the goal from the conversation context:

Data Format#

{
  "user_input": [
    {"content": "Set a reminder for my dentist appointment tomorrow at 2pm", "type": "user"},
    {"content": "I'll set that reminder for you.", "type": "assistant", "tool_calls": [{"name": "set_reminder", "args": {"title": "Dentist appointment", "date": "tomorrow", "time": "2pm"}}]},
    {"content": "Reminder set successfully.", "type": "tool"},
    {"content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.", "type": "assistant"}
  ]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/agent-goal-accuracy",
        "metric_params": {
            "use_reference": False,
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/goal-accuracy-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Configuration Options#

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_reference | boolean | True | Whether to compare against a reference outcome |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |


Answer Accuracy#

Evaluates the factual correctness of an agent’s answer by comparing it against a reference answer.

Data Format#

{"user_input": "What is the capital of France?", "response": "Paris", "reference": "Paris"}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/answer-accuracy",
        "metric_params": {
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/qa-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "answer_accuracy",
      "count": 2,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {"index": 0, "scores": {"answer_accuracy": 1.0}},
    {"index": 1, "scores": {"answer_accuracy": 1.0}}
  ]
}

Tool Calling (Template)#

A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, it uses configurable templates and produces multiple scores.

Scores Produced#

  • function_name_accuracy — Accuracy of function names only

  • function_name_and_args_accuracy — Accuracy of both function names and arguments

Data Format#

Data must use OpenAI-compliant tool calling format:

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
  ],
  "tool_calls": [
    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
  ]
}

Note

  • Function names with dots (.) must be replaced with underscores (_); see the normalization sketch after this note

  • Comparison is case-sensitive

  • Order of tool calls is ignored (supports parallel tool calling)
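
As a rough illustration (not part of the SDK), a small normalization pass like the one below can be applied to each row before upload so that function names satisfy the underscore requirement; the row structure matches the Data Format above.

def normalize_tool_call_names(row: dict) -> dict:
    # Replace dots with underscores in function names, in both the assistant
    # messages and the reference tool_calls, to satisfy the metric's naming rule.
    def fix(calls):
        for call in calls or []:
            call["function"]["name"] = call["function"]["name"].replace(".", "_")

    for message in row.get("messages", []):
        fix(message.get("tool_calls"))
    fix(row.get("tool_calls"))
    return row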

Tip

To return structured data (a list or dict) like tool_calls, use a single expression such as {{tool_calls}}. The evaluator preserves the original Python type.

Use | tojson only when you need the result as JSON text (a string), for example when embedding in a larger string template.

The live evaluation example below uses the plain {{tool_calls}} form:

result = client.evaluation.metrics.evaluate(
    dataset={
        "rows": [
            {
                "messages": [
                    {"role": "user", "content": "Book a table for 2 at 7pm."},
                    {"role": "assistant", "content": "Booking...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
                ],
                "tool_calls": [
                    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
                ]
            }
        ]
    },
    metric={
        "type": "tool-calling",
        "reference": "{{tool_calls}}"
    }
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean:.2f}")

For job-based evaluation, first create a metric entity, then reference it by URN:

# Step 1: Create the metric entity
client.evaluation.metrics.create(
    name="my-tool-calling-metric",
    type="tool-calling",
    reference="{{tool_calls}}"
)

# Step 2: Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "my-workspace/my-tool-calling-metric",
        "dataset": "my-workspace/tool-calling-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "function_name_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    },
    {
      "name": "function_name_and_args_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ]
}

Judge Configuration#

Most agentic metrics require a judge LLM. Configure the judge model inside metric_params:

"judge": {
    "model": {
        "url": "<judge-nim-url>/v1/chat/completions",
        "name": "meta/llama-3.1-70b-instruct",
        "api_key_secret": "my-judge-key"  # Optional: secret name for API key
    },
    "inference_params": {
        "temperature": 0.1,
        "max_tokens": 1024
    }
}

Important

Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.

Using Reasoning Models#

For judge models that support extended reasoning (such as nvidia/llama-3.3-nemotron-super-49b-v1), configure the system prompt and the reasoning end token:

"judge": {
    "model": {
        "url": "<judge-nim-url>/v1/chat/completions",
        "name": "nvidia/llama-3.3-nemotron-super-49b-v1"
    },
    "system_prompt": "'detailed thinking on'",
    "reasoning_params": {
        "end_token": "</think>"
    }
}

Managing Secrets for Authenticated Endpoints#

If your judge endpoint requires an API key, store it as a secret:

# Create the secret
client.secrets.create(
    name="judge-api-key",
    data="your-api-key-here"
)

# Reference in your metric_params
"judge": {
    "model": {
        "url": "https://api.example.com/v1/chat/completions",
        "name": "gpt-4",
        "format": "openai",  # Required for OpenAI-hosted models
        "api_key_secret": "judge-api-key"
    }
}
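
Putting the pieces together, a job that uses an authenticated judge might look like the following sketch; the endpoint URL, model name, and dataset URN are placeholders.

# Create the secret once per workspace, then reference it by name in metric_params.
client.secrets.create(
    name="judge-api-key",
    data="your-api-key-here"
)

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/answer-accuracy",
        "metric_params": {
            "judge": {
                "model": {
                    "url": "https://api.example.com/v1/chat/completions",
                    "name": "gpt-4",
                    "format": "openai",
                    "api_key_secret": "judge-api-key"
                }
            }
        },
        "dataset": "my-workspace/qa-dataset",
        "params": {"parallelism": 16}
    }
)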

For more details on secret management, see Managing Secrets.


Job Management#

After creating a job, see Metrics Job Management to monitor its progress and retrieve results.
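
As a rough sketch of the typical flow, you can poll the job until it reaches a terminal state; the retrieve method name and the status values used below are assumptions, so confirm them against the Metrics Job Management reference.

import time

def wait_for_metric_job(client, job_name: str, poll_seconds: int = 10):
    # Hypothetical polling loop: `retrieve` and `status` are placeholders for
    # the SDK's actual job-retrieval method and state field.
    while True:
        job = client.evaluation.metric_jobs.retrieve(job_name)
        if job.status in ("completed", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)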


Using Filesets for Large Datasets#

For datasets larger than 10 rows, upload your data to a fileset:

# Create a fileset
client.filesets.create(
    name="agentic-eval-dataset",
    description="Dataset for agentic evaluation"
)

# Upload data file (JSONL format)
with open("evaluation_data.jsonl", "rb") as f:
    client.filesets.upload_file(
        name="agentic-eval-dataset",
        path="data.jsonl",
        body=f.read()
    )

# Reference in job by URN (workspace/fileset-name)
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/tool-call-accuracy",
        "dataset": "my-workspace/agentic-eval-dataset",
        "metric_params": {},
        "params": {"parallelism": 16}
    }
)

For more details on dataset management, see Managing Dataset Files.


Limitations#

  1. Job-Only for System Metrics: RAGAS-based system metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) only support job-based evaluation. Only the Tool Calling template metric supports live evaluation.

  2. Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) produce more consistent results.

  3. RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.

  4. Data Format Requirements: Each metric requires specific data fields. Ensure your dataset matches the expected schema.
