Agentic Evaluation Metrics#

Evaluate agent-based and multi-step reasoning models using metrics powered by RAGAS. These metrics assess tool calling accuracy, goal completion, topic adherence, and answer correctness in agentic workflows.

Overview#

Agentic metrics evaluate different aspects of agent behavior:

| Metric | Use Case | Requires Judge | Evaluation Mode |
|---|---|---|---|
| Tool Call Accuracy | Evaluate tool/function call correctness | No | Job only |
| Topic Adherence | Measure topic focus in multi-turn conversations | Yes | Job only |
| Agent Goal Accuracy | Assess whether the agent completed its goal | Yes | Job only |
| Answer Accuracy | Check factual correctness of answers | Yes | Job only |
| Tool Calling (template) | Evaluate tool calls with custom templates | No | Live or Job |

Note

The RAGAS-based metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) are system metrics that only support job-based evaluation. The Tool Calling metric is a template-based metric that supports both live and job evaluation.

Prerequisites#

Before running agentic evaluations:

  1. Workspace: Create a workspace if you don't already have one. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Judge LLM endpoint (for most metrics): Have access to an LLM that will serve as your judge. Required for all metrics except Tool Call Accuracy and Tool Calling.

  3. API key secret (if judge requires auth): If your judge endpoint requires authentication, create a secret to store the API key.

  4. Initialize the SDK:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

Tool Call Accuracy#

Evaluates whether the agent invoked the correct tools with the correct arguments. This is the only RAGAS agentic metric that does not require a judge LLM.

Data Format#

{
  "user_input": [
    {"content": "What's the weather like in New York?", "type": "human"},
    {"content": "Let me check that for you.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
    {"content": "It's 75°F and partly cloudy.", "type": "tool"},
    {"content": "The weather in New York is 75°F and partly cloudy.", "type": "ai"}
  ],
  "reference_tool_calls": [
    {"name": "weather_check", "args": {"location": "New York"}}
  ]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/tool-call-accuracy",
        "dataset": {
            "rows": [
                {
                    "user_input": [
                        {"content": "What's the weather like in New York?", "type": "human"},
                        {"content": "Let me check.", "type": "ai", "tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]},
                        {"content": "75°F, partly cloudy.", "type": "tool"},
                        {"content": "It's 75°F and partly cloudy in New York.", "type": "ai"}
                    ],
                    "reference_tool_calls": [{"name": "weather_check", "args": {"location": "New York"}}]
                }
            ]
        },
        "metric_params": {},
        "params": {
            "parallelism": 16
        }
    }
)

print(f"Job created: {job.name}")

For job monitoring and results retrieval, see Job Management.

Result#

{
  "aggregate_scores": [
    {
      "name": "tool_call_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"tool_call_accuracy": 1.0}
    }
  ]
}

Topic Adherence#

Measures how well the agent maintained focus on assigned topics throughout a conversation. Uses F1, precision, or recall modes.
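
Precision and recall come from how RAGAS classifies the conversation turns against the reference topics (the classification itself is internal to the metric), and the f1 mode is their harmonic mean. A minimal illustration of that relationship, not the RAGAS implementation:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.9))  # ~0.847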

Data Format#

{
  "user_input": [
    {"content": "How do I stay healthy?", "type": "human"},
    {"content": "Eat more fruits and vegetables, and exercise regularly.", "type": "ai"}
  ],
  "reference_topics": ["health", "nutrition", "fitness"]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/topic-adherence",
        "metric_params": {
            "metric_mode": "f1",  # Options: "f1", "precision", "recall"
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/my-topic-dataset",  # Fileset reference
        "params": {
            "parallelism": 16
        }
    }
)

print(f"Job created: {job.name}")

Result#

{
  "aggregate_scores": [
    {
      "name": "topic_adherence(mode=f1)",
      "count": 1,
      "mean": 0.85,
      "min": 0.85,
      "max": 0.85
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"topic_adherence(mode=f1)": 0.85}
    }
  ]
}

Configuration Options#

| Parameter | Type | Default | Description |
|---|---|---|---|
| metric_mode | string | "f1" | Scoring mode: "f1", "precision", or "recall" |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |


Agent Goal Accuracy#

Evaluates whether the agent successfully completed the requested task. Supports evaluation with or without a reference outcome.

With Reference#

Compare the agent’s outcome against a known reference:

Data Format#

{
  "user_input": [
    {"content": "Book a table at a Chinese restaurant for 8pm", "type": "user"},
    {"content": "I'll find options for you.", "type": "assistant", "tool_calls": [{"name": "restaurant_search", "args": {"cuisine": "Chinese"}}]},
    {"content": "Found: Golden Dragon, Jade Palace", "type": "tool"},
    {"content": "I found Golden Dragon and Jade Palace. Which do you prefer?", "type": "assistant"},
    {"content": "Golden Dragon please", "type": "user"},
    {"content": "Booking now.", "type": "assistant", "tool_calls": [{"name": "restaurant_book", "args": {"name": "Golden Dragon", "time": "8:00pm"}}]},
    {"content": "Table booked at Golden Dragon for 8pm.", "type": "tool"},
    {"content": "Your table at Golden Dragon is booked for 8pm!", "type": "assistant"}
  ],
  "reference": "Table booked at a Chinese restaurant for 8pm"
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/agent-goal-accuracy",
        "metric_params": {
            "use_reference": True,
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/goal-accuracy-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "agent_goal_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "scores": {"agent_goal_accuracy": 1.0}
    }
  ]
}

Without Reference#

The judge LLM infers the goal from the conversation context:

Data Format#

{
  "user_input": [
    {"content": "Set a reminder for my dentist appointment tomorrow at 2pm", "type": "user"},
    {"content": "I'll set that reminder for you.", "type": "assistant", "tool_calls": [{"name": "set_reminder", "args": {"title": "Dentist appointment", "date": "tomorrow", "time": "2pm"}}]},
    {"content": "Reminder set successfully.", "type": "tool"},
    {"content": "Your reminder for the dentist appointment tomorrow at 2pm has been set.", "type": "assistant"}
  ]
}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/agent-goal-accuracy",
        "metric_params": {
            "use_reference": False,
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/goal-accuracy-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Configuration Options#

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_reference | boolean | True | Whether to compare against a reference outcome |
| judge | object | required | Judge LLM configuration (see Judge Configuration) |


Answer Accuracy#

Evaluates the factual correctness of an agent’s answer by comparing it against a reference answer.

Data Format#

{"user_input": "What is the capital of France?", "response": "Paris", "reference": "Paris"}

Example#

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/answer-accuracy",
        "metric_params": {
            "judge": {
                "model": {
                    "url": "<judge-nim-url>/v1/chat/completions",
                    "name": "meta/llama-3.1-70b-instruct"
                }
            }
        },
        "dataset": "my-workspace/qa-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "answer_accuracy",
      "count": 2,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ],
  "row_scores": [
    {"index": 0, "scores": {"answer_accuracy": 1.0}},
    {"index": 1, "scores": {"answer_accuracy": 1.0}}
  ]
}

Tool Calling (Template)#

A template-based metric for evaluating tool/function call accuracy. Unlike the RAGAS Tool Call Accuracy metric, it uses configurable templates and produces multiple scores.

Scores Produced#

  • function_name_accuracy — Accuracy of function names only

  • function_name_and_args_accuracy — Accuracy of both function names and arguments

Data Format#

Data must use OpenAI-compliant tool calling format:

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
  ],
  "tool_calls": [
    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
  ]
}

Note

  • Function names with dots (.) must be replaced with underscores (_); see the normalization sketch after this note

  • Comparison is case-sensitive

  • Order of tool calls is ignored (supports parallel tool calling)
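
As a rough illustration (not part of the SDK), a small normalization pass like the one below can be applied to each row before upload so that function names satisfy the underscore requirement; the row structure matches the Data Format above.

def normalize_tool_call_names(row: dict) -> dict:
    # Replace dots with underscores in function names, in both the assistant
    # messages and the reference tool_calls, to satisfy the metric's naming rule.
    def fix(calls):
        for call in calls or []:
            call["function"]["name"] = call["function"]["name"].replace(".", "_")

    for message in row.get("messages", []):
        fix(message.get("tool_calls"))
    fix(row.get("tool_calls"))
    return row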

Tip

To return structured data (a list or dict) like tool_calls, use a single expression such as {{tool_calls}}. The evaluator preserves the original Python type.

Use | tojson only when you need the result as JSON text (a string), for example when embedding in a larger string template.

The live evaluation example below uses the plain {{tool_calls}} form:

result = client.evaluation.metrics.evaluate(
    dataset={
        "rows": [
            {
                "messages": [
                    {"role": "user", "content": "Book a table for 2 at 7pm."},
                    {"role": "assistant", "content": "Booking...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
                ],
                "tool_calls": [
                    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
                ]
            }
        ]
    },
    metric={
        "type": "tool-calling",
        "reference": "{{tool_calls}}"
    }
)

for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean:.2f}")

For job-based evaluation, first create a metric entity, then reference it by URN:

# Step 1: Create the metric entity
client.evaluation.metrics.create(
    name="my-tool-calling-metric",
    type="tool-calling",
    reference="{{tool_calls}}"
)

# Step 2: Create the evaluation job
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "my-workspace/my-tool-calling-metric",
        "dataset": "my-workspace/tool-calling-dataset",
        "params": {
            "parallelism": 16
        }
    }
)

Result#

{
  "aggregate_scores": [
    {
      "name": "function_name_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    },
    {
      "name": "function_name_and_args_accuracy",
      "count": 1,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0
    }
  ]
}

Judge Configuration#

Most agentic metrics require a judge LLM. Configure the judge model inside metric_params:

"judge": {
    "model": {
        "url": "<judge-nim-url>/v1/chat/completions",
        "name": "meta/llama-3.1-70b-instruct",
        "api_key_secret": "my-judge-key"  # Optional: secret name for API key
    },
    "inference_params": {
        "temperature": 0.1,
        "max_tokens": 1024
    }
}

Important

Recommended model size: Use a 70B+ parameter model as the judge for reliable results. Smaller models may fail to follow the required output schema, causing parsing errors.

Using Reasoning Models#

For judge models that support extended reasoning (such as nvidia/llama-3.3-nemotron-super-49b-v1), configure the system prompt and the reasoning end token:

"judge": {
    "model": {
        "url": "<judge-nim-url>/v1/chat/completions",
        "name": "nvidia/llama-3.3-nemotron-super-49b-v1"
    },
    "system_prompt": "'detailed thinking on'",
    "reasoning_params": {
        "end_token": "</think>"
    }
}

Managing Secrets for Authenticated Endpoints#

If your judge endpoint requires an API key, store it as a secret:

# Create the secret
client.secrets.create(
    name="judge-api-key",
    data="your-api-key-here"
)

# Reference in your metric_params
"judge": {
    "model": {
        "url": "https://api.example.com/v1/chat/completions",
        "name": "gpt-4",
        "format": "openai",  # Required for OpenAI-hosted models
        "api_key_secret": "judge-api-key"
    }
}
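
Putting the pieces together, a job that uses an authenticated judge might look like the following sketch; the endpoint URL, model name, and dataset URN are placeholders.

# Create the secret once per workspace, then reference it by name in metric_params.
client.secrets.create(
    name="judge-api-key",
    data="your-api-key-here"
)

job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/answer-accuracy",
        "metric_params": {
            "judge": {
                "model": {
                    "url": "https://api.example.com/v1/chat/completions",
                    "name": "gpt-4",
                    "format": "openai",
                    "api_key_secret": "judge-api-key"
                }
            }
        },
        "dataset": "my-workspace/qa-dataset",
        "params": {"parallelism": 16}
    }
)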

For more details on secret management, see Managing Secrets.


Job Management#

After creating a job, see Metrics Job Management to monitor its progress and retrieve results.
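
As a rough sketch of the typical flow, you can poll the job until it reaches a terminal state; the retrieve method name and the status values used below are assumptions, so confirm them against the Metrics Job Management reference.

import time

def wait_for_metric_job(client, job_name: str, poll_seconds: int = 10):
    # Hypothetical polling loop: `retrieve` and `status` are placeholders for
    # the SDK's actual job-retrieval method and state field.
    while True:
        job = client.evaluation.metric_jobs.retrieve(job_name)
        if job.status in ("completed", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)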


Using Filesets for Large Datasets#

For datasets larger than 10 rows, upload your data to a fileset:

# Create a fileset
client.filesets.create(
    name="agentic-eval-dataset",
    description="Dataset for agentic evaluation"
)

# Upload data file (JSONL format)
with open("evaluation_data.jsonl", "rb") as f:
    client.filesets.upload_file(
        name="agentic-eval-dataset",
        path="data.jsonl",
        body=f.read()
    )

# Reference in job by URN (workspace/fileset-name)
job = client.evaluation.metric_jobs.create(
    spec={
        "metric": "system/tool-call-accuracy",
        "dataset": "my-workspace/agentic-eval-dataset",
        "metric_params": {},
        "params": {"parallelism": 16}
    }
)

For more details on dataset management, see Managing Dataset Files.


Limitations#

  1. Job-Only for System Metrics: RAGAS-based system metrics (Tool Call Accuracy, Topic Adherence, Agent Goal Accuracy, Answer Accuracy) only support job-based evaluation. Only the Tool Calling template metric supports live evaluation.

  2. Judge Model Quality: For metrics requiring a judge, evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) produce more consistent results.

  3. RAGAS Dependency: These metrics are powered by RAGAS and may have version-specific behavior.

  4. Data Format Requirements: Each metric requires specific data fields. Ensure your dataset matches the expected schema.
