RAGAS Live Evaluation#
This guide covers how to run RAGAS metrics using the live evaluation endpoint with inline metric definitions. Live evaluation provides immediate results without job polling, making it ideal for quick, interactive evaluations with small datasets (up to 10 rows).
Quick Start#
import os

from nemo_microservices import NeMoMicroservices

# Initialize client
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Run a live evaluation with tool_call_accuracy (no judge needed)
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What is 2+2?", "type": "human"},
                {"content": "Let me calculate.", "type": "ai",
                 "tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}]},
                {"content": "4", "type": "tool"},
                {"content": "2+2 equals 4.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}],
        }]
    },
)

# Print results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")
Prerequisites#
Creating a Secret for API Keys#
Most RAGAS metrics require a judge LLM. If you use an external endpoint (such as the NVIDIA API), create a secret for its API key first:
import os

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Create a secret for your API key
client.secrets.create(
    name="nvidia-api-key",
    data="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA API key for RAGAS metrics",
)
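The examples in this guide pass the same judge configuration repeatedly. As a convenience (not required by the API), you can define that configuration once and reuse it. A minimal sketch, assuming the secret created above and the NVIDIA API endpoint used throughout this guide:

# Reusable judge configuration, mirroring the examples in this guide
JUDGE_MODEL = {
    "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-70b-instruct",
    "api_key_secret": "nvidia-api-key",  # name of the secret created above
}

# Later, pass it wherever a judge is required, e.g.:
# metric={"type": "faithfulness", "judge_model": JUDGE_MODEL}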
Supported RAGAS Metrics#
Agentic Metrics#
Tool Call Accuracy#
Compares the AI’s tool calls against reference tool calls for an exact match. Does not require a judge LLM.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What's the weather in Paris?", "type": "human"},
                {"content": "Let me check.", "type": "ai",
                 "tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}]},
                {"content": "Sunny, 22°C", "type": "tool"},
                {"content": "It's sunny and 22°C in Paris.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
        }]
    },
)

Input Format:

{
    "user_input": [
        {"content": "...", "type": "human"},
        {"content": "...", "type": "ai", "tool_calls": [{"name": "...", "args": {...}}]},
        {"content": "...", "type": "tool"},
        {"content": "...", "type": "ai"}
    ],
    "reference_tool_calls": [{"name": "...", "args": {...}}]
}
Topic Adherence#
Measures whether the conversation stays on intended topics. Evaluated by the LLM judge.
Parameters:
metric_mode: "f1" (default), "recall", or "precision"
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "topic_adherence",
        "metric_mode": "f1",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Tell me about healthy eating", "type": "human"},
                {"content": "Eating fruits and vegetables is essential for good health.", "type": "ai"},
            ],
            "reference_topics": ["health", "nutrition", "diet"],
        }]
    },
)

Input Format:

{
    "user_input": [
        {"content": "...", "type": "human"},
        {"content": "...", "type": "ai"}
    ],
    "reference_topics": ["topic1", "topic2", "..."]
}
Agent Goal Accuracy#
Binary (0 or 1) metric evaluating whether the agent achieved the user’s goal.
Parameters:
use_reference: true (default) to evaluate against the reference; false to infer the outcome
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "agent_goal_accuracy",
        "use_reference": True,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Book a table at a restaurant for 8pm", "type": "human"},
                {"content": "I'll search for restaurants.", "type": "ai",
                 "tool_calls": [{"name": "restaurant_search", "args": {}}]},
                {"content": "Found: Italian Place", "type": "tool"},
                {"content": "Your table at Italian Place is booked for 8pm.", "type": "ai"},
            ],
            "reference": "Successfully booked a table at a restaurant for 8pm",
        }]
    },
)

Input Format:

{
    "user_input": [/* Multi-turn conversation with tool_calls */],
    "reference": "Expected outcome description"
}
NVIDIA Metrics#
Answer Accuracy#
Measures how well a model’s response matches a reference (ground truth) answer. Two LLM judges independently rate the agreement, and the scores are averaged. Scores range from 0 (incorrect) to 0.5 (partial match) to 1 (exact match).
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "answer_accuracy",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "response": "...",
    "reference": "..."
}
Context Relevance#
Judges rate how relevant the retrieved_contexts are to the user_input on a 0/1/2 scale, normalized to [0, 1].
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_relevance",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "retrieved_contexts": ["...", "..."]
}
Response Groundedness#
Evaluates whether the response is grounded in retrieved contexts on a 0/1/2 scale, normalized to [0,1].
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_groundedness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
    "response": "...",
    "retrieved_contexts": ["...", "..."]
}
RAG Metrics#
Faithfulness#
Measures how factually consistent a response is with the retrieved context. Scores range from 0 to 1, with higher scores indicating better consistency.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "faithfulness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "response": "...",
    "retrieved_contexts": ["...", "..."]
}
Context Recall#
Measures the fraction of relevant content retrieved compared to the total relevant content in the reference. Scores range from 0 to 1, with higher scores indicating better recall.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "retrieved_contexts": ["...", "..."],
    "reference": "..."
}
Context Precision#
Measures the proportion of relevant chunks in the retrieved contexts (precision@k). Scores range from 0 to 1, with higher scores indicating better precision.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_precision",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris",
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "retrieved_contexts": ["...", "..."],
    "reference": "..."
}
Context Entity Recall#
Measures the recall of entities in the retrieved contexts compared to entities in the reference. Scores range from 0 to 1, with higher scores indicating better entity coverage.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_entity_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
    "retrieved_contexts": ["...", "..."],
    "reference": "..."
}
Response Relevancy#
Measures how relevant the response is to the user's question using embedding-based cosine similarity. Scores range from 0 to 1, with higher scores indicating better relevance. Requires both a judge LLM and an embeddings model.
Parameters:
strictness: Number of parallel questions generated (default: 1; NIM only supports 1)
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_relevancy",
        "strictness": 1,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
        "embeddings_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/embeddings",
            "name": "nvidia/nv-embedqa-e5-v5",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital city of France."],
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "response": "...",
    "retrieved_contexts": ["..."]  // optional
}
Noise Sensitivity#
Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts. Scores range from 0 to 1, with lower scores indicating better robustness to noise.
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "noise_sensitivity",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
            "retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
        }]
    },
)

Input Format:

{
    "user_input": "...",
    "response": "...",
    "reference": "...",
    "retrieved_contexts": ["...", "..."]
}
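Every metric above goes through the same evaluate call, so scoring one small dataset against several metrics is just a loop over metric definitions. A minimal sketch, assuming the client from the Quick Start, the nvidia-api-key secret, and a row that carries all of the fields the chosen metrics need:

# Score the same rows with several judge-based metrics, one evaluate call per metric
judge = {
    "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-70b-instruct",
    "api_key_secret": "nvidia-api-key",
}
rows = [{
    "user_input": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
    "reference": "Paris is the capital of France.",
}]

for metric_type in ["faithfulness", "context_recall", "answer_accuracy"]:
    result = client.evaluation.metrics.evaluate(
        metric={"type": metric_type, "judge_model": judge},
        dataset={"rows": rows},
    )
    for score in result.aggregate_scores:
        print(f"{score.name}: mean={score.mean}")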
Response Format#
All live evaluation responses follow this structure:
{
    "metric": {
        "type": "faithfulness",
        "judge_model": {...}
    },
    "aggregate_scores": [
        {
            "name": "faithfulness",
            "count": 1,
            "mean": 0.95,
            "min": 0.95,
            "max": 0.95,
            "sum": 0.95
        }
    ],
    "row_scores": [
        {
            "index": 0,
            "row": {...},
            "scores": {"faithfulness": 0.95},
            "error": null
        }
    ]
}
Working with Results#
# Access aggregate scores
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.scores:
        print(f"Row {row.index}: {row.scores}")
    elif row.error:
        print(f"Row {row.index} failed: {row.error}")
Important Notes#
- Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use the job-based evaluation endpoints.
- Secret Management: API keys should be stored as secrets and referenced by name in api_key_secret. Never pass API keys directly in the request.
- Column Names: RAGAS metrics use specific column names: user_input (not question), response (not answer), retrieved_contexts (not contexts), and reference (not ground_truth). A renaming sketch follows this list.
- Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.
- Tool Call Accuracy: This is the only RAGAS metric that does not require a judge LLM.
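If your existing rows still use the older question/answer/contexts/ground_truth names, a small renaming step keeps them compatible. A minimal sketch; to_ragas_row and my_rows are hypothetical names, not part of the client API:

# Hypothetical helper: map legacy column names to the names RAGAS metrics expect
RENAME = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}

def to_ragas_row(row: dict) -> dict:
    """Return a copy of the row with legacy keys renamed; other keys pass through."""
    return {RENAME.get(key, key): value for key, value in row.items()}

rows = [to_ragas_row(r) for r in my_rows]  # my_rows: your existing dataset rows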
Metrics Summary#
| Metric | Type | Requires Judge | Requires Embeddings | Description |
|---|---|---|---|---|
| Tool Call Accuracy | tool_call_accuracy | ❌ | ❌ | Exact match of tool calls |
| Topic Adherence | topic_adherence | ✅ | ❌ | Conversation stays on topic |
| Agent Goal Accuracy | agent_goal_accuracy | ✅ | ❌ | Agent achieved user's goal |
| Answer Accuracy | answer_accuracy | ✅ | ❌ | Response matches reference |
| Context Relevance | context_relevance | ✅ | ❌ | Retrieved contexts are relevant |
| Response Groundedness | response_groundedness | ✅ | ❌ | Response grounded in context |
| Faithfulness | faithfulness | ✅ | ❌ | Response factually consistent |
| Context Recall | context_recall | ✅ | ❌ | Relevant content retrieved |
| Context Precision | context_precision | ✅ | ❌ | Retrieved chunks are relevant |
| Context Entity Recall | context_entity_recall | ✅ | ❌ | Entities correctly retrieved |
| Response Relevancy | response_relevancy | ✅ | ✅ | Response relevant to question |
| Noise Sensitivity | noise_sensitivity | ✅ | ❌ | Robustness to noisy context |