RAGAS Live Evaluation#

This guide covers how to run RAGAS metrics using the live evaluation endpoint with inline metric definitions. Live evaluation provides immediate results without job polling, making it ideal for quick, interactive evaluations with small datasets (up to 10 rows).

Quick Start#

import os
from nemo_microservices import NeMoMicroservices

# Initialize client
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Run a live evaluation with tool_call_accuracy (no judge needed)
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What is 2+2?", "type": "human"},
                {"content": "Let me calculate.", "type": "ai",
                 "tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}]},
                {"content": "4", "type": "tool"},
                {"content": "2+2 equals 4.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}],
        }]
    },
)

# Print results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

Prerequisites#

Creating a Secret for API Keys#

Most RAGAS metrics require a judge LLM. If you use an external endpoint (such as the NVIDIA API), create a secret for the API key first:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Create a secret for your API key
client.secrets.create(
    name="nvidia-api-key",
    data="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA API key for RAGAS metrics"
)
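
The judge-based examples in this guide all pass the same judge configuration. In your own scripts, you may find it convenient to define that configuration once and reuse it. A minimal sketch (the JUDGE_MODEL name is illustrative; the endpoint, model, and secret values are the ones used throughout this guide):

# Reusable judge configuration matching the examples in this guide.
JUDGE_MODEL = {
    "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-70b-instruct",
    "api_key_secret": "nvidia-api-key",  # name of the secret created above
}

# Pass it anywhere a metric needs a judge, for example:
# metric={"type": "faithfulness", "judge_model": JUDGE_MODEL}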

Supported RAGAS Metrics#

Agentic Metrics#

Tool Call Accuracy#

Compares the AI’s tool calls against reference tool calls for an exact match. Does not require a judge LLM.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What's the weather in Paris?", "type": "human"},
                {"content": "Let me check.", "type": "ai",
                 "tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}]},
                {"content": "Sunny, 22°C", "type": "tool"},
                {"content": "It's sunny and 22°C in Paris.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
        }]
    },
)

Input Format:

{
  "user_input": [
    {"content": "...", "type": "human"},
    {"content": "...", "type": "ai", "tool_calls": [{"name": "...", "args": {...}}]},
    {"content": "...", "type": "tool"},
    {"content": "...", "type": "ai"}
  ],
  "reference_tool_calls": [{"name": "...", "args": {...}}]
}

Topic Adherence#

Measures whether the conversation stays on the intended topics (the reference_topics column). Requires a judge LLM.

Parameters:

  • metric_mode: "f1" (default), "recall", or "precision"

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "topic_adherence",
        "metric_mode": "f1",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Tell me about healthy eating", "type": "human"},
                {"content": "Eating fruits and vegetables is essential for good health.", "type": "ai"},
            ],
            "reference_topics": ["health", "nutrition", "diet"],
        }]
    },
)

Input Format:

{
  "user_input": [
    {"content": "...", "type": "human"},
    {"content": "...", "type": "ai"}
  ],
  "reference_topics": ["topic1", "topic2", "..."]
}

Agent Goal Accuracy#

Binary (0 or 1) metric evaluating whether the agent achieved the user’s goal.

Parameters:

  • use_reference: true (default) to evaluate against reference, false to infer outcome

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "agent_goal_accuracy",
        "use_reference": True,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Book a table at a restaurant for 8pm", "type": "human"},
                {"content": "I'll search for restaurants.", "type": "ai",
                 "tool_calls": [{"name": "restaurant_search", "args": {}}]},
                {"content": "Found: Italian Place", "type": "tool"},
                {"content": "Your table at Italian Place is booked for 8pm.", "type": "ai"},
            ],
            "reference": "Successfully booked a table at a restaurant for 8pm",
        }]
    },
)

Input Format:

{
  "user_input": [/* Multi-turn conversation with tool_calls */],
  "reference": "Expected outcome description"
}

NVIDIA Metrics#

Answer Accuracy#

Measures how well a model’s response matches a reference (ground truth) answer. Two LLM judges independently rate the agreement, and the scores are averaged. Scores range from 0 (incorrect) to 0.5 (partial match) to 1 (exact match).

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "answer_accuracy",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "reference": "..."
}

Context Relevance#

Measures how relevant the retrieved contexts are to the user input. Judges rate relevance on a 0/1/2 scale, which is normalized to [0, 1].

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_relevance",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."]
}

Response Groundedness#

Evaluates whether the response is grounded in the retrieved contexts. Judges rate groundedness on a 0/1/2 scale, which is normalized to [0, 1].

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_groundedness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "response": "...",
  "retrieved_contexts": ["...", "..."]
}

RAG Metrics#

Faithfulness#

Measures how factually consistent a response is with the retrieved context. Scores range from 0 to 1, with higher scores indicating better consistency.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "faithfulness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "retrieved_contexts": ["...", "..."]
}

Context Recall#

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference. Scores range from 0 to 1, with higher scores indicating better recall.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Context Precision#

Measures the proportion of relevant chunks in the retrieved contexts (precision@k). Scores range from 0 to 1, with higher scores indicating better precision.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_precision",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Context Entity Recall#

Measures the recall of entities in the retrieved contexts compared to entities in the reference. Scores range from 0 to 1, with higher scores indicating better entity coverage.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_entity_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Response Relevancy#

Measures how relevant the response is to the user’s question using embedding-based cosine similarity. Scores range from 0 to 1, with higher scores indicating better relevance. Requires both a judge LLM and an embeddings model.

Parameters:

  • strictness: Number of parallel questions generated (default: 1, NIM only supports 1)

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_relevancy",
        "strictness": 1,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
        "embeddings_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/embeddings",
            "name": "nvidia/nv-embedqa-e5-v5",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "retrieved_contexts": ["..."]  // optional
}

Noise Sensitivity#

Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts. Scores range from 0 to 1, with lower scores indicating better robustness to noise.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "noise_sensitivity",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
            "retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "reference": "...",
  "retrieved_contexts": ["...", "..."]
}

Response Format#

All live evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {...}
  },
  "aggregate_scores": [
    {
      "name": "faithfulness",
      "count": 1,
      "mean": 0.95,
      "min": 0.95,
      "max": 0.95,
      "sum": 0.95
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "row": {...},
      "scores": {"faithfulness": 0.95},
      "error": null
    }
  ]
}

Working with Results#

# Access aggregate scores
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.scores:
        print(f"Row {row.index}: {row.scores}")
    elif row.error:
        print(f"Row {row.index} failed: {row.error}")

Important Notes#

  1. Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use the job-based evaluation endpoints, or see the chunking sketch after this list for a simple way to stay on the live endpoint.

  2. Secret Management: API keys should be stored as secrets and referenced by name in api_key_secret. Never pass API keys directly in the request.

  3. Column Names: RAGAS metrics use specific column names:

    • user_input (not question)

    • response (not answer)

    • retrieved_contexts (not contexts)

    • reference (not ground_truth)

  4. Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.

  5. Tool Call Accuracy: This is the only RAGAS metric that does not require a judge LLM.
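
If a dataset only slightly exceeds the 10-row limit and you prefer to stay on the live endpoint rather than switch to job-based evaluation, one workaround is to split the rows into chunks of at most 10 and issue one request per chunk, aggregating the per-row scores yourself. A minimal sketch, assuming rows in the answer_accuracy format and the secret created earlier:

# Work around the 10-row live limit by chunking rows into batches of at most 10.
metric = {
    "type": "answer_accuracy",
    "judge_model": {
        "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
        "name": "meta/llama-3.1-70b-instruct",
        "api_key_secret": "nvidia-api-key",
    },
}

# Illustrative rows; in practice this list would come from your own dataset.
all_rows = [
    {"user_input": "What is the capital of France?", "response": "Paris.", "reference": "Paris"},
    {"user_input": "What is 2+2?", "response": "2+2 equals 4.", "reference": "4"},
]

scores = []
for start in range(0, len(all_rows), 10):
    batch = all_rows[start:start + 10]
    result = client.evaluation.metrics.evaluate(metric=metric, dataset={"rows": batch})
    scores.extend(row.scores["answer_accuracy"] for row in result.row_scores if row.scores)

if scores:
    print(f"answer_accuracy mean over {len(scores)} rows: {sum(scores) / len(scores):.3f}")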

Metrics Summary#

| Metric | Type | Requires Judge | Requires Embeddings | Description |
|---|---|---|---|---|
| Tool Call Accuracy | tool_call_accuracy | No | No | Exact match of tool calls |
| Topic Adherence | topic_adherence | Yes | No | Conversation stays on topic |
| Agent Goal Accuracy | agent_goal_accuracy | Yes | No | Agent achieved the user’s goal |
| Answer Accuracy | answer_accuracy | Yes | No | Response matches reference |
| Context Relevance | context_relevance | Yes | No | Retrieved contexts are relevant |
| Response Groundedness | response_groundedness | Yes | No | Response grounded in context |
| Faithfulness | faithfulness | Yes | No | Response factually consistent |
| Context Recall | context_recall | Yes | No | Relevant content retrieved |
| Context Precision | context_precision | Yes | No | Retrieved chunks are relevant |
| Context Entity Recall | context_entity_recall | Yes | No | Entities correctly retrieved |
| Response Relevancy | response_relevancy | Yes | Yes | Response relevant to question |
| Noise Sensitivity | noise_sensitivity | Yes | No | Robustness to noisy context |