RAGAS Live Evaluation#

This guide covers how to run RAGAS metrics using the live evaluation endpoint with inline metric definitions. Live evaluation provides immediate results without job polling, making it ideal for quick, interactive evaluations with small datasets (up to 10 rows).

Quick Start#

import os
from nemo_microservices import NeMoMicroservices

# Initialize client
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Run a live evaluation with tool_call_accuracy (no judge needed)
result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What is 2+2?", "type": "human"},
                {"content": "Let me calculate.", "type": "ai",
                 "tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}]},
                {"content": "4", "type": "tool"},
                {"content": "2+2 equals 4.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "calculator", "args": {"a": 2, "b": 2}}],
        }]
    },
)

# Print results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}")

Prerequisites#

Creating a Secret for API Keys#

Most RAGAS metrics require a judge LLM. If you use an external endpoint (such as the NVIDIA API), create a secret for the API key first:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Create a secret for your API key
client.secrets.create(
    name="nvidia-api-key",
    data="nvapi-YOUR_API_KEY_HERE",
    description="NVIDIA API key for RAGAS metrics"
)
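
The judge-based examples in this guide all pass the same judge configuration. In your own scripts, you may find it convenient to define that configuration once and reuse it. A minimal sketch (the JUDGE_MODEL name is illustrative; the endpoint, model, and secret values are the ones used throughout this guide):

# Reusable judge configuration matching the examples in this guide.
JUDGE_MODEL = {
    "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-70b-instruct",
    "api_key_secret": "nvidia-api-key",  # name of the secret created above
}

# Pass it anywhere a metric needs a judge, for example:
# metric={"type": "faithfulness", "judge_model": JUDGE_MODEL}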

Supported RAGAS Metrics#

Agentic Metrics#

Tool Call Accuracy#

Compares the AI’s tool calls against reference tool calls for an exact match. Does not require a judge LLM.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "tool_call_accuracy",
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "What's the weather in Paris?", "type": "human"},
                {"content": "Let me check.", "type": "ai",
                 "tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}]},
                {"content": "Sunny, 22°C", "type": "tool"},
                {"content": "It's sunny and 22°C in Paris.", "type": "ai"},
            ],
            "reference_tool_calls": [{"name": "weather_api", "args": {"city": "Paris"}}],
        }]
    },
)

Input Format:

{
  "user_input": [
    {"content": "...", "type": "human"},
    {"content": "...", "type": "ai", "tool_calls": [{"name": "...", "args": {...}}]},
    {"content": "...", "type": "tool"},
    {"content": "...", "type": "ai"}
  ],
  "reference_tool_calls": [{"name": "...", "args": {...}}]
}

Topic Adherence#

Measures whether the conversation stays on the intended topics (the reference_topics column). Requires a judge LLM.

Parameters:

  • metric_mode: "f1" (default), "recall", or "precision"

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "topic_adherence",
        "metric_mode": "f1",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Tell me about healthy eating", "type": "human"},
                {"content": "Eating fruits and vegetables is essential for good health.", "type": "ai"},
            ],
            "reference_topics": ["health", "nutrition", "diet"],
        }]
    },
)

Input Format:

{
  "user_input": [
    {"content": "...", "type": "human"},
    {"content": "...", "type": "ai"}
  ],
  "reference_topics": ["topic1", "topic2", "..."]
}

Agent Goal Accuracy#

Binary (0 or 1) metric evaluating whether the agent achieved the user’s goal.

Parameters:

  • use_reference: true (default) to evaluate against reference, false to infer outcome

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "agent_goal_accuracy",
        "use_reference": True,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": [
                {"content": "Book a table at a restaurant for 8pm", "type": "human"},
                {"content": "I'll search for restaurants.", "type": "ai",
                 "tool_calls": [{"name": "restaurant_search", "args": {}}]},
                {"content": "Found: Italian Place", "type": "tool"},
                {"content": "Your table at Italian Place is booked for 8pm.", "type": "ai"},
            ],
            "reference": "Successfully booked a table at a restaurant for 8pm",
        }]
    },
)

Input Format:

{
  "user_input": [/* Multi-turn conversation with tool_calls */],
  "reference": "Expected outcome description"
}

NVIDIA Metrics#

Answer Accuracy#

Measures how well a model’s response matches a reference (ground truth) answer. Two LLM judges independently rate the agreement, and the scores are averaged. Scores range from 0 (incorrect) to 0.5 (partial match) to 1 (exact match).

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "answer_accuracy",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "reference": "..."
}

Context Relevance#

Measures how relevant the retrieved contexts are to the user input. Judges rate relevance on a 0/1/2 scale, which is normalized to [0, 1].

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_relevance",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."]
}

Response Groundedness#

Evaluates whether the response is grounded in the retrieved contexts. Judges rate groundedness on a 0/1/2 scale, which is normalized to [0, 1].

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_groundedness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "response": "...",
  "retrieved_contexts": ["...", "..."]
}

RAG Metrics#

Faithfulness#

Measures how factually consistent a response is with the retrieved context. Scores range from 0 to 1, with higher scores indicating better consistency.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "faithfulness",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "retrieved_contexts": ["...", "..."]
}

Context Recall#

Measures the fraction of relevant content retrieved compared to the total relevant content in the reference. Scores range from 0 to 1, with higher scores indicating better recall.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Context Precision#

Measures the proportion of relevant chunks in the retrieved contexts (precision@k). Scores range from 0 to 1, with higher scores indicating better precision.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_precision",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris",
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Context Entity Recall#

Measures the recall of entities in the retrieved contexts compared to entities in the reference. Scores range from 0 to 1, with higher scores indicating better entity coverage.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "context_entity_recall",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "retrieved_contexts": ["Paris is the capital and largest city of France."],
            "reference": "Paris is the capital of France.",
        }]
    },
)

Input Format:

{
  "retrieved_contexts": ["...", "..."],
  "reference": "..."
}

Response Relevancy#

Measures how relevant the response is to the user’s question using embedding-based cosine similarity. Scores range from 0 to 1, with higher scores indicating better relevance. Requires both a judge LLM and an embeddings model.

Parameters:

  • strictness: Number of parallel questions generated (default: 1, NIM only supports 1)

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "response_relevancy",
        "strictness": 1,
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
        "embeddings_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/embeddings",
            "name": "nvidia/nv-embedqa-e5-v5",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": ["Paris is the capital city of France."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "retrieved_contexts": ["..."]  // optional
}

Noise Sensitivity#

Measures how sensitive the response is to irrelevant or noisy content in the retrieved contexts. Scores range from 0 to 1, with lower scores indicating better robustness to noise.

result = client.evaluation.metrics.evaluate(
    metric={
        "type": "noise_sensitivity",
        "judge_model": {
            "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-70b-instruct",
            "api_key_secret": "nvidia-api-key",
        },
    },
    dataset={
        "rows": [{
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "reference": "Paris",
            "retrieved_contexts": ["Paris is the capital of France.", "Irrelevant noise text."],
        }]
    },
)

Input Format:

{
  "user_input": "...",
  "response": "...",
  "reference": "...",
  "retrieved_contexts": ["...", "..."]
}

Response Format#

All live evaluation responses follow this structure:

{
  "metric": {
    "type": "faithfulness",
    "judge_model": {...}
  },
  "aggregate_scores": [
    {
      "name": "faithfulness",
      "count": 1,
      "mean": 0.95,
      "min": 0.95,
      "max": 0.95,
      "sum": 0.95
    }
  ],
  "row_scores": [
    {
      "index": 0,
      "row": {...},
      "scores": {"faithfulness": 0.95},
      "error": null
    }
  ]
}

Working with Results#

# Access aggregate scores
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

# Access per-row scores
for row in result.row_scores:
    if row.scores:
        print(f"Row {row.index}: {row.scores}")
    elif row.error:
        print(f"Row {row.index} failed: {row.error}")

Important Notes#

  1. Dataset Limit: Live evaluation supports up to 10 rows per request. For larger evaluations, use the job-based evaluation endpoints, or see the chunking sketch after this list for a simple way to stay on the live endpoint.

  2. Secret Management: API keys should be stored as secrets and referenced by name in api_key_secret. Never pass API keys directly in the request.

  3. Column Names: RAGAS metrics use specific column names:

    • user_input (not question)

    • response (not answer)

    • retrieved_contexts (not contexts)

    • reference (not ground_truth)

  4. Embeddings Model: Only response_relevancy requires an embeddings model. All other metrics use only the judge LLM.

  5. Tool Call Accuracy: This is the only RAGAS metric that does not require a judge LLM.
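
If a dataset only slightly exceeds the 10-row limit and you prefer to stay on the live endpoint rather than switch to job-based evaluation, one workaround is to split the rows into chunks of at most 10 and issue one request per chunk, aggregating the per-row scores yourself. A minimal sketch, assuming rows in the answer_accuracy format and the secret created earlier:

# Work around the 10-row live limit by chunking rows into batches of at most 10.
metric = {
    "type": "answer_accuracy",
    "judge_model": {
        "endpoint": "https://integrate.api.nvidia.com/v1/chat/completions",
        "name": "meta/llama-3.1-70b-instruct",
        "api_key_secret": "nvidia-api-key",
    },
}

# Illustrative rows; in practice this list would come from your own dataset.
all_rows = [
    {"user_input": "What is the capital of France?", "response": "Paris.", "reference": "Paris"},
    {"user_input": "What is 2+2?", "response": "2+2 equals 4.", "reference": "4"},
]

scores = []
for start in range(0, len(all_rows), 10):
    batch = all_rows[start:start + 10]
    result = client.evaluation.metrics.evaluate(metric=metric, dataset={"rows": batch})
    scores.extend(row.scores["answer_accuracy"] for row in result.row_scores if row.scores)

if scores:
    print(f"answer_accuracy mean over {len(scores)} rows: {sum(scores) / len(scores):.3f}")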

Metrics Summary#

| Metric | Type | Requires Judge | Requires Embeddings | Description |
|---|---|---|---|---|
| Tool Call Accuracy | tool_call_accuracy | No | No | Exact match of tool calls |
| Topic Adherence | topic_adherence | Yes | No | Conversation stays on topic |
| Agent Goal Accuracy | agent_goal_accuracy | Yes | No | Agent achieved the user’s goal |
| Answer Accuracy | answer_accuracy | Yes | No | Response matches reference |
| Context Relevance | context_relevance | Yes | No | Retrieved contexts are relevant |
| Response Groundedness | response_groundedness | Yes | No | Response grounded in context |
| Faithfulness | faithfulness | Yes | No | Response factually consistent |
| Context Recall | context_recall | Yes | No | Relevant content retrieved |
| Context Precision | context_precision | Yes | No | Retrieved chunks are relevant |
| Context Entity Recall | context_entity_recall | Yes | No | Entities correctly retrieved |
| Response Relevancy | response_relevancy | Yes | Yes | Response relevant to question |
| Noise Sensitivity | noise_sensitivity | Yes | No | Robustness to noisy context |