RAG Evaluation Metrics#

RAG (Retrieval-Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer-generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.

Overview#

RAG evaluation metrics require:

  • RAG Model: The LLM used to generate answers

  • Retriever Pipeline: Embedding model (and optional reranker) for document retrieval

  • Judge LLM: An LLM to evaluate answer quality

  • Judge Embeddings: Required only by certain metrics, such as system/rag-response-relevancy

All RAG metrics require a judge LLM for evaluation. Some metrics additionally require judge embeddings for semantic similarity calculations.

Prerequisites#

Before running RAG evaluations:

  1. Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Model Endpoints: Access to RAG model, embedding model, and judge LLM endpoints

  3. API Keys (if required): Create secrets for any endpoints requiring authentication

  4. Initialize the SDK:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

Supported Metrics#

RAG metrics are organized into four categories:

| Category | Metrics |
| --- | --- |
| Faithfulness | system/rag-faithfulness, system/rag-response-groundedness |
| Answer Quality | system/rag-answer-correctness*, system/rag-answer-relevancy*, system/rag-answer-similarity*, system/rag-answer-accuracy, system/rag-response-relevancy* |
| Context Quality | system/rag-context-recall, system/rag-context-precision, system/rag-context-relevance, system/rag-context-entity-recall |
| Robustness | system/rag-noise-sensitivity |

* Requires judge_embeddings in addition to judge_llm

See RAG Metrics Reference for detailed descriptions and requirements.
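
To score the same RAG pipeline with several metrics, you can create one job per metric URN. The sketch below reuses a shared spec; the endpoint URLs, model names, and secret names are placeholders, and the chosen metrics must have their required columns present in the dataset:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

# Metrics to evaluate; metrics marked with * also need judge_embeddings.
metrics_to_run = [
    "system/rag-faithfulness",
    "system/rag-context-recall",
    "system/rag-answer-accuracy",
]

base_spec = {
    "model": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "name": "meta/llama-3.1-8b-instruct",
        "api_key_secret": "model-api-key",
    },
    "retriever_pipeline": {
        "embedding_model": {
            "url": "https://integrate.api.nvidia.com/v1",
            "name": "nvidia/nv-embedqa-e5-v5",
            "format": "nim",
            "api_key_secret": "embedding-api-key",
        }
    },
    "dataset": "ragas/amnesty_qa",
    "metric_params": {
        "dataset_format": "ragas",
        "top_k": 10,
        "judge_llm": {
            "model": {
                "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                "name": "meta/llama-3.1-8b-instruct",
                "api_key_secret": "judge-api-key",
            }
        },
    },
}

for metric in metrics_to_run:
    # One asynchronous job per metric, all sharing the same pipeline and dataset.
    job = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={"metric": metric, **base_spec},
    )
    print(f"Created {metric}: {job.id}")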


Run Metric Job#

RAG metrics run as asynchronous jobs. You can specify the metric configuration inline or reference a stored metric.

Basic RAG Evaluation#

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-faithfulness",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": "ragas/amnesty_qa",
        "metric_params": {
            "dataset_format": "ragas",
            "top_k": 10,
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                },
                "request_timeout": 120,
                "max_retries": 3,
                "parallelism": 2,
                "inference_params": {
                    "max_tokens": 4000
                }
            }
        }
    }
)

print(f"Job created: {job.name} ({job.id})")

The RAGAS dataset format uses a columnar structure in which each field is a list:

{
    "question": [
        "When did the 2024 SF Taiwan Day take place?",
        "Where did the 2024 SF Taiwan Day take place?"
    ],
    "contexts": [
        ["The 2024 SF Taiwan Day was held on May 25th at the Oakland Coliseum."],
        ["The event featured cultural performances and food from Taiwan."]
    ],
    "ground_truth": [
        "May 25th",
        "Oakland Coliseum"
    ],
    "answer": [
        "The 2024 SF Taiwan Day took place on May 25th.",
        "The 2024 SF Taiwan Day took place at the Oakland Coliseum."
    ]
}

Scores range from 0.0 to 1.0. Example aggregate results:

{
    "aggregate_scores": [
        {
            "name": "faithfulness",
            "count": 2,
            "mean": 0.95,
            "min": 0.9,
            "max": 1.0
        }
    ]
}
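
Because metric jobs run asynchronously, aggregate results like these become available only after the job completes. The sketch below polls for completion; the retrieve method name and the status field are assumptions here, so refer to Metrics Job Management (linked below) for the actual monitoring workflow:

import time

# Assumption: the SDK exposes a retrieve() call and a terminal status value;
# adjust to the Job Management API used by your deployment.
while True:
    job = client.evaluation.metric_jobs.retrieve(job.id, workspace="default")
    if job.status in ("completed", "failed"):
        break
    time.sleep(30)

print(job.status)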

Note

If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).

RAG Evaluation with Reranker#

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-faithfulness",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            },
            "reranker_model": {
                "url": "https://integrate.api.nvidia.com/v1/ranking",
                "name": "nvidia/nv-rerankqa-mistral-4b-v3",
                "format": "nim",
                "api_key_secret": "reranker-api-key"
            }
        },
        "dataset": "ragas/amnesty_qa",
        "metric_params": {
            "dataset_format": "ragas",
            "top_k": 10,
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                },
                "request_timeout": 120,
                "max_retries": 3,
                "parallelism": 2
            }
        }
    }
)

RAG Evaluation with Judge Embeddings#

Some metrics, such as system/rag-response-relevancy, require both a judge LLM and judge embeddings:

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-response-relevancy",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": "ragas/amnesty_qa",
        "metric_params": {
            "dataset_format": "ragas",
            "top_k": 10,
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                },
                "request_timeout": 120,
                "max_retries": 3,
                "parallelism": 2
            },
            "judge_embeddings": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1",
                    "name": "nvidia/nv-embedqa-e5-v5",
                    "api_key_secret": "judge-embedding-api-key"
                }
            }
        }
    }
)

RAG Evaluation with Inline Dataset#

Test with a small inline dataset before running on large datasets:

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-faithfulness",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": {
            "rows": [
                {
                    "question": ["What is the capital of France?", "Who wrote Romeo and Juliet?"],
                    "contexts": [
                        ["Paris is the capital city of France."],
                        ["William Shakespeare wrote Romeo and Juliet in the 1590s."]
                    ],
                    "ground_truth": ["Paris", "William Shakespeare"],
                    "answer": [
                        "The capital of France is Paris.",
                        "Romeo and Juliet was written by Shakespeare."
                    ]
                }
            ]
        },
        "metric_params": {
            "dataset_format": "ragas",
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                }
            }
        }
    }
)

RAG Evaluation with HuggingFace Dataset#

Load datasets directly from HuggingFace:

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-faithfulness",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": {
            "storage": {
                "type": "huggingface",
                "repo_id": "NotYours/test_ragas_dataset",
                "repo_type": "dataset"
            },
            "path": "dataset.json"
        },
        "metric_params": {
            "dataset_format": "ragas",
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                }
            }
        }
    }
)

For a private dataset, first create a secret containing your HuggingFace token and reference it via token_secret:

# First create a secret for the HuggingFace token
client.secrets.create(
    name="hf-token",
    data="your-huggingface-token"
)

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/rag-faithfulness",
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "model-api-key"
        },
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": {
            "storage": {
                "type": "huggingface",
                "repo_id": "my-org/private-dataset",
                "repo_type": "dataset",
                "token_secret": "hf-token"
            },
            "path": "dataset.json"
        },
        "metric_params": {
            "dataset_format": "ragas",
            "judge_llm": {
                "model": {
                    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                    "name": "meta/llama-3.1-8b-instruct",
                    "api_key_secret": "judge-api-key"
                }
            }
        }
    }
)

RAG Metrics Reference#

| Use Case | Metric Name | Description | Required Columns |
| --- | --- | --- | --- |
| Detect hallucinations | system/rag-faithfulness | Measures factual consistency of generated answer with retrieved context | question, contexts, answer |
| | system/rag-response-groundedness | Evaluates whether response is grounded in context without hallucinations | contexts, answer |
| | system/rag-noise-sensitivity | Robustness to noisy or irrelevant context | question, contexts, answer, ground_truth |
| Validate answer accuracy | system/rag-answer-correctness* | Evaluates correctness against ground truth reference | question, answer, ground_truth |
| | system/rag-answer-accuracy | Factual accuracy based on context and ground truth | question, answer, ground_truth |
| Check if answers address the question | system/rag-answer-relevancy* | Measures how relevant the answer is to the question | question, answer |
| | system/rag-response-relevancy* | Response relevancy using embeddings similarity | question, answer |
| Measure semantic similarity | system/rag-answer-similarity* | Semantic similarity between answer and ground truth | answer, ground_truth |
| Measure retrieval quality | system/rag-context-recall | Coverage of ground truth information in retrieved context | question, contexts, ground_truth |
| | system/rag-context-precision | Whether all retrieved chunks are relevant to the question | question, contexts, ground_truth |
| | system/rag-context-relevance | Relevance of retrieved context to the question | question, contexts |
| | system/rag-context-entity-recall | Recall of important entities from ground truth in context | contexts, ground_truth |

* Requires judge_embeddings in addition to judge_llm

Required Columns: Dataset columns that must be present for the metric to be evaluated.
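
Because each metric needs only a subset of the RAGAS columns, it can help to check your dataset before submitting a job. A small sketch; the mapping below is transcribed from the reference table above, not fetched from an API:

# Required RAGAS columns per metric, transcribed from the reference table above.
REQUIRED_COLUMNS = {
    "system/rag-faithfulness": {"question", "contexts", "answer"},
    "system/rag-response-groundedness": {"contexts", "answer"},
    "system/rag-noise-sensitivity": {"question", "contexts", "answer", "ground_truth"},
    "system/rag-answer-correctness": {"question", "answer", "ground_truth"},
    "system/rag-answer-accuracy": {"question", "answer", "ground_truth"},
    "system/rag-answer-relevancy": {"question", "answer"},
    "system/rag-response-relevancy": {"question", "answer"},
    "system/rag-answer-similarity": {"answer", "ground_truth"},
    "system/rag-context-recall": {"question", "contexts", "ground_truth"},
    "system/rag-context-precision": {"question", "contexts", "ground_truth"},
    "system/rag-context-relevance": {"question", "contexts"},
    "system/rag-context-entity-recall": {"contexts", "ground_truth"},
}

def missing_columns(metric: str, dataset: dict) -> set[str]:
    """Return any columns the metric needs that the RAGAS-format dataset lacks."""
    return REQUIRED_COLUMNS[metric] - set(dataset.keys())

# Example: a dataset without ground_truth cannot be used for context recall.
print(missing_columns("system/rag-context-recall", {"question": [], "contexts": []}))
# -> {'ground_truth'}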


Metric Parameters#

Job Spec Parameters#

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| metric | string | Yes | Metric URN (e.g., system/rag-faithfulness) |
| model | object | Yes | RAG model configuration |
| retriever_pipeline | object | Yes | Retriever pipeline with embedding model |
| dataset | string/object | Yes | Dataset URN, inline rows, or HuggingFace config |
| metric_params | object | Yes | Metric-specific parameters |

Model Configuration#

{
    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-8b-instruct",
    "api_key_secret": "optional-model-api-key-ref"  # Optional: name of the secret containing the API key
}

Retriever Pipeline Configuration#

{
    "embedding_model": {
        "url": "https://integrate.api.nvidia.com/v1",
        "name": "nvidia/nv-embedqa-e5-v5",
        "format": "nim",
        "api_key_secret": "embedding-api-key"  # Optional: Name of secret containing API key
    },
    "reranker_model": {  # Optional
        "url": "https://integrate.api.nvidia.com/v1/ranking",
        "name": "nvidia/nv-rerankqa-mistral-4b-v3",
        "format": "nim",
        "api_key_secret": "reranker-api-key"  # Optional: Name of secret containing API key
    }
}

Metric Parameters (metric_params)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_format | string | "ragas" | Dataset format (ragas) |
| top_k | int | 10 | Number of top results to retrieve |
| truncate_long_documents | string | Omitted | Handle documents exceeding 65k characters. "start": keep last 65k chars, "end": keep first 65k chars |
| judge_llm | object | Required | Judge LLM configuration |
| judge_embeddings | object | Optional | Judge embeddings (required for some metrics) |
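
As an illustration, a metric_params block that combines the parameters above might look like the following; the values are examples, and truncate_long_documents is optional:

metric_params = {
    "dataset_format": "ragas",          # only the RAGAS format is supported
    "top_k": 10,                        # number of chunks to retrieve per question
    "truncate_long_documents": "end",   # keep the first 65k characters of long documents
    "judge_llm": {
        "model": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "name": "meta/llama-3.1-8b-instruct",
            "api_key_secret": "judge-api-key",
        }
    },
    # Only needed for metrics marked with * in the reference table.
    "judge_embeddings": {
        "model": {
            "url": "https://integrate.api.nvidia.com/v1",
            "name": "nvidia/nv-embedqa-e5-v5",
            "api_key_secret": "judge-embedding-api-key",
        }
    },
}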

Judge LLM Configuration#

{
    "model": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "name": "meta/llama-3.1-8b-instruct",
        "api_key_secret": "optional-judge-api-key-ref"  # Name of secret containing API key
    },
    "request_timeout": 120,      # Request timeout in seconds
    "max_retries": 3,            # Max retries for failed requests
    "parallelism": 2,            # Concurrent judge workers
    "inference_params": {
        "max_tokens": 4000,      # Max tokens for judge response
        "temperature": 0.1,       # Lower for consistent scoring
        "top_p": 0.9
    }
}

Judge Embeddings Configuration#

{
    "model": {
        "url": "https://integrate.api.nvidia.com/v1",
        "name": "nvidia/nv-embedqa-e5-v5",
        "api_key_secret": "judge-embedding-api-key"  # Optional: Name of secret containing API key
    }
}

Managing Secrets for Authenticated Endpoints#

Store API keys as secrets for secure authentication:

# Create secrets for all endpoints that may require authentication
client.secrets.create(name="model-api-key", data="<your-model-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="judge-embedding-api-key", data="<your-judge-embedding-key>")
client.secrets.create(name="reranker-api-key", data="<your-reranker-key>")

Reference secrets by name in your metric configuration:

"model": {
    "url": "https://integrate.api.nvidia.com/v1/chat/completions",
    "name": "meta/llama-3.1-8b-instruct",
    "api_key_secret": "optional-model-api-key-ref"  # Name of secret, not the actual API key
}

Dataset Format#

RAG metrics support the RAGAS dataset format.

RAGAS Format#

The RAGAS format uses a columnar structure in which each field is a list of values:

{
    "question": ["question #1", "question #2"],
    "contexts": [
        ["context #1 for Q1", "context #2 for Q1"],
        ["context #1 for Q2"]
    ],
    "answer": ["answer for Q1", "answer for Q2"],
    "ground_truth": ["ground truth for Q1", "ground truth for Q2"]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| question | list[string] | Yes | List of questions |
| contexts | list[list[string]] | Some metrics | List of context passages per question |
| answer | list[string] | Some metrics | List of generated answers |
| ground_truth | list[string] | Some metrics | List of reference answers |

Note

Different metrics require different columns. Check the metric documentation for specific requirements.
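
Since all fields are parallel lists indexed by question, a quick structural check can catch formatting mistakes before you upload or inline the data. A minimal sketch, assuming the dataset is already loaded as a Python dict:

def check_ragas_dataset(data: dict) -> None:
    """Sanity-check a RAGAS-format (columnar) dataset dict."""
    n = len(data["question"])
    for field in ("contexts", "answer", "ground_truth"):
        if field in data and len(data[field]) != n:
            raise ValueError(f"{field} has {len(data[field])} entries, expected {n}")
    # contexts must be a list of lists: one list of passages per question
    for i, ctx in enumerate(data.get("contexts", [])):
        if not isinstance(ctx, list):
            raise ValueError(f"contexts[{i}] must be a list of strings")

check_ragas_dataset({
    "question": ["q1", "q2"],
    "contexts": [["passage for q1"], ["passage for q2"]],
    "answer": ["a1", "a2"],
    "ground_truth": ["g1", "g2"],
})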

Built-in Datasets#

The platform provides built-in RAGAS datasets that can be referenced by name:

| Dataset | Description | Use Case |
| --- | --- | --- |
| ragas/amnesty_qa | Amnesty International Q&A dataset | General RAG evaluation |

Usage:

# Reference built-in dataset by name
"dataset": "ragas/amnesty_qa"

You can also use custom datasets via the following options (see the sketch after this list):

  • FilesetUrn: Upload to Files API and reference as workspace/fileset-name/filename.json

  • Inline Dataset: Embed data directly in the API request

  • HuggingFace: Reference public or private HuggingFace datasets
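
Each option corresponds to a different value for the dataset field in the job spec. A sketch of the three forms, using placeholder names (my-fileset, my-org/private-dataset):

# Fileset uploaded via the Files API, referenced as a URN string
dataset_fileset = "default/my-fileset/dataset.json"

# Inline dataset embedded directly in the request (RAGAS columnar rows)
dataset_inline = {
    "rows": [
        {
            "question": ["What is the capital of France?"],
            "contexts": [["Paris is the capital city of France."]],
            "ground_truth": ["Paris"],
            "answer": ["The capital of France is Paris."],
        }
    ]
}

# HuggingFace dataset; token_secret is only needed for private repos
dataset_huggingface = {
    "storage": {
        "type": "huggingface",
        "repo_id": "my-org/private-dataset",
        "repo_type": "dataset",
        "token_secret": "hf-token",
    },
    "path": "dataset.json",
}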


Job Management#

After creating a job, see Metrics Job Management to monitor its progress and manage its execution.


Troubleshooting#

Common Errors#

| Error | Cause | Solution |
| --- | --- | --- |
| judge_llm is required | Missing judge LLM config for metric | Add judge_llm to metric_params |
| judge_embeddings is required | Using system/rag-response-relevancy without embeddings | Add judge_embeddings to metric_params |
| Job stuck in “pending” | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check secret names match exactly |
| Low faithfulness scores | Context doesn’t support the answer | Increase top_k or improve retrieval |

Tips for Better Results#

  • Use larger judge models (70B+) for more consistent scoring

  • Start with inline datasets to test your configuration before large evaluations

  • Set appropriate timeouts - judge LLM calls can take time with large contexts

  • Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits
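
For example, a judge_llm configuration following these tips might look like the block below; the model name and limits are illustrative, so pick a judge and settings appropriate to your endpoint and its rate limits:

judge_llm = {
    "model": {
        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
        "name": "meta/llama-3.1-70b-instruct",  # larger judge for more consistent scoring
        "api_key_secret": "judge-api-key",
    },
    "request_timeout": 300,   # generous timeout for long contexts
    "max_retries": 3,
    "parallelism": 4,         # raise for speed, but stay within endpoint rate limits
    "inference_params": {
        "max_tokens": 4000,
        "temperature": 0.1,   # low temperature for consistent scoring
    },
}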


Limitations#

  1. Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.

  2. Dataset Format: RAG metrics currently only support the RAGAS dataset format. Ensure your data matches the columnar structure.

  3. Embedding Dimensions: Ensure your embedding model dimensions are compatible with the configured vector store.
