RAG Evaluation Metrics#
RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview#
RAG evaluation metrics require:
RAG Model: The LLM used to generate answers
Retriever Pipeline: Embedding model (and optional reranker) for document retrieval
Judge LLM: An LLM to evaluate answer quality
Judge Embeddings (optional): Required for some metrics, such as system/rag-response-relevancy
All RAG metrics require a judge LLM for evaluation. Some metrics additionally require judge embeddings for semantic similarity calculations.
Prerequisites#
Before running RAG evaluations:
Workspace: An existing workspace. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to RAG model, embedding model, and judge LLM endpoints
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Supported Metrics#
RAG metrics are organized into four categories:
| Category | Metrics |
|---|---|
| Faithfulness | system/rag-faithfulness |
| Answer Quality | system/rag-response-relevancy* |
| Context Quality |  |
| Robustness |  |

* Requires judge_embeddings in addition to judge_llm
See RAG Metrics Reference for detailed descriptions and requirements.
Run Metric Job#
RAG metrics run as asynchronous jobs. You can specify the metric configuration inline or reference a stored metric.
Basic RAG Evaluation#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2,
"inference_params": {
"max_tokens": 4000
}
}
}
}
)
print(f"Job created: {job.name} ({job.id})")
The RAGAS dataset format uses a columnar structure where each field is a list:
{
"question": [
"When did the 2024 SF Taiwan Day take place?",
"Where did the 2024 SF Taiwan Day take place?"
],
"contexts": [
["The 2024 SF Taiwan Day was held on May 25th at the Oakland Coliseum."],
["The event featured cultural performances and food from Taiwan."]
],
"ground_truth": [
"May 25th",
"Oakland Coliseum"
],
"answer": [
"The 2024 SF Taiwan Day took place on May 25th.",
"The 2024 SF Taiwan Day took place at the Oakland Coliseum."
]
}
Scores range from 0.0 to 1.0. Example aggregate results:
{
"aggregate_scores": [
{
"name": "faithfulness",
"count": 2,
"mean": 0.95,
"min": 0.9,
"max": 1.0
}
]
}
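Once the job finishes, a simple way to sanity-check the output is to scan aggregate_scores and flag any metric whose mean falls below a threshold you care about. This is a minimal sketch over the result structure shown above; the 0.8 threshold is an arbitrary example.

# Flag metrics whose mean score falls below a chosen threshold (0.8 is an arbitrary example)
results = {
    "aggregate_scores": [
        {"name": "faithfulness", "count": 2, "mean": 0.95, "min": 0.9, "max": 1.0}
    ]
}
THRESHOLD = 0.8
for score in results["aggregate_scores"]:
    status = "OK" if score["mean"] >= THRESHOLD else "NEEDS REVIEW"
    print(f"{score['name']}: mean={score['mean']:.2f} over {score['count']} samples [{status}]")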
Note
If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).
RAG Evaluation with Reranker#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
},
"reranker_model": {
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2
}
}
}
)
RAG Evaluation with Judge Embeddings#
Some metrics like system/rag-response-relevancy require both judge LLM and judge embeddings:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-response-relevancy",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2
},
"judge_embeddings": {
"model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"api_key_secret": "judge-embedding-api-key"
}
}
}
}
)
RAG Evaluation with Inline Dataset#
Test with a small inline dataset before running on large datasets:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"rows": [
{
"question": ["What is the capital of France?", "Who wrote Romeo and Juliet?"],
"contexts": [
["Paris is the capital city of France."],
["William Shakespeare wrote Romeo and Juliet in the 1590s."]
],
"ground_truth": ["Paris", "William Shakespeare"],
"answer": [
"The capital of France is Paris.",
"Romeo and Juliet was written by Shakespeare."
]
}
]
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
RAG Evaluation with HuggingFace Dataset#
Load datasets directly from HuggingFace:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"storage": {
"type": "huggingface",
"repo_id": "NotYours/test_ragas_dataset",
"repo_type": "dataset"
},
"path": "dataset.json"
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
For private HuggingFace datasets, store your access token as a secret and reference it via token_secret in the storage configuration:
# First create a secret for the HuggingFace token
client.secrets.create(
name="hf-token",
data="your-huggingface-token"
)
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"storage": {
"type": "huggingface",
"repo_id": "my-org/private-dataset",
"repo_type": "dataset",
"token_secret": "hf-token"
},
"path": "dataset.json"
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
RAG Metrics Reference#
| Use Case | Metric Name | Description | Required Columns |
|---|---|---|---|
| Detect hallucinations | system/rag-faithfulness | Measures factual consistency of generated answer with retrieved context | question, contexts, answer |
|  |  | Evaluates whether response is grounded in context without hallucinations | contexts, answer |
|  |  | Robustness to noisy or irrelevant context | question, contexts, answer, ground_truth |
| Validate answer accuracy |  | Evaluates correctness against ground truth reference | question, answer, ground_truth |
|  |  | Factual accuracy based on context and ground truth | question, answer, ground_truth |
| Check if answers address the question |  | Measures how relevant the answer is to the question | question, answer |
|  | system/rag-response-relevancy* | Response relevancy using embeddings similarity | question, answer |
| Measure semantic similarity |  | Semantic similarity between answer and ground truth | answer, ground_truth |
| Measure retrieval quality |  | Coverage of ground truth information in retrieved context | question, contexts, ground_truth |
|  |  | Whether all retrieved chunks are relevant to the question | question, contexts, ground_truth |
|  |  | Relevance of retrieved context to the question | question, contexts |
|  |  | Recall of important entities from ground truth in context | contexts, ground_truth |

* Requires judge_embeddings in addition to judge_llm
Required Columns: Dataset columns that must be present for the metric to be evaluated.
Metric Parameters#
Job Spec Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| metric | string | Yes | Metric URN (e.g., system/rag-faithfulness) |
| model | object | Yes | RAG model configuration |
| retriever_pipeline | object | Yes | Retriever pipeline with embedding model |
| dataset | string/object | Yes | Dataset URN, inline rows, or HuggingFace config |
| metric_params | object | Yes | Metric-specific parameters |
Model Configuration#
{
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-model-api-key-ref" # Optional: name of the secret containing the API key
}
Retriever Pipeline Configuration#
{
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key" # Optional: Name of secret containing API key
},
"reranker_model": { # Optional
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key" # Optional: Name of secret containing API key
}
}
Metric Parameters (metric_params)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_format | string | ragas | Dataset format (ragas) |
| top_k | int | 10 | Number of top results to retrieve |
|  | string | Omitted | Handle documents exceeding 65k characters |
| judge_llm | object | Required | Judge LLM configuration |
| judge_embeddings | object | Optional | Judge embeddings (required for some metrics) |
Judge LLM Configuration#
{
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-judge-api-key-ref" # Name of secret containing API key
},
"request_timeout": 120, # Request timeout in seconds
"max_retries": 3, # Max retries for failed requests
"parallelism": 2, # Concurrent judge workers
"inference_params": {
"max_tokens": 4000, # Max tokens for judge response
"temperature": 0.1, # Lower for consistent scoring
"top_p": 0.9
}
}
Judge Embeddings Configuration#
{
"model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"api_key_secret": "judge-embedding-api-key" # Optional: Name of secret containing API key
}
}
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for all endpoints that may require authentication
client.secrets.create(name="model-api-key", data="<your-model-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="judge-embedding-api-key", data="<your-judge-embedding-key>")
client.secrets.create(name="reranker-api-key", data="<your-reranker-key>")
Reference secrets by name in your metric configuration:
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-model-api-key-ref" # Name of secret, not the actual API key
}
Dataset Format#
RAG metrics support the RAGAS dataset format.
RAGAS Format#
The RAGAS format uses a columnar structure where each field is a list of values:
{
"question": ["question #1", "question #2"],
"contexts": [
["context #1 for Q1", "context #2 for Q1"],
["context #1 for Q2"]
],
"answer": ["answer for Q1", "answer for Q2"],
"ground_truth": ["ground truth for Q1", "ground truth for Q2"]
}
| Field | Type | Required | Description |
|---|---|---|---|
| question | list[string] | Yes | List of questions |
| contexts | list[list[string]] | Some metrics | List of context passages per question |
| answer | list[string] | Some metrics | List of generated answers |
| ground_truth | list[string] | Some metrics | List of reference answers |
Note
Different metrics require different columns. Check the metric documentation for specific requirements.
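If your evaluation data lives as one record per question, a few lines of Python can convert it into the columnar RAGAS layout. This is a minimal sketch; the record list and output filename are illustrative.

import json

# Row-oriented records: one dict per question
records = [
    {
        "question": "What is the capital of France?",
        "contexts": ["Paris is the capital city of France."],
        "answer": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
]

# Convert to the columnar RAGAS layout: one list per field, aligned by index
ragas_dataset = {
    field: [record[field] for record in records]
    for field in ("question", "contexts", "answer", "ground_truth")
}

with open("dataset.json", "w") as f:
    json.dump(ragas_dataset, f, indent=2)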
Built-in Datasets#
The platform provides built-in RAGAS datasets that can be referenced by name:
| Dataset | Description | Use Case |
|---|---|---|
| ragas/amnesty_qa | Amnesty International Q&A dataset | General RAG evaluation |
Usage:
# Reference built-in dataset by name
"dataset": "ragas/amnesty_qa"
You can also use custom datasets via the following options (each form is sketched after this list):
FilesetUrn: Upload to the Files API and reference as workspace/fileset-name/filename.json
Inline Dataset: Embed data directly in the API request
HuggingFace: Reference public or private HuggingFace datasets
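These options correspond to different values for the dataset field of the job spec. The snippet below summarizes the shapes used in this guide; the fileset path and repository ID are placeholders.

# Built-in dataset, referenced by name
dataset = "ragas/amnesty_qa"

# Fileset uploaded via the Files API (placeholder workspace/fileset/file path)
dataset = "default/my-fileset/dataset.json"

# Inline dataset embedded directly in the request (RAGAS columnar rows)
dataset = {"rows": [{"question": ["..."], "contexts": [["..."]], "answer": ["..."], "ground_truth": ["..."]}]}

# HuggingFace dataset; add "token_secret" under "storage" for private repos
dataset = {
    "storage": {"type": "huggingface", "repo_id": "my-org/my-dataset", "repo_type": "dataset"},
    "path": "dataset.json",
}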
Job Management#
After creating a job, navigate to Metrics Job Management to monitor its progress and manage its execution.
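You can also check on a job from the SDK. The sketch below is a hypothetical polling loop: the retrieve method, its signature, and the status values are assumptions based on the create call shown earlier, so confirm them against the SDK reference.

import time

# Hypothetical polling loop; method name, signature, and status values are assumptions
while True:
    current = client.evaluation.metric_jobs.retrieve(job.id, workspace="default")
    print(f"Job {current.id}: {current.status}")
    if current.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)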
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
|  | Missing judge LLM config for metric | Add judge_llm to metric_params |
|  | Using system/rag-response-relevancy without judge_embeddings | Add judge_embeddings to metric_params |
| Job stuck in “pending” | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check secret names match exactly |
| Low faithfulness scores | Context doesn’t support the answer | Increase top_k or review retrieval quality |
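If a job sits in pending or fails with authentication errors, you can verify an endpoint and key outside the evaluation service with a direct request. This sketch targets the OpenAI-compatible chat completions endpoint used in the examples; the environment variable holding the key is an assumption about your setup.

import os
import requests

# Minimal connectivity and auth check against the chat completions endpoint from the examples
resp = requests.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},  # key location is an assumption
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
print(resp.status_code)  # 200 means the endpoint and key work; 401/403 points to the API key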
Tips for Better Results#
Use larger judge models (70B+) for more consistent scoring
Start with inline datasets to test your configuration before large evaluations
Set appropriate timeouts - judge LLM calls can take time with large contexts
Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits
Limitations#
Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
Dataset Format: RAG metrics currently only support the RAGAS dataset format. Ensure your data matches the columnar structure.
Embedding Dimensions: Ensure your embedding model dimensions are compatible with the configured vector store.
See also
Retriever Metrics - Evaluate retrieval quality
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows