RAG Evaluation Metrics#
RAG (Retrieval Augmented Generation) metrics evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. These metrics use RAGAS to assess how well retrieved contexts support generated answers.
Overview#
RAG evaluation metrics require:
RAG Model: The LLM used to generate answers
Retriever Pipeline: Embedding model (and optional reranker) for document retrieval
Judge LLM: An LLM to evaluate answer quality
Judge Embeddings (optional): Required for some metrics, such as system/rag-response-relevancy
All RAG metrics require a judge LLM for evaluation. Some metrics additionally require judge embeddings for semantic similarity calculations.
Prerequisites#
Before running RAG evaluations:
Workspace: An existing workspace. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to RAG model, embedding model, and judge LLM endpoints
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Supported Metrics#
RAG metrics are organized into four categories:
| Category | Metrics |
|---|---|
| Faithfulness | system/rag-faithfulness |
| Answer Quality | system/rag-response-relevancy* |
| Context Quality |  |
| Robustness |  |

* Requires judge_embeddings in addition to judge_llm
See RAG Metrics Reference for detailed descriptions and requirements.
Run Metric Job#
RAG metrics run as asynchronous jobs. You can specify the metric configuration inline or reference a stored metric.
Basic RAG Evaluation#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2,
"inference_params": {
"max_tokens": 4000
}
}
}
}
)
print(f"Job created: {job.name} ({job.id})")
The RAGAS dataset format uses a columnar structure where each field is a list:
{
"question": [
"When did the 2024 SF Taiwan Day take place?",
"Where did the 2024 SF Taiwan Day take place?"
],
"contexts": [
["The 2024 SF Taiwan Day was held on May 25th at the Oakland Coliseum."],
["The event featured cultural performances and food from Taiwan."]
],
"ground_truth": [
"May 25th",
"Oakland Coliseum"
],
"answer": [
"The 2024 SF Taiwan Day took place on May 25th.",
"The 2024 SF Taiwan Day took place at the Oakland Coliseum."
]
}
Scores range from 0.0 to 1.0. Example aggregate results:
{
"aggregate_scores": [
{
"name": "faithfulness",
"count": 2,
"mean": 0.95,
"min": 0.9,
"max": 1.0
}
]
}
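Once the job finishes, a simple way to sanity-check the output is to scan aggregate_scores and flag any metric whose mean falls below a threshold you care about. This is a minimal sketch over the result structure shown above; the 0.8 threshold is an arbitrary example.

# Flag metrics whose mean score falls below a chosen threshold (0.8 is an arbitrary example)
results = {
    "aggregate_scores": [
        {"name": "faithfulness", "count": 2, "mean": 0.95, "min": 0.9, "max": 1.0}
    ]
}
THRESHOLD = 0.8
for score in results["aggregate_scores"]:
    status = "OK" if score["mean"] >= THRESHOLD else "NEEDS REVIEW"
    print(f"{score['name']}: mean={score['mean']:.2f} over {score['count']} samples [{status}]")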
Note
If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).
RAG Evaluation with Reranker#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
},
"reranker_model": {
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2
}
}
}
)
RAG Evaluation with Judge Embeddings#
Some metrics like system/rag-response-relevancy require both judge LLM and judge embeddings:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-response-relevancy",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "ragas/amnesty_qa",
"metric_params": {
"dataset_format": "ragas",
"top_k": 10,
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
},
"request_timeout": 120,
"max_retries": 3,
"parallelism": 2
},
"judge_embeddings": {
"model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"api_key_secret": "judge-embedding-api-key"
}
}
}
}
)
RAG Evaluation with Inline Dataset#
Test with a small inline dataset before running on large datasets:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"rows": [
{
"question": ["What is the capital of France?", "Who wrote Romeo and Juliet?"],
"contexts": [
["Paris is the capital city of France."],
["William Shakespeare wrote Romeo and Juliet in the 1590s."]
],
"ground_truth": ["Paris", "William Shakespeare"],
"answer": [
"The capital of France is Paris.",
"Romeo and Juliet was written by Shakespeare."
]
}
]
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
RAG Evaluation with HuggingFace Dataset#
Load datasets directly from HuggingFace:
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"storage": {
"type": "huggingface",
"repo_id": "NotYours/test_ragas_dataset",
"repo_type": "dataset"
},
"path": "dataset.json"
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
For private HuggingFace datasets, store your access token as a secret and reference it via token_secret in the storage configuration:
# First create a secret for the HuggingFace token
client.secrets.create(
name="hf-token",
data="your-huggingface-token"
)
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/rag-faithfulness",
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "model-api-key"
},
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": {
"storage": {
"type": "huggingface",
"repo_id": "my-org/private-dataset",
"repo_type": "dataset",
"token_secret": "hf-token"
},
"path": "dataset.json"
},
"metric_params": {
"dataset_format": "ragas",
"judge_llm": {
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "judge-api-key"
}
}
}
}
)
RAG Metrics Reference#
| Use Case | Metric Name | Description | Required Columns |
|---|---|---|---|
| Detect hallucinations | system/rag-faithfulness | Measures factual consistency of generated answer with retrieved context | question, contexts, answer |
|  |  | Evaluates whether response is grounded in context without hallucinations | contexts, answer |
|  |  | Robustness to noisy or irrelevant context | question, contexts, answer, ground_truth |
| Validate answer accuracy |  | Evaluates correctness against ground truth reference | question, answer, ground_truth |
|  |  | Factual accuracy based on context and ground truth | question, answer, ground_truth |
| Check if answers address the question |  | Measures how relevant the answer is to the question | question, answer |
|  | system/rag-response-relevancy* | Response relevancy using embeddings similarity | question, answer |
| Measure semantic similarity |  | Semantic similarity between answer and ground truth | answer, ground_truth |
| Measure retrieval quality |  | Coverage of ground truth information in retrieved context | question, contexts, ground_truth |
|  |  | Whether all retrieved chunks are relevant to the question | question, contexts, ground_truth |
|  |  | Relevance of retrieved context to the question | question, contexts |
|  |  | Recall of important entities from ground truth in context | contexts, ground_truth |

* Requires judge_embeddings in addition to judge_llm
Required Columns: Dataset columns that must be present for the metric to be evaluated.
Metric Parameters#
Job Spec Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| metric | string | Yes | Metric URN (e.g., system/rag-faithfulness) |
| model | object | Yes | RAG model configuration |
| retriever_pipeline | object | Yes | Retriever pipeline with embedding model |
| dataset | string/object | Yes | Dataset URN, inline rows, or HuggingFace config |
| metric_params | object | Yes | Metric-specific parameters |
Model Configuration#
{
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-model-api-key-ref" # Optional: name of the secret containing the API key
}
Retriever Pipeline Configuration#
{
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key" # Optional: Name of secret containing API key
},
"reranker_model": { # Optional
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key" # Optional: Name of secret containing API key
}
}
Metric Parameters (metric_params)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_format | string | ragas | Dataset format (ragas) |
| top_k | int | 10 | Number of top results to retrieve |
|  | string | Omitted | Handle documents exceeding 65k characters |
| judge_llm | object | Required | Judge LLM configuration |
| judge_embeddings | object | Optional | Judge embeddings (required for some metrics) |
Judge LLM Configuration#
{
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-judge-api-key-ref" # Name of secret containing API key
},
"request_timeout": 120, # Request timeout in seconds
"max_retries": 3, # Max retries for failed requests
"parallelism": 2, # Concurrent judge workers
"inference_params": {
"max_tokens": 4000, # Max tokens for judge response
"temperature": 0.1, # Lower for consistent scoring
"top_p": 0.9
}
}
Judge Embeddings Configuration#
{
"model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"api_key_secret": "judge-embedding-api-key" # Optional: Name of secret containing API key
}
}
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for all endpoints that may require authentication
client.secrets.create(name="model-api-key", data="<your-model-key>")
client.secrets.create(name="embedding-api-key", data="<your-embedding-key>")
client.secrets.create(name="judge-api-key", data="<your-judge-key>")
client.secrets.create(name="judge-embedding-api-key", data="<your-judge-embedding-key>")
client.secrets.create(name="reranker-api-key", data="<your-reranker-key>")
Reference secrets by name in your metric configuration:
"model": {
"url": "https://integrate.api.nvidia.com/v1/chat/completions",
"name": "meta/llama-3.1-8b-instruct",
"api_key_secret": "optional-model-api-key-ref" # Name of secret, not the actual API key
}
Dataset Format#
RAG metrics support the RAGAS dataset format.
RAGAS Format#
The RAGAS format uses a columnar structure where each field is a list of values:
{
"question": ["question #1", "question #2"],
"contexts": [
["context #1 for Q1", "context #2 for Q1"],
["context #1 for Q2"]
],
"answer": ["answer for Q1", "answer for Q2"],
"ground_truth": ["ground truth for Q1", "ground truth for Q2"]
}
| Field | Type | Required | Description |
|---|---|---|---|
| question | list[string] | Yes | List of questions |
| contexts | list[list[string]] | Some metrics | List of context passages per question |
| answer | list[string] | Some metrics | List of generated answers |
| ground_truth | list[string] | Some metrics | List of reference answers |
Note
Different metrics require different columns. Check the metric documentation for specific requirements.
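If your evaluation data lives as one record per question, a few lines of Python can convert it into the columnar RAGAS layout. This is a minimal sketch; the record list and output filename are illustrative.

import json

# Row-oriented records: one dict per question
records = [
    {
        "question": "What is the capital of France?",
        "contexts": ["Paris is the capital city of France."],
        "answer": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
]

# Convert to the columnar RAGAS layout: one list per field, aligned by index
ragas_dataset = {
    field: [record[field] for record in records]
    for field in ("question", "contexts", "answer", "ground_truth")
}

with open("dataset.json", "w") as f:
    json.dump(ragas_dataset, f, indent=2)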
Built-in Datasets#
The platform provides built-in RAGAS datasets that can be referenced by name:
| Dataset | Description | Use Case |
|---|---|---|
| ragas/amnesty_qa | Amnesty International Q&A dataset | General RAG evaluation |
Usage:
# Reference built-in dataset by name
"dataset": "ragas/amnesty_qa"
You can also use custom datasets via the following options (each form is sketched after this list):
FilesetUrn: Upload to the Files API and reference as workspace/fileset-name/filename.json
Inline Dataset: Embed data directly in the API request
HuggingFace: Reference public or private HuggingFace datasets
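These options correspond to different values for the dataset field of the job spec. The snippet below summarizes the shapes used in this guide; the fileset path and repository ID are placeholders.

# Built-in dataset, referenced by name
dataset = "ragas/amnesty_qa"

# Fileset uploaded via the Files API (placeholder workspace/fileset/file path)
dataset = "default/my-fileset/dataset.json"

# Inline dataset embedded directly in the request (RAGAS columnar rows)
dataset = {"rows": [{"question": ["..."], "contexts": [["..."]], "answer": ["..."], "ground_truth": ["..."]}]}

# HuggingFace dataset; add "token_secret" under "storage" for private repos
dataset = {
    "storage": {"type": "huggingface", "repo_id": "my-org/my-dataset", "repo_type": "dataset"},
    "path": "dataset.json",
}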
Job Management#
After creating a job, navigate to Metrics Job Management to monitor its progress and manage its execution.
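You can also check on a job from the SDK. The sketch below is a hypothetical polling loop: the retrieve method, its signature, and the status values are assumptions based on the create call shown earlier, so confirm them against the SDK reference.

import time

# Hypothetical polling loop; method name, signature, and status values are assumptions
while True:
    current = client.evaluation.metric_jobs.retrieve(job.id, workspace="default")
    print(f"Job {current.id}: {current.status}")
    if current.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)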
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
|  | Missing judge LLM config for metric | Add judge_llm to metric_params |
|  | Using system/rag-response-relevancy without judge_embeddings | Add judge_embeddings to metric_params |
| Job stuck in “pending” | Model endpoint not accessible | Verify endpoint URLs and API key secrets |
| Authentication failed | Invalid or missing API key | Check secret names match exactly |
| Low faithfulness scores | Context doesn’t support the answer | Increase top_k or review retrieval quality |
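If a job sits in pending or fails with authentication errors, you can verify an endpoint and key outside the evaluation service with a direct request. This sketch targets the OpenAI-compatible chat completions endpoint used in the examples; the environment variable holding the key is an assumption about your setup.

import os
import requests

# Minimal connectivity and auth check against the chat completions endpoint from the examples
resp = requests.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},  # key location is an assumption
    json={
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
print(resp.status_code)  # 200 means the endpoint and key work; 401/403 points to the API key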
Tips for Better Results#
Use larger judge models (70B+) for more consistent scoring
Start with inline datasets to test your configuration before large evaluations
Set appropriate timeouts - judge LLM calls can take time with large contexts
Use parallelism wisely - increase parallelism for faster evaluation, but respect rate limits
Limitations#
Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
Dataset Format: RAG metrics currently only support the RAGAS dataset format. Ensure your data matches the columnar structure.
Embedding Dimensions: Ensure your embedding model dimensions are compatible with the configured vector store.
See also
Retriever Metrics - Evaluate retrieval quality
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows