Retriever Evaluation Metrics#
Retriever metrics evaluate the quality of document retrieval pipelines using standard trec_eval-based information retrieval (IR) metrics. Scores are computed with pytrec_eval from relevance judgments over the retrieved results.
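For intuition about what these metrics measure, the following sketch computes ndcg_cut_10 directly with pytrec_eval from a toy set of relevance judgments and a scored retrieval run. This is only an illustration of the underlying library, not the service's internal implementation; the query and document IDs are made up.

import pytrec_eval

# Relevance judgments (qrels): query ID -> {document ID: relevance}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# A retrieval run: query ID -> {document ID: similarity score}
run = {
    "q1": {"doc1": 0.92, "doc2": 0.31},
    "q2": {"doc1": 0.40, "doc2": 0.85},
}

# The "ndcg_cut" measure expands to ndcg_cut_5, ndcg_cut_10, ... per query.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)

mean_ndcg_10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
print(f"mean ndcg_cut_10: {mean_ndcg_10:.2f}")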
Overview#
Retriever evaluation metrics require:
Retriever Pipeline: Embedding model for document retrieval
Reranker Model (optional): Reranking service for improved retrieval accuracy
Dataset: BEIR-formatted dataset with queries, corpus, and relevance judgments
Retriever metrics do not require a judge LLM—they compute scores based on the positions of relevant documents in the retrieved results.
Prerequisites#
Before running Retriever evaluations:
Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to embedding model endpoint (and optional reranker)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Supported Metrics#
Retriever metrics are organized into the following categories. Metrics listed with {k} support cutoffs k ∈ {5, 10, 20, 100}.

| Category | Metrics |
|---|---|
| Precision | Precision at {k}, R-Precision, set-based precision, set-based relative precision |
| Recall | Recall at {k}, set-based recall |
| NDCG | Full NDCG, NDCG at cutoff {k} (e.g., `system/retriever-ndcg-cut-10`), NDCG with relevance levels, rank-biased NDCG |
| MAP | Mean Average Precision, MAP at cutoff {k}, geometric mean of average precision, set-based MAP |
| Success/Rank | Mean Reciprocal Rank, Success at {k} |
| Other | Binary preference, geometric mean of binary preference, inferred average precision, binary gain, cumulative gain, 11-point interpolated average precision, set-based F-measure |
See Retriever Metrics Reference for detailed descriptions.
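The examples on this page use the URNs system/retriever-ndcg-cut-10 and system/retriever-recall-10. If you need a different cutoff, the sketch below extrapolates that naming pattern to the other supported values of k; treat the generated names as an assumption and confirm them against the metrics available in your deployment.

# Assumed URN pattern, extrapolated from the documented examples above.
cutoffs = (5, 10, 20, 100)
ndcg_metrics = [f"system/retriever-ndcg-cut-{k}" for k in cutoffs]
recall_metrics = [f"system/retriever-recall-{k}" for k in cutoffs]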
Run Metric Job#
Retriever metrics run as asynchronous jobs.
Basic Retriever Evaluation#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-ndcg-cut-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "beir/nfcorpus",
"metric_params": {
"dataset_format": "beir",
"top_k": 10
}
}
)
print(f"Job created: {job.name} ({job.id})")
BEIR datasets consist of three files:
corpus.jsonl - Documents to retrieve from:
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}
{"_id": "doc2", "title": "Isaac Newton", "text": "Isaac Newton was an English mathematician and physicist."}
queries.jsonl - Queries to evaluate:
{"_id": "q1", "text": "Who developed the theory of relativity?"}
{"_id": "q2", "text": "Who discovered gravity?"}
qrels.tsv - Relevance judgments (tab-separated):
query-id corpus-id score
q1 doc1 1
q2 doc2 1
When the job completes, its results report aggregate scores for the metric, for example:

{
"aggregate_scores": [
{
"name": "ndcg_cut_10",
"count": 2,
"mean": 0.85,
"min": 0.8,
"max": 0.9
}
]
}
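Once you have the results payload as a Python dictionary with the shape shown above (how you fetch it depends on your job-management workflow; see Metrics Job Management), a small helper like the following can print a summary. The function name is illustrative.

def summarize_aggregate_scores(results: dict) -> None:
    # Print one line per aggregate score entry in the payload.
    for score in results.get("aggregate_scores", []):
        print(
            f"{score['name']}: mean={score['mean']:.3f} "
            f"(min={score['min']}, max={score['max']}, count={score['count']})"
        )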
Note
If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).
Retriever Evaluation with Reranker#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-ndcg-cut-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
},
"reranker_model": {
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key"
}
},
"dataset": "beir/fiqa",
"metric_params": {
"dataset_format": "beir",
"top_k": 10
}
}
)
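To quantify how much the reranker helps (see the tip under Tips for Better Results), one approach is to submit the same evaluation twice, with and without the reranker, and compare the aggregate scores. The sketch below reuses the configuration values from the examples above; the variable names are illustrative.

embedding_model = {
    "url": "https://integrate.api.nvidia.com/v1",
    "name": "nvidia/nv-embedqa-e5-v5",
    "format": "nim",
    "api_key_secret": "embedding-api-key",
}
reranker_model = {
    "url": "https://integrate.api.nvidia.com/v1/ranking",
    "name": "nvidia/nv-rerankqa-mistral-4b-v3",
    "format": "nim",
    "api_key_secret": "reranker-api-key",
}

pipelines = {
    "embedding-only": {"embedding_model": embedding_model},
    "with-reranker": {"embedding_model": embedding_model, "reranker_model": reranker_model},
}

jobs = {}
for label, pipeline in pipelines.items():
    jobs[label] = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={
            "metric": "system/retriever-ndcg-cut-10",
            "retriever_pipeline": pipeline,
            "dataset": "beir/fiqa",
            "metric_params": {"dataset_format": "beir", "top_k": 10},
        },
    )

print({label: job.id for label, job in jobs.items()})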
Retriever Evaluation with Custom BEIR Dataset#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-recall-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "default/my-custom-dataset", # Fileset URN
"metric_params": {
"dataset_format": "beir",
"top_k": 20,
"truncate_long_documents": "end"
}
}
)
Retriever Metrics Reference#
Precision Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Precision at k | Fraction of the top k results that are relevant. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| R-Precision | Precision at R, where R is the number of relevant documents for the query | 0.0 – 1.0 |
| Set Precision | Set-based precision over unique documents | 0.0 – 1.0 |
| Set Relative Precision | Set-based relative precision | 0.0 – 1.0 |
Recall Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Recall at k (e.g., `system/retriever-recall-10`) | Fraction of relevant documents found in the top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| Set Recall | Set-based recall over unique documents | 0.0 – 1.0 |
NDCG Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| NDCG | Full NDCG: ranking quality with graded relevance | 0.0 – 1.0 |
| NDCG at cutoff k (e.g., `system/retriever-ndcg-cut-10`) | NDCG computed over the top k results. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| NDCG with relevance levels | NDCG variant that incorporates relevance levels | 0.0 – 1.0 |
| Rank-biased NDCG | Rank-biased variant of NDCG | 0.0 – 1.0 |
Mean Average Precision (MAP) Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Mean Average Precision | MAP over the full ranking | 0.0 – 1.0 |
| MAP at cutoff k | MAP computed over the top k results. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| Geometric Mean of Average Precision | Geometric mean of average precision across queries | 0.0 – 1.0 |
| Set MAP | Set-based MAP | 0.0 – 1.0 |
Reciprocal Rank & Success Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Mean Reciprocal Rank | Inverse of the rank of the first relevant document, averaged over queries | 0.0 – 1.0 |
| Success at k | Whether at least one relevant document appears in the top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
Other Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Binary Preference | Preference of relevant over non-relevant documents | 0.0 – 1.0 |
| Geometric Mean of Binary Preference | Geometric mean of binary preference across queries | 0.0 – 1.0 |
| Inferred Average Precision | Inferred average precision | 0.0 – 1.0 |
| Binary Gain | Binary gain | 0.0 – 1.0 |
| Cumulative Gain | Cumulative gain | 0.0 – 1.0 |
| 11-point Interpolated Average Precision | Average precision interpolated at 11 recall points | 0.0 – 1.0 |
| Set F-measure | Set-based F-measure | 0.0 – 1.0 |
Metric Parameters#
Job Spec Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| `metric` | string | Yes | Metric URN (e.g., `system/retriever-ndcg-cut-10`) |
| `retriever_pipeline` | object | Yes | Retriever pipeline with an embedding model and optional reranker |
| `dataset` | string | Yes | Dataset URN (e.g., `beir/nfcorpus` or a fileset URN such as `default/my-custom-dataset`) |
| `metric_params` | object | No | Metric-specific parameters |
Retriever Pipeline Configuration#
{
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "optional-embedding-api-key-ref" # Name of secret containing API key
},
"reranker_model": { # Optional
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "optional-reranker-api-key-ref" # Name of secret containing API key
}
}
Metric Parameters (metric_params)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_format` | string | `beir` | Dataset format; only `beir` is currently supported |
| `top_k` | int | 10 | Number of top results to retrieve |
| `truncate_long_documents` | string | Omitted | How to handle documents exceeding 65k characters: `"end"` keeps the first 65k characters, `"start"` keeps the last 65k characters |
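For example, to retrieve 20 candidates per query and truncate over-length documents, the metric_params object could look like this (values taken from the examples and limitations on this page):

metric_params = {
    "dataset_format": "beir",
    "top_k": 20,
    "truncate_long_documents": "end",  # keep the first 65k characters of each document
}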
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for embedding and reranker endpoints
client.secrets.create(name="embedding-api-key", data="your-embedding-key")
client.secrets.create(name="reranker-api-key", data="your-reranker-key")
Reference secrets by name in your metric configuration:
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "optional-embedding-api-key-ref" # Name of secret, not the actual API key
}
Dataset Format#
Retriever metrics support the BEIR dataset format.
BEIR Format#
BEIR (Benchmarking Information Retrieval) datasets consist of three files:
corpus.jsonl#
| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique document identifier |
| `title` | string | No | Document title (optional) |
| `text` | string | Yes | Document text content |
Example:
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}
queries.jsonl#
| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique query identifier |
| `text` | string | Yes | Query text |
Example:
{"_id": "q1", "text": "Who developed the theory of relativity?"}
qrels.tsv#
Tab-separated file with relevance judgments:
| Column | Type | Description |
|---|---|---|
| `query-id` | string | Query identifier (matches `_id` in queries.jsonl) |
| `corpus-id` | string | Document identifier (matches `_id` in corpus.jsonl) |
| `score` | integer | Relevance score (typically 1 for relevant, 0 for not relevant) |
Example:
query-id corpus-id score
q1 doc1 1
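Before uploading a custom dataset, it can help to confirm that every query has at least one judgment and that qrels.tsv only references documents that exist in corpus.jsonl; missing relevance judgments lead to zero scores (see Troubleshooting). A minimal sketch, assuming the three files are in the current directory:

import csv
import json

def load_ids(path: str) -> set:
    # Collect the _id field from every line of a JSONL file.
    with open(path) as f:
        return {json.loads(line)["_id"] for line in f if line.strip()}

query_ids = load_ids("queries.jsonl")
doc_ids = load_ids("corpus.jsonl")

with open("qrels.tsv") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

judged_queries = {row["query-id"] for row in rows}
unknown_docs = {row["corpus-id"] for row in rows} - doc_ids

print("queries without judgments:", query_ids - judged_queries)
print("qrels entries pointing at unknown documents:", unknown_docs)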
Built-in Datasets#
The platform provides built-in BEIR datasets that can be referenced by name:
| Dataset | Description | Corpus Size | Use Case |
|---|---|---|---|
| `beir/nfcorpus` | Natural language corpus for biomedical IR | ~3.6K docs | Small dataset for quick testing |
| `beir/fiqa` | Financial question answering | ~57K docs | Domain-specific retrieval |
| `beir/scidocs` | Scientific document retrieval | ~25K docs | Academic/scientific retrieval |
| `beir/scifact` | Scientific fact verification | ~5K docs | Fact-checking retrieval |
Usage:
# Reference built-in dataset by name
"dataset": "beir/nfcorpus"
You can also use custom BEIR datasets:
Fileset URN: upload the dataset to the Files API and reference it as `workspace/fileset-name`.
The fileset should contain `corpus.jsonl`, `queries.jsonl`, and `qrels.tsv` files; see the sketch below for generating them locally.
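A minimal sketch for generating those three files locally, mirroring the examples earlier on this page (the output directory name is a placeholder):

import json
from pathlib import Path

out = Path("my-custom-dataset")
out.mkdir(exist_ok=True)

corpus = [
    {"_id": "doc1", "title": "Albert Einstein",
     "text": "Albert Einstein was a German-born theoretical physicist."},
]
queries = [{"_id": "q1", "text": "Who developed the theory of relativity?"}]
qrels = [("q1", "doc1", 1)]

with (out / "corpus.jsonl").open("w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with (out / "queries.jsonl").open("w") as f:
    f.writelines(json.dumps(query) + "\n" for query in queries)
with (out / "qrels.tsv").open("w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.writelines(f"{q}\t{d}\t{s}\n" for q, d, s in qrels)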
Note
For a complete list of BEIR datasets, refer to the BEIR repository.
Job Management#
After creating a job, see Metrics Job Management to monitor its progress and manage its execution.
Choosing the Right Metric#
| Use Case | Recommended Metrics |
|---|---|
| General retrieval quality | NDCG at cutoff k (e.g., `system/retriever-ndcg-cut-10`) |
| Checking if relevant docs appear at all | Success at k |
| Top result quality | Mean Reciprocal Rank, Precision at a small k |
| Complete retrieval | Recall at k (e.g., `system/retriever-recall-10`) |
| Ranking quality with graded relevance | NDCG |
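If you want several of these views at once, you can create one job per metric URN for the same pipeline and dataset. A sketch using only the two URNs documented on this page, with the same embedding configuration as the earlier examples:

embedding_model = {
    "url": "https://integrate.api.nvidia.com/v1",
    "name": "nvidia/nv-embedqa-e5-v5",
    "format": "nim",
    "api_key_secret": "embedding-api-key",
}

for metric in ("system/retriever-ndcg-cut-10", "system/retriever-recall-10"):
    job = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={
            "metric": metric,
            "retriever_pipeline": {"embedding_model": embedding_model},
            "dataset": "beir/nfcorpus",
            "metric_params": {"dataset_format": "beir", "top_k": 20},
        },
    )
    print(f"{metric}: {job.id}")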
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
| Job stuck in "pending" | Embedding endpoint not accessible | Verify the endpoint URL and API key secret |
| Authentication failed | Invalid or missing API key | Check that the secret name matches exactly |
| Out of memory | Large corpus | Reduce the corpus size or evaluate a smaller subset |
| Low recall scores | Not enough documents retrieved | Increase `top_k` |
| Zero scores | Missing relevance judgments | Ensure qrels.tsv has entries for your queries |
Tips for Better Results#
Start with small datasets like `beir/nfcorpus` to validate your configuration.
Use a `top_k` greater than the metric cutoff, e.g., set `top_k: 20` when measuring `retriever-ndcg-cut-10`.
Enable truncation for datasets with long documents to avoid embedding errors.
Compare with a reranker: add a reranker to see how much the ranking improves.
Limitations#
Dataset Format: Retriever metrics currently only support the BEIR dataset format. Ensure your data follows the required structure.
Document Length: Documents exceeding 65k characters may need truncation. Use `truncate_long_documents: "end"` to keep the first 65k characters or `"start"` to keep the last 65k characters.
Memory Usage: Large corpora may require significant memory for indexing. Consider limiting evaluation to smaller subsets for testing.
See also
RAG Metrics - Evaluate RAG pipelines with answer generation
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows