Retriever Evaluation Metrics#
Retriever metrics evaluate the quality of document retrieval pipelines using standard trec_eval-based information retrieval (IR) metrics. Scores are computed with pytrec_eval from relevance judgments over the retrieved results.
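For intuition about what these metrics measure, the following sketch computes ndcg_cut_10 directly with pytrec_eval from a toy set of relevance judgments and a scored retrieval run. This is only an illustration of the underlying library, not the service's internal implementation; the query and document IDs are made up.

import pytrec_eval

# Relevance judgments (qrels): query ID -> {document ID: relevance}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# A retrieval run: query ID -> {document ID: similarity score}
run = {
    "q1": {"doc1": 0.92, "doc2": 0.31},
    "q2": {"doc1": 0.40, "doc2": 0.85},
}

# The "ndcg_cut" measure expands to ndcg_cut_5, ndcg_cut_10, ... per query.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)

mean_ndcg_10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
print(f"mean ndcg_cut_10: {mean_ndcg_10:.2f}")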
Overview#
Retriever evaluation metrics require:
Retriever Pipeline: Embedding model for document retrieval
Reranker Model (optional): Reranking service for improved retrieval accuracy
Dataset: BEIR-formatted dataset with queries, corpus, and relevance judgments
Retriever metrics do not require a judge LLM—they compute scores based on the positions of relevant documents in the retrieved results.
Prerequisites#
Before running Retriever evaluations:
Workspace: Have a workspace created. All resources (metrics, secrets, jobs) are scoped to a workspace.
Model Endpoints: Access to embedding model endpoint (and optional reranker)
API Keys (if required): Create secrets for any endpoints requiring authentication
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Supported Metrics#
Retriever metrics are organized into the following categories. Metrics listed with {k} support cutoffs k ∈ {5, 10, 20, 100}.

| Category | Metrics |
|---|---|
| Precision | Precision at {k}, R-Precision, set-based precision, set-based relative precision |
| Recall | Recall at {k}, set-based recall |
| NDCG | Full NDCG, NDCG at cutoff {k} (e.g., `system/retriever-ndcg-cut-10`), NDCG with relevance levels, rank-biased NDCG |
| MAP | Mean Average Precision, MAP at cutoff {k}, geometric mean of average precision, set-based MAP |
| Success/Rank | Mean Reciprocal Rank, Success at {k} |
| Other | Binary preference, geometric mean of binary preference, inferred average precision, binary gain, cumulative gain, 11-point interpolated average precision, set-based F-measure |
See Retriever Metrics Reference for detailed descriptions.
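The examples on this page use the URNs system/retriever-ndcg-cut-10 and system/retriever-recall-10. If you need a different cutoff, the sketch below extrapolates that naming pattern to the other supported values of k; treat the generated names as an assumption and confirm them against the metrics available in your deployment.

# Assumed URN pattern, extrapolated from the documented examples above.
cutoffs = (5, 10, 20, 100)
ndcg_metrics = [f"system/retriever-ndcg-cut-{k}" for k in cutoffs]
recall_metrics = [f"system/retriever-recall-{k}" for k in cutoffs]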
Run Metric Job#
Retriever metrics run as asynchronous jobs.
Basic Retriever Evaluation#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-ndcg-cut-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "beir/nfcorpus",
"metric_params": {
"dataset_format": "beir",
"top_k": 10
}
}
)
print(f"Job created: {job.name} ({job.id})")
BEIR datasets consist of three files:
corpus.jsonl - Documents to retrieve from:
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}
{"_id": "doc2", "title": "Isaac Newton", "text": "Isaac Newton was an English mathematician and physicist."}
queries.jsonl - Queries to evaluate:
{"_id": "q1", "text": "Who developed the theory of relativity?"}
{"_id": "q2", "text": "Who discovered gravity?"}
qrels.tsv - Relevance judgments (tab-separated):
query-id corpus-id score
q1 doc1 1
q2 doc2 1
When the job completes, its results report aggregate scores for the metric, for example:

{
"aggregate_scores": [
{
"name": "ndcg_cut_10",
"count": 2,
"mean": 0.85,
"min": 0.8,
"max": 0.9
}
]
}
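Once you have the results payload as a Python dictionary with the shape shown above (how you fetch it depends on your job-management workflow; see Metrics Job Management), a small helper like the following can print a summary. The function name is illustrative.

def summarize_aggregate_scores(results: dict) -> None:
    # Print one line per aggregate score entry in the payload.
    for score in results.get("aggregate_scores", []):
        print(
            f"{score['name']}: mean={score['mean']:.3f} "
            f"(min={score['min']}, max={score['max']}, count={score['count']})"
        )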
Note
If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).
Retriever Evaluation with Reranker#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-ndcg-cut-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
},
"reranker_model": {
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "reranker-api-key"
}
},
"dataset": "beir/fiqa",
"metric_params": {
"dataset_format": "beir",
"top_k": 10
}
}
)
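To quantify how much the reranker helps (see the tip under Tips for Better Results), one approach is to submit the same evaluation twice, with and without the reranker, and compare the aggregate scores. The sketch below reuses the configuration values from the examples above; the variable names are illustrative.

embedding_model = {
    "url": "https://integrate.api.nvidia.com/v1",
    "name": "nvidia/nv-embedqa-e5-v5",
    "format": "nim",
    "api_key_secret": "embedding-api-key",
}
reranker_model = {
    "url": "https://integrate.api.nvidia.com/v1/ranking",
    "name": "nvidia/nv-rerankqa-mistral-4b-v3",
    "format": "nim",
    "api_key_secret": "reranker-api-key",
}

pipelines = {
    "embedding-only": {"embedding_model": embedding_model},
    "with-reranker": {"embedding_model": embedding_model, "reranker_model": reranker_model},
}

jobs = {}
for label, pipeline in pipelines.items():
    jobs[label] = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={
            "metric": "system/retriever-ndcg-cut-10",
            "retriever_pipeline": pipeline,
            "dataset": "beir/fiqa",
            "metric_params": {"dataset_format": "beir", "top_k": 10},
        },
    )

print({label: job.id for label, job in jobs.items()})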
Retriever Evaluation with Custom BEIR Dataset#
job = client.evaluation.metric_jobs.create(
workspace="default",
spec={
"metric": "system/retriever-recall-10",
"retriever_pipeline": {
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "embedding-api-key"
}
},
"dataset": "default/my-custom-dataset", # Fileset URN
"metric_params": {
"dataset_format": "beir",
"top_k": 20,
"truncate_long_documents": "end"
}
}
)
Retriever Metrics Reference#
Precision Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Precision at k | Fraction of the top k results that are relevant. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| R-Precision | Precision at R, where R is the number of relevant documents for the query | 0.0 – 1.0 |
| Set Precision | Set-based precision over unique documents | 0.0 – 1.0 |
| Set Relative Precision | Set-based relative precision | 0.0 – 1.0 |
Recall Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Recall at k (e.g., `system/retriever-recall-10`) | Fraction of relevant documents found in the top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| Set Recall | Set-based recall over unique documents | 0.0 – 1.0 |
NDCG Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| NDCG | Full NDCG: ranking quality with graded relevance | 0.0 – 1.0 |
| NDCG at cutoff k (e.g., `system/retriever-ndcg-cut-10`) | NDCG computed over the top k results. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| NDCG with relevance levels | NDCG variant that incorporates relevance levels | 0.0 – 1.0 |
| Rank-biased NDCG | Rank-biased variant of NDCG | 0.0 – 1.0 |
Mean Average Precision (MAP) Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Mean Average Precision | MAP over the full ranking | 0.0 – 1.0 |
| MAP at cutoff k | MAP computed over the top k results. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| Geometric Mean of Average Precision | Geometric mean of average precision across queries | 0.0 – 1.0 |
| Set MAP | Set-based MAP | 0.0 – 1.0 |
Reciprocal Rank & Success Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Mean Reciprocal Rank | Inverse of the rank of the first relevant document, averaged over queries | 0.0 – 1.0 |
| Success at k | Whether at least one relevant document appears in the top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
Other Metrics#
| Metric Name | Description | Value Range |
|---|---|---|
| Binary Preference | Preference of relevant over non-relevant documents | 0.0 – 1.0 |
| Geometric Mean of Binary Preference | Geometric mean of binary preference across queries | 0.0 – 1.0 |
| Inferred Average Precision | Inferred average precision | 0.0 – 1.0 |
| Binary Gain | Binary gain | 0.0 – 1.0 |
| Cumulative Gain | Cumulative gain | 0.0 – 1.0 |
| 11-point Interpolated Average Precision | Average precision interpolated at 11 recall points | 0.0 – 1.0 |
| Set F-measure | Set-based F-measure | 0.0 – 1.0 |
Metric Parameters#
Job Spec Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| `metric` | string | Yes | Metric URN (e.g., `system/retriever-ndcg-cut-10`) |
| `retriever_pipeline` | object | Yes | Retriever pipeline with an embedding model and optional reranker |
| `dataset` | string | Yes | Dataset URN (e.g., `beir/nfcorpus` or a fileset URN such as `default/my-custom-dataset`) |
| `metric_params` | object | No | Metric-specific parameters |
Retriever Pipeline Configuration#
{
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "optional-embedding-api-key-ref" # Name of secret containing API key
},
"reranker_model": { # Optional
"url": "https://integrate.api.nvidia.com/v1/ranking",
"name": "nvidia/nv-rerankqa-mistral-4b-v3",
"format": "nim",
"api_key_secret": "optional-reranker-api-key-ref" # Name of secret containing API key
}
}
Metric Parameters (metric_params)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_format` | string | `beir` | Dataset format; only `beir` is currently supported |
| `top_k` | int | 10 | Number of top results to retrieve |
| `truncate_long_documents` | string | Omitted | How to handle documents exceeding 65k characters: `"end"` keeps the first 65k characters, `"start"` keeps the last 65k characters |
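For example, to retrieve 20 candidates per query and truncate over-length documents, the metric_params object could look like this (values taken from the examples and limitations on this page):

metric_params = {
    "dataset_format": "beir",
    "top_k": 20,
    "truncate_long_documents": "end",  # keep the first 65k characters of each document
}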
Managing Secrets for Authenticated Endpoints#
Store API keys as secrets for secure authentication:
# Create secrets for embedding and reranker endpoints
client.secrets.create(name="embedding-api-key", data="your-embedding-key")
client.secrets.create(name="reranker-api-key", data="your-reranker-key")
Reference secrets by name in your metric configuration:
"embedding_model": {
"url": "https://integrate.api.nvidia.com/v1",
"name": "nvidia/nv-embedqa-e5-v5",
"format": "nim",
"api_key_secret": "optional-embedding-api-key-ref" # Name of secret, not the actual API key
}
Dataset Format#
Retriever metrics support the BEIR dataset format.
BEIR Format#
BEIR (Benchmarking Information Retrieval) datasets consist of three files:
corpus.jsonl#
| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique document identifier |
| `title` | string | No | Document title (optional) |
| `text` | string | Yes | Document text content |
Example:
{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}
queries.jsonl#
| Field | Type | Required | Description |
|---|---|---|---|
| `_id` | string | Yes | Unique query identifier |
| `text` | string | Yes | Query text |
Example:
{"_id": "q1", "text": "Who developed the theory of relativity?"}
qrels.tsv#
Tab-separated file with relevance judgments:
| Column | Type | Description |
|---|---|---|
| `query-id` | string | Query identifier (matches `_id` in queries.jsonl) |
| `corpus-id` | string | Document identifier (matches `_id` in corpus.jsonl) |
| `score` | integer | Relevance score (typically 1 for relevant, 0 for not relevant) |
Example:
query-id corpus-id score
q1 doc1 1
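Before uploading a custom dataset, it can help to confirm that every query has at least one judgment and that qrels.tsv only references documents that exist in corpus.jsonl; missing relevance judgments lead to zero scores (see Troubleshooting). A minimal sketch, assuming the three files are in the current directory:

import csv
import json

def load_ids(path: str) -> set:
    # Collect the _id field from every line of a JSONL file.
    with open(path) as f:
        return {json.loads(line)["_id"] for line in f if line.strip()}

query_ids = load_ids("queries.jsonl")
doc_ids = load_ids("corpus.jsonl")

with open("qrels.tsv") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

judged_queries = {row["query-id"] for row in rows}
unknown_docs = {row["corpus-id"] for row in rows} - doc_ids

print("queries without judgments:", query_ids - judged_queries)
print("qrels entries pointing at unknown documents:", unknown_docs)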
Built-in Datasets#
The platform provides built-in BEIR datasets that can be referenced by name:
| Dataset | Description | Corpus Size | Use Case |
|---|---|---|---|
| `beir/nfcorpus` | Natural language corpus for biomedical IR | ~3.6K docs | Small dataset for quick testing |
| `beir/fiqa` | Financial question answering | ~57K docs | Domain-specific retrieval |
| `beir/scidocs` | Scientific document retrieval | ~25K docs | Academic/scientific retrieval |
| `beir/scifact` | Scientific fact verification | ~5K docs | Fact-checking retrieval |
Usage:
# Reference built-in dataset by name
"dataset": "beir/nfcorpus"
You can also use custom BEIR datasets:
Fileset URN: upload the dataset to the Files API and reference it as `workspace/fileset-name`.
The fileset should contain `corpus.jsonl`, `queries.jsonl`, and `qrels.tsv` files; see the sketch below for generating them locally.
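A minimal sketch for generating those three files locally, mirroring the examples earlier on this page (the output directory name is a placeholder):

import json
from pathlib import Path

out = Path("my-custom-dataset")
out.mkdir(exist_ok=True)

corpus = [
    {"_id": "doc1", "title": "Albert Einstein",
     "text": "Albert Einstein was a German-born theoretical physicist."},
]
queries = [{"_id": "q1", "text": "Who developed the theory of relativity?"}]
qrels = [("q1", "doc1", 1)]

with (out / "corpus.jsonl").open("w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with (out / "queries.jsonl").open("w") as f:
    f.writelines(json.dumps(query) + "\n" for query in queries)
with (out / "qrels.tsv").open("w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.writelines(f"{q}\t{d}\t{s}\n" for q, d, s in qrels)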
Note
For a complete list of BEIR datasets, refer to the BEIR repository.
Job Management#
After creating a job, see Metrics Job Management to monitor its progress and manage its execution.
Choosing the Right Metric#
| Use Case | Recommended Metrics |
|---|---|
| General retrieval quality | NDCG at cutoff k (e.g., `system/retriever-ndcg-cut-10`) |
| Checking if relevant docs appear at all | Success at k |
| Top result quality | Mean Reciprocal Rank, Precision at a small k |
| Complete retrieval | Recall at k (e.g., `system/retriever-recall-10`) |
| Ranking quality with graded relevance | NDCG |
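If you want several of these views at once, you can create one job per metric URN for the same pipeline and dataset. A sketch using only the two URNs documented on this page, with the same embedding configuration as the earlier examples:

embedding_model = {
    "url": "https://integrate.api.nvidia.com/v1",
    "name": "nvidia/nv-embedqa-e5-v5",
    "format": "nim",
    "api_key_secret": "embedding-api-key",
}

for metric in ("system/retriever-ndcg-cut-10", "system/retriever-recall-10"):
    job = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={
            "metric": metric,
            "retriever_pipeline": {"embedding_model": embedding_model},
            "dataset": "beir/nfcorpus",
            "metric_params": {"dataset_format": "beir", "top_k": 20},
        },
    )
    print(f"{metric}: {job.id}")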
Troubleshooting#
Common Errors#
| Error | Cause | Solution |
|---|---|---|
| Job stuck in "pending" | Embedding endpoint not accessible | Verify the endpoint URL and API key secret |
| Authentication failed | Invalid or missing API key | Check that the secret name matches exactly |
| Out of memory | Large corpus | Reduce the corpus size or evaluate a smaller subset |
| Low recall scores | Not enough documents retrieved | Increase `top_k` |
| Zero scores | Missing relevance judgments | Ensure qrels.tsv has entries for your queries |
Tips for Better Results#
Start with small datasets like `beir/nfcorpus` to validate your configuration.
Use a `top_k` greater than the metric cutoff, e.g., set `top_k: 20` when measuring `retriever-ndcg-cut-10`.
Enable truncation for datasets with long documents to avoid embedding errors.
Compare with a reranker: add a reranker to see how much the ranking improves.
Limitations#
Dataset Format: Retriever metrics currently only support the BEIR dataset format. Ensure your data follows the required structure.
Document Length: Documents exceeding 65k characters may need truncation. Use `truncate_long_documents: "end"` to keep the first 65k characters or `"start"` to keep the last 65k characters.
Memory Usage: Large corpora may require significant memory for indexing. Consider limiting evaluation to smaller subsets for testing.
See also
RAG Metrics - Evaluate RAG pipelines with answer generation
LLM-as-a-Judge - Custom judge-based evaluation
Agentic Metrics - Evaluate agent workflows