Retriever Evaluation Metrics#

Retriever metrics evaluate the quality of document retrieval pipelines using standard trec_eval-based information retrieval (IR) metrics. These metrics use pytrec_eval to measure retrieval accuracy against relevance judgments.

Overview#

Retriever evaluation metrics require:

  • Retriever Pipeline: Embedding model for document retrieval

  • Reranker Model (optional): Reranking service for improved retrieval accuracy

  • Dataset: BEIR-formatted dataset with queries, corpus, and relevance judgments

Retriever metrics do not require a judge LLM—they compute scores based on the positions of relevant documents in the retrieved results.
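
To illustrate the underlying mechanics, the following standalone sketch shows how pytrec_eval derives such scores from a run (ranked documents per query) and qrels (relevance judgments). The query IDs, document IDs, and scores here are illustrative only; the evaluation service performs this computation for you.

import pytrec_eval

# Relevance judgments: query ID -> {document ID: graded relevance}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# Retrieval run: query ID -> {document ID: retrieval score}
run = {
    "q1": {"doc1": 0.92, "doc3": 0.55},
    "q2": {"doc4": 0.81, "doc2": 0.67},
}

# "ndcg_cut" expands to ndcg_cut_5, ndcg_cut_10, ...; "recip_rank" is mean reciprocal rank
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut", "recip_rank"})
per_query = evaluator.evaluate(run)

# Aggregate one measure across queries (mean), as reported in job results
scores = [q["ndcg_cut_10"] for q in per_query.values()]
print(sum(scores) / len(scores))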

Prerequisites#

Before running Retriever evaluations:

  1. Workspace: A workspace must already exist. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Model Endpoints: Access to an embedding model endpoint (and, optionally, a reranker endpoint)

  3. API Keys (if required): Create secrets for any endpoints that require authentication

  4. Initialize the SDK:

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")

Supported Metrics#

Retriever metrics are organized into categories. Metrics with {k} support cutoffs: k ∈ {5, 10, 20, 100}.

| Category | Metrics |
| --- | --- |
| Precision | system/retriever-p-{k}, system/retriever-rprec, system/retriever-set-p, system/retriever-set-relative-p |
| Recall | system/retriever-recall-{k}, system/retriever-set-recall |
| NDCG | system/retriever-ndcg, system/retriever-ndcg-cut-{k}, system/retriever-ndcg-rel, system/retriever-rndcg |
| MAP | system/retriever-map, system/retriever-map-cut-{k}, system/retriever-gm-map, system/retriever-set-map |
| Success/Rank | system/retriever-recip-rank, system/retriever-success-{k} |
| Other | system/retriever-bpref, system/retriever-gm-bpref, system/retriever-infap, system/retriever-11pt-avg, system/retriever-set-f |

See Retriever Metrics Reference for detailed descriptions.
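
The {k} placeholder is part of the metric name, so the cutoff variants can be enumerated by substituting the supported values of k. A small sketch:

# Expand the {k} placeholder into the supported cutoff variants
ndcg_metrics = [f"system/retriever-ndcg-cut-{k}" for k in (5, 10, 20, 100)]
# ['system/retriever-ndcg-cut-5', ..., 'system/retriever-ndcg-cut-100']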


Run Metric Job#

Retriever metrics run as asynchronous jobs.

Basic Retriever Evaluation#

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/retriever-ndcg-cut-10",
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": "beir/nfcorpus",
        "metric_params": {
            "dataset_format": "beir",
            "top_k": 10
        }
    }
)

print(f"Job created: {job.name} ({job.id})")

The beir/nfcorpus dataset referenced above follows the BEIR format, which consists of three files:

corpus.jsonl - Documents to retrieve from:

{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}
{"_id": "doc2", "title": "Isaac Newton", "text": "Isaac Newton was an English mathematician and physicist."}

queries.jsonl - Queries to evaluate:

{"_id": "q1", "text": "Who developed the theory of relativity?"}
{"_id": "q2", "text": "Who discovered gravity?"}

qrels.tsv - Relevance judgments (tab-separated):

query-id	corpus-id	score
q1	doc1	1
q2	doc2	1

When the job completes, the job results include aggregate scores for the metric across all queries, for example:

{
    "aggregate_scores": [
        {
            "name": "ndcg_cut_10",
            "count": 2,
            "mean": 0.85,
            "min": 0.8,
            "max": 0.9
        }
    ]
}
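
Once the job output is available (for example, downloaded through Metrics Job Management), the aggregate scores can be read with standard JSON handling. This sketch assumes the result shown above has been saved to results.json; the file name is illustrative.

import json

# Load the result document shown above (file name is illustrative)
with open("results.json") as f:
    results = json.load(f)

# Each entry aggregates one measure over all evaluated queries
for score in results["aggregate_scores"]:
    print(f'{score["name"]}: mean={score["mean"]} (n={score["count"]})')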

Note

If your model endpoint requires authentication, configure api_key_secret with the name of the secret containing the API key (see Managing Secrets for Authenticated Endpoints).

Retriever Evaluation with Reranker#

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/retriever-ndcg-cut-10",
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            },
            "reranker_model": {
                "url": "https://integrate.api.nvidia.com/v1/ranking",
                "name": "nvidia/nv-rerankqa-mistral-4b-v3",
                "format": "nim",
                "api_key_secret": "reranker-api-key"
            }
        },
        "dataset": "beir/fiqa",
        "metric_params": {
            "dataset_format": "beir",
            "top_k": 10
        }
    }
)

Retriever Evaluation with Custom BEIR Dataset#

job = client.evaluation.metric_jobs.create(
    workspace="default",
    spec={
        "metric": "system/retriever-recall-10",
        "retriever_pipeline": {
            "embedding_model": {
                "url": "https://integrate.api.nvidia.com/v1",
                "name": "nvidia/nv-embedqa-e5-v5",
                "format": "nim",
                "api_key_secret": "embedding-api-key"
            }
        },
        "dataset": "default/my-custom-dataset",  # Fileset URN
        "metric_params": {
            "dataset_format": "beir",
            "top_k": 20,
            "truncate_long_documents": "end"
        }
    }
)

Retriever Metrics Reference#

Precision Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-p-{k} | Precision at k - fraction of top k results that are relevant. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| system/retriever-rprec | R-Precision - precision at R (number of relevant docs) | 0.0 – 1.0 |
| system/retriever-set-p | Set-based Precision - precision over unique documents | 0.0 – 1.0 |
| system/retriever-set-relative-p | Set-based Relative Precision | 0.0 – 1.0 |

Recall Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-recall-{k} | Recall at k - fraction of relevant docs in top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| system/retriever-set-recall | Set-based Recall - recall over unique documents | 0.0 – 1.0 |

NDCG Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-ndcg | Full NDCG - ranking quality with graded relevance | 0.0 – 1.0 |
| system/retriever-ndcg-cut-{k} | NDCG at cutoff k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| system/retriever-ndcg-rel | NDCG with relevance levels | 0.0 – 1.0 |
| system/retriever-rndcg | Rank-biased NDCG | 0.0 – 1.0 |

Mean Average Precision (MAP) Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-map | Mean Average Precision (full) | 0.0 – 1.0 |
| system/retriever-map-cut-{k} | MAP at cutoff k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |
| system/retriever-gm-map | Geometric Mean of Average Precision | 0.0 – 1.0 |
| system/retriever-set-map | Set-based MAP | 0.0 – 1.0 |

Reciprocal Rank & Success Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-recip-rank | Mean Reciprocal Rank - inverse of the rank of the first relevant doc | 0.0 – 1.0 |
| system/retriever-success-{k} | Success at k - whether a relevant doc appears in the top k. Available: k ∈ {5, 10, 20, 100} | 0.0 – 1.0 |

Other Metrics#

| Metric Name | Description | Value Range |
| --- | --- | --- |
| system/retriever-bpref | Binary Preference - preference of relevant over non-relevant | 0.0 – 1.0 |
| system/retriever-gm-bpref | Geometric Mean of Binary Preference | 0.0 – 1.0 |
| system/retriever-infap | Inferred Average Precision | 0.0 – 1.0 |
| system/retriever-bing | Binary Gain | 0.0 – 1.0 |
| system/retriever-g | Cumulative Gain | 0.0 – 1.0 |
| system/retriever-11pt-avg | 11-point interpolated average precision | 0.0 – 1.0 |
| system/retriever-set-f | Set-based F-measure | 0.0 – 1.0 |


Metric Parameters#

Job Spec Parameters#

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| metric | string | Yes | Metric URN (e.g., system/retriever-ndcg-cut-10) |
| retriever_pipeline | object | Yes | Retriever pipeline with embedding model |
| dataset | string | Yes | Dataset URN (e.g., beir/nfcorpus) |
| metric_params | object | No | Metric-specific parameters |

Retriever Pipeline Configuration#

{
    "embedding_model": {
        "url": "https://integrate.api.nvidia.com/v1",
        "name": "nvidia/nv-embedqa-e5-v5",
        "format": "nim",
        "api_key_secret": "optional-embedding-api-key-ref"  # Name of secret containing API key
    },
    "reranker_model": {  # Optional
        "url": "https://integrate.api.nvidia.com/v1/ranking",
        "name": "nvidia/nv-rerankqa-mistral-4b-v3",
        "format": "nim",
        "api_key_secret": "optional-reranker-api-key-ref"  # Name of secret containing API key
    }
}

Metric Parameters (metric_params)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_format | string | "beir" | Dataset format (beir) |
| top_k | int | 10 | Number of top results to retrieve |
| truncate_long_documents | string | Omitted | Handle documents exceeding 65k characters. "start": keep last 65k chars, "end": keep first 65k chars |
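
In terms of plain string slicing, the two truncation modes behave roughly as follows. This is only a conceptual sketch of the documented behavior; the service applies the limit internally.

MAX_CHARS = 65_000  # documented character limit for a single document

def truncate(text: str, mode: str) -> str:
    """Conceptual equivalent of the truncate_long_documents setting."""
    if len(text) <= MAX_CHARS:
        return text
    if mode == "end":    # truncate the end: keep the first 65k characters
        return text[:MAX_CHARS]
    if mode == "start":  # truncate the start: keep the last 65k characters
        return text[-MAX_CHARS:]
    return text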


Managing Secrets for Authenticated Endpoints#

Store API keys as secrets for secure authentication:

# Create secrets for embedding and reranker endpoints
client.secrets.create(name="embedding-api-key", data="your-embedding-key")
client.secrets.create(name="reranker-api-key", data="your-reranker-key")

Reference secrets by name in your metric configuration:

"embedding_model": {
    "url": "https://integrate.api.nvidia.com/v1",
    "name": "nvidia/nv-embedqa-e5-v5",
    "format": "nim",
    "api_key_secret": "optional-embedding-api-key-ref"  # Name of secret, not the actual API key
}

Dataset Format#

Retriever metrics support the BEIR dataset format.

BEIR Format#

BEIR (Benchmarking Information Retrieval) datasets consist of three files:

corpus.jsonl#

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| _id | string | Yes | Unique document identifier |
| title | string | No | Document title (optional) |
| text | string | Yes | Document text content |

Example:

{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."}

queries.jsonl#

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| _id | string | Yes | Unique query identifier |
| text | string | Yes | Query text |

Example:

{"_id": "q1", "text": "Who developed the theory of relativity?"}

qrels.tsv#

Tab-separated file with relevance judgments:

| Column | Type | Description |
| --- | --- | --- |
| query-id | string | Query identifier (matches queries.jsonl _id) |
| corpus-id | string | Document identifier (matches corpus.jsonl _id) |
| score | integer | Relevance score (typically 1 for relevant, 0 for not) |

Example:

query-id	corpus-id	score
q1	doc1	1
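
All three files can be produced with the standard library alone. A minimal sketch that writes a two-document dataset matching the examples above; the output directory name is illustrative:

import csv
import json
from pathlib import Path

out = Path("my-custom-dataset")
out.mkdir(exist_ok=True)

corpus = [
    {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."},
    {"_id": "doc2", "title": "Isaac Newton", "text": "Isaac Newton was an English mathematician and physicist."},
]
queries = [
    {"_id": "q1", "text": "Who developed the theory of relativity?"},
    {"_id": "q2", "text": "Who discovered gravity?"},
]
qrels = [("q1", "doc1", 1), ("q2", "doc2", 1)]

# One JSON object per line for the corpus and queries
with open(out / "corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with open(out / "queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)

# Tab-separated relevance judgments with a header row
with open(out / "qrels.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(qrels)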

Built-in Datasets#

The platform provides built-in BEIR datasets that can be referenced by name:

| Dataset | Description | Corpus Size | Use Case |
| --- | --- | --- | --- |
| beir/nfcorpus | Natural language corpus for biomedical IR | ~3.6K docs | Small dataset for quick testing |
| beir/fiqa | Financial question answering | ~57K docs | Domain-specific retrieval |
| beir/scidocs | Scientific document retrieval | ~25K docs | Academic/scientific retrieval |
| beir/scifact | Scientific fact verification | ~5K docs | Fact-checking retrieval |

Usage:

# Reference built-in dataset by name
"dataset": "beir/nfcorpus"

You can also use a custom BEIR dataset:

  • Fileset URN: Upload the dataset to the Files API and reference it as workspace/fileset-name

  • The fileset must contain corpus.jsonl, queries.jsonl, and qrels.tsv

Note

For a complete list of BEIR datasets, refer to the BEIR repository.


Job Management#

After successfully creating a job, navigate to Metrics Job Management to monitor its progress and manage its execution.


Choosing the Right Metric#

| Use Case | Recommended Metrics |
| --- | --- |
| General retrieval quality | system/retriever-ndcg-cut-10, system/retriever-map |
| Checking if relevant docs appear at all | system/retriever-success-10, system/retriever-recall-10 |
| Top result quality | system/retriever-recip-rank, system/retriever-p-5 |
| Complete retrieval | system/retriever-recall-100, system/retriever-ndcg |
| Ranking quality with graded relevance | system/retriever-ndcg-cut-* metrics |
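
Because each metric runs as its own job, a comparison across several of the metrics above can be scripted by reusing one pipeline and dataset definition with the job-creation call shown earlier (using the client initialized in Prerequisites). A sketch:

# Shared retriever pipeline for all jobs
pipeline = {
    "embedding_model": {
        "url": "https://integrate.api.nvidia.com/v1",
        "name": "nvidia/nv-embedqa-e5-v5",
        "format": "nim",
        "api_key_secret": "embedding-api-key",
    }
}

metrics = [
    "system/retriever-ndcg-cut-10",
    "system/retriever-recall-10",
    "system/retriever-recip-rank",
]

jobs = {}
for metric in metrics:
    job = client.evaluation.metric_jobs.create(
        workspace="default",
        spec={
            "metric": metric,
            "retriever_pipeline": pipeline,
            "dataset": "beir/nfcorpus",
            "metric_params": {"dataset_format": "beir", "top_k": 10},
        },
    )
    jobs[metric] = job.id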


Troubleshooting#

Common Errors#

| Error | Cause | Solution |
| --- | --- | --- |
| Job stuck in “pending” | Embedding endpoint not accessible | Verify endpoint URL and API key secret |
| Authentication failed | Invalid or missing API key | Check secret name matches exactly |
| Out of memory | Large corpus | Reduce corpus size or use truncate_long_documents |
| Low recall scores | Not enough documents retrieved | Increase top_k parameter |
| Zero scores | Missing relevance judgments | Ensure qrels.tsv has entries for your queries |
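
For the “zero scores” case in particular, a quick offline consistency check of a custom BEIR dataset can rule out mismatched identifiers. A sketch using the standard library; the file paths are illustrative:

import csv
import json

def ids(path):
    """Collect the _id values from a JSONL file."""
    with open(path) as f:
        return {json.loads(line)["_id"] for line in f if line.strip()}

corpus_ids = ids("my-custom-dataset/corpus.jsonl")
query_ids = ids("my-custom-dataset/queries.jsonl")

with open("my-custom-dataset/qrels.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

missing_queries = {r["query-id"] for r in rows} - query_ids
missing_docs = {r["corpus-id"] for r in rows} - corpus_ids
judged_queries = {r["query-id"] for r in rows}

print("qrels rows:", len(rows))
print("query IDs in qrels but not in queries.jsonl:", missing_queries or "none")
print("corpus IDs in qrels but not in corpus.jsonl:", missing_docs or "none")
print("queries without any judgment:", (query_ids - judged_queries) or "none")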

Tips for Better Results#

  • Start with small datasets like beir/nfcorpus to validate configuration

  • Use top_k > cutoff - e.g., set top_k: 20 when measuring retriever-ndcg-cut-10

  • Enable truncation for datasets with long documents to avoid embedding errors

  • Compare with reranker - add a reranker to see ranking improvement


Limitations#

  1. Dataset Format: Retriever metrics currently only support the BEIR dataset format. Ensure your data follows the required structure.

  2. Document Length: Documents exceeding 65k characters may need truncation. Use truncate_long_documents: "end" to keep the first 65k characters or "start" to keep the last 65k characters.

  3. Memory Usage: Large corpora may require significant memory for indexing. Consider limiting evaluation to smaller subsets for testing.

See also