Evaluation Benchmarks#

Benchmarks are pre-configured evaluation suites that combine metrics with curated datasets to measure model performance against established standards. NeMo Evaluator provides two types of benchmarks:

  • Industry Benchmarks: Industry-standard academic benchmarks (MMLU, HumanEval, GSM8K, etc.) for comparing model capabilities against published baselines

  • Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets

Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks may not capture the nuances of your application—such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.

Industry Benchmarks vs Custom Benchmarks#

| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |

Discover Industry Benchmarks#

Discover the industry benchmarks available for your evaluation jobs in the system workspace. You can list all industry benchmarks or filter them by category label.

Note

The system workspace is a reserved workspace in NeMo Microservices that contains ready-to-use industry benchmarks with published datasets and metrics.

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices()

# List all industry benchmarks in the system workspace
all_system_benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(all_system_benchmarks)

# Filter by evaluation category label
filtered_system_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "advanced_reasoning"},
)
print(filtered_system_benchmarks)

| Category Label | Description |
|---|---|
| agentic | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| code | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| content_safety | Evaluate model safety risks, including vulnerability to generating harmful, biased, or misleading content. |
| instruction_following | Evaluate the ability to follow explicit formatting and structural instructions. |
| language_understanding | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| math | Evaluate mathematical reasoning abilities. |
| question_answering | Evaluate the ability to generate answers to questions. |
| rag | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| retrieval | Evaluate the quality of document retriever pipelines. |
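
Any category label from the table can be used with the same filter shown earlier. As a quick sketch, reusing the client created above, the only change needed to list code-generation benchmarks is the label value:

# Filter industry benchmarks by the "code" category label
code_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "code"},
)
print(code_benchmarks)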

Create Custom Benchmarks#

Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you’ll need to create the metrics that define how to score your model’s outputs.

benchmark = client.evaluation.benchmarks.create(
    workspace="my-workspace",
    name="my-qa-benchmark",
    description="Evaluates question-answering quality",
    metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
    dataset="my-workspace/qa-test-dataset",
    labels={"my-label": "label-value"}, # optional user-input labels to apply to the benchmark
)

List Custom Benchmarks#

List all custom benchmarks within your workspace.

benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace")

Filter custom benchmarks by label or dataset:

extra_query = {
    # Filter by label
    "search[data.labels.my-label]": "label-value",
    # Filter by dataset
    "filter[dataset]": "my-workspace/qa-test-dataset",
}

benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace", extra_query=extra_query)
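
The examples above print the raw response. The sketch below iterates over the results instead; it assumes the list response exposes the returned entries on a data attribute, which may differ between SDK versions:

# Print the name of each returned benchmark.
# Note: the data attribute is an assumption about the response shape;
# verify it against your SDK version.
for benchmark in benchmarks.data:
    print(benchmark.name)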

Run Benchmark Jobs#

Create a benchmark job to run a benchmark against either a pre-generated dataset (offline) or a model that generates outputs during evaluation (online).

Offline Job (Dataset Evaluation)#

Evaluate a pre-generated dataset:

from nemo_microservices.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/my-qa-benchmark",
    )
)

print(f"Job created: {job.name}")

Online Job (Model Evaluation)#

Evaluate a model directly by generating its outputs during the benchmark run:

from nemo_microservices.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={
            "url": "<your-nim-url>/v1/completions",
            "name": "meta/llama-3.1-8b-instruct"
        }
    )
)

print(f"Job created: {job.name}")

Job Management#

After successfully creating a job, navigate to Benchmark Job Management to oversee its execution and monitor its progress.
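
As a minimal status-check sketch, assuming the SDK exposes a retrieve method on benchmark_jobs that mirrors the create call shown above (verify the exact method name and parameters on the Benchmark Job Management page):

# Minimal status-check sketch. The retrieve call and the status field are
# assumptions about the SDK surface; see Benchmark Job Management for the
# documented calls.
job_status = client.evaluation.benchmark_jobs.retrieve(
    job.name,
    workspace="my-workspace",
)
print(job_status.status)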

Benchmark Categories#

  • Custom Benchmarks: Compose a custom benchmark from a collection of metrics to evaluate tasks tailored to your needs.

  • Agentic Benchmarks: Evaluate agent workflows, including tool calling, goal accuracy, topic adherence, and trajectory evaluation.

  • Industry Benchmarks: Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.