Evaluation Benchmarks#

Benchmarks are pre-configured evaluation suites that combine metrics with curated datasets to measure model performance against established standards. NeMo Evaluator provides two types of benchmarks:

  • Industry Benchmarks: Industry-standard academic benchmarks (MMLU, HumanEval, GSM8K, etc.) for comparing model capabilities against published baselines

  • Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets

Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks may not capture the nuances of your application—such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.

Industry Benchmarks vs Custom Benchmarks#

| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |

Discover Industry Benchmarks#

Discover the industry benchmarks available for your evaluation jobs in the system workspace. You can list all industry benchmarks or filter them by category label.

Note

The system workspace is a reserved workspace in NeMo Microservices that contains ready-to-use industry benchmarks with published datasets and metrics.

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices()

# List all industry benchmarks in the system workspace
all_system_benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(all_system_benchmarks)

# Filter by evaluation category label
filtered_system_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "advanced_reasoning"},
)
print(filtered_system_benchmarks)

| Category Label | Description |
|---|---|
| agentic | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| code | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| content_safety | Evaluate model safety risks, including vulnerability to generating harmful, biased, or misleading content. |
| instruction_following | Evaluate the ability to follow explicit formatting and structural instructions. |
| language_understanding | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| math | Evaluate mathematical reasoning abilities. |
| question_answering | Evaluate the ability to generate answers to questions. |
| rag | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| retrieval | Evaluate the quality of document retriever pipelines. |
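
Any category label from the table can be used with the same filter shown earlier. As a quick sketch, reusing the client created above, the only change needed to list code-generation benchmarks is the label value:

# Filter industry benchmarks by the "code" category label
code_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "code"},
)
print(code_benchmarks)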

Create Custom Benchmarks#

Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you’ll need to create the metrics that define how to score your model’s outputs.

benchmark = client.evaluation.benchmarks.create(
    workspace="my-workspace",
    name="my-qa-benchmark",
    description="Evaluates question-answering quality",
    metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
    dataset="my-workspace/qa-test-dataset",
    labels={"my-label": "label-value"}, # optional user-input labels to apply to the benchmark
)

List Custom Benchmarks#

List all custom benchmarks within your workspace.

benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace")

Filter custom benchmarks by label or dataset:

extra_query = {
    # Filter by label
    "search[data.labels.my-label]": "label-value",
    # Filter by dataset
    "filter[dataset]": "my-workspace/qa-test-dataset",
}

benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace", extra_query=extra_query)
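
The examples above print the raw response. The sketch below iterates over the results instead; it assumes the list response exposes the returned entries on a data attribute, which may differ between SDK versions:

# Print the name of each returned benchmark.
# Note: the data attribute is an assumption about the response shape;
# verify it against your SDK version.
for benchmark in benchmarks.data:
    print(benchmark.name)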

Run Benchmark Jobs#

Create a benchmark job to run a benchmark against either a pre-generated dataset (offline) or a model that generates outputs during evaluation (online).

Offline Job (Dataset Evaluation)#

Evaluate a pre-generated dataset:

from nemo_microservices.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/my-qa-benchmark",
    )
)

print(f"Job created: {job.name}")

Online Job (Model Evaluation)#

Evaluate a model directly by generating its outputs during the benchmark run:

from nemo_microservices.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={
            "url": "<your-nim-url>/v1/completions",
            "name": "meta/llama-3.1-8b-instruct"
        }
    )
)

print(f"Job created: {job.name}")

Job Management#

After successfully creating a job, navigate to Benchmark Job Management to oversee its execution and monitor its progress.
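
As a minimal status-check sketch, assuming the SDK exposes a retrieve method on benchmark_jobs that mirrors the create call shown above (verify the exact method name and parameters on the Benchmark Job Management page):

# Minimal status-check sketch. The retrieve call and the status field are
# assumptions about the SDK surface; see Benchmark Job Management for the
# documented calls.
job_status = client.evaluation.benchmark_jobs.retrieve(
    job.name,
    workspace="my-workspace",
)
print(job_status.status)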

Benchmark Categories#

  • Custom Benchmarks: Compose a custom benchmark from a collection of metrics to evaluate tasks tailored to your needs.

  • Agentic Benchmarks: Evaluate agent workflows, including tool calling, goal accuracy, topic adherence, and trajectory evaluation.

  • Industry Benchmarks: Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.