Evaluation Benchmarks#
Benchmarks are pre-configured evaluation suites that combine metrics with curated datasets to measure model performance against established standards. NeMo Evaluator provides two types of benchmarks:
- Industry Benchmarks: Industry-standard academic benchmarks (MMLU, HumanEval, GSM8K, etc.) for comparing model capabilities against published baselines
- Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets
Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks may not capture the nuances of your application—such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.
Industry Benchmarks vs Custom Benchmarks#
| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |
Discover Industry Benchmarks#
Discover the industry benchmarks available for your evaluation jobs in the system workspace. List all industry benchmarks, or filter them by category label.
Note
The system workspace is a reserved workspace for NeMo Microservices that contains ready-to-use industry benchmarks with published datasets and metrics.
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices()
all_system_benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(all_system_benchmarks)
# Filter by evaluation category label
filtered_system_benchmarks = client.evaluation.benchmarks.list(
workspace="system",
extra_query={"search[data.labels.eval_category]": "advanced_reasoning"},
)
print(filtered_system_benchmarks)
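You can also inspect the returned entries programmatically. The following is a minimal sketch that assumes the list response exposes its entries on a data attribute and that each entry carries the name and description fields shown elsewhere on this page; adjust it if the actual response shape differs.
# Hedged sketch: assumes the list response exposes a `data` collection whose
# entries have `name` and `description` attributes.
for benchmark in all_system_benchmarks.data:
    print(benchmark.name, "-", benchmark.description)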
| Category Label | Description |
|---|---|
| | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| | Evaluate model safety risks, including the tendency to generate harmful, biased, or misleading content. |
| | Evaluate the ability to follow explicit formatting and structural instructions. |
| | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| | Evaluate mathematical reasoning abilities. |
| | Evaluate the ability to generate answers to questions. |
| | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| | Evaluate the quality of document retriever pipelines. |
Create Custom Benchmarks#
Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you’ll need to create the metrics that define how to score your model’s outputs.
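If the metrics you plan to reference do not exist yet in your workspace, create them first. The exact fields depend on the metric type, so the snippet below is only an illustrative sketch: it assumes a client.evaluation.metrics.create method that mirrors benchmarks.create, and it omits metric-type-specific configuration. Refer to the metrics documentation for the authoritative schema.
# Illustrative sketch only: assumes an evaluation.metrics.create method that
# mirrors benchmarks.create; metric-type-specific configuration is omitted.
# Refer to the metrics documentation for the authoritative schema.
metric = client.evaluation.metrics.create(
    workspace="my-workspace",
    name="answer-relevancy",
    description="Scores how relevant each answer is to its question",
)
print(metric)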
benchmark = client.evaluation.benchmarks.create(
workspace="my-workspace",
name="my-qa-benchmark",
description="Evaluates question-answering quality",
metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
dataset="my-workspace/qa-test-dataset",
labels={"my-label": "label-value"}, # optional user-input labels to apply to the benchmark
)
List Custom Benchmarks#
List all custom benchmarks within your workspace.
benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace")
Filter custom benchmarks by label or dataset:
extra_query = {
# Filter by label
"search[data.labels.my-label]": "label-value",
# Filter by dataset
"filter[dataset]": "my-workspace/qa-test-dataset",
}
benchmarks = client.evaluation.benchmarks.list(workspace="my-workspace", extra_query=extra_query)
Run Benchmark Jobs#
Create a benchmark evaluation job to run the benchmark against your data.
Offline Job (Dataset Evaluation)#
Evaluate a pre-generated dataset:
from nemo_microservices.types.evaluation import BenchmarkOfflineJobParam
job = client.evaluation.benchmark_jobs.create(
workspace="my-workspace",
spec=BenchmarkOfflineJobParam(
benchmark="my-workspace/my-qa-benchmark",
)
)
print(f"Job created: {job.name}")
Online Job (Model Evaluation)#
Evaluate a model directly by generating outputs during the benchmark:
from nemo_microservices.types.evaluation import BenchmarkOnlineJobParam
job = client.evaluation.benchmark_jobs.create(
workspace="my-workspace",
spec=BenchmarkOnlineJobParam(
benchmark="system/mmlu-pro",
model={
"url": "<your-nim-url>/v1/completions",
"name": "meta/llama-3.1-8b-instruct"
}
)
)
print(f"Job created: {job.name}")
Job Management#
After you create a job, navigate to Benchmark Job Management to oversee its execution and monitor its progress.
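For a quick programmatic check, the sketch below assumes a benchmark_jobs.retrieve accessor and a status field on the returned job; treat both as assumptions and see Benchmark Job Management for the supported operations.
# Hedged sketch: `retrieve` and the `status` field are assumptions based on the
# SDK's resource pattern; see Benchmark Job Management for supported operations.
job_status = client.evaluation.benchmark_jobs.retrieve(
    job.name,
    workspace="my-workspace",
)
print(job_status.status)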
Benchmark Categories#
- Compose a custom benchmark with a collection of metrics to evaluate tasks bespoke to your needs.
- Evaluate agent workflows, including tool calling, goal accuracy, topic adherence, and trajectory evaluation.
- Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.