Custom Benchmarks#
Custom benchmarks allow you to create reusable evaluation suites tailored to your specific use case. A benchmark combines one or more metrics with a dataset, enabling consistent evaluation across multiple models or pipeline versions.
Note
Custom benchmarks can only include custom metrics that you create in your workspace. System metrics (in the system workspace) cannot be included in custom benchmarks at this time. To use system metrics, see Industry Benchmarks.
Prerequisites#
Before creating a custom benchmark, ensure you have:
A workspace created for your project
One or more custom metrics defined in your workspace
A dataset uploaded as a fileset
Initialize the SDK client with your deployment URL and workspace:
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",  # Required: your deployment URL
    workspace="my-workspace",
)
Tip
Set base_url to your NeMo Evaluator deployment endpoint. For local development, this is typically http://localhost:8000.
Dataset Requirements#
Your dataset must be compatible with all metrics in the benchmark. Each metric defines input templates (like {{output}} and {{reference}}) that map to columns in your dataset.
For offline evaluation, your dataset should contain pre-generated model outputs as a JSON array:
[
  {"input": "What is the capital of France?", "output": "Paris", "reference": "Paris"},
  {"input": "What is 2+2?", "output": "4", "reference": "4"}
]
For online evaluation, your dataset contains inputs that will be sent to the model. The prompt_template you provide must reference columns in your dataset:
[
  {"question": "What is the capital of France?", "expected_answer": "Paris"},
  {"question": "What is 2+2?", "expected_answer": "4"}
]
Note
The dataset file must be named dataset.json and uploaded to your fileset.
Important
Ensure your dataset columns match both:
The input templates defined in your metrics (e.g., {{output}}, {{reference}})
The prompt_template used in online evaluation jobs (e.g., {{question}})
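Before uploading, you can sanity-check the file locally. The following is a minimal sketch, not part of the NeMo Evaluator SDK: it assumes the columns your metric templates and prompt_template reference are listed in required_columns, and verifies that every record in dataset.json contains them.
import json

# Columns referenced by your metric templates and prompt_template (adjust to your benchmark)
required_columns = {"input", "output", "reference"}

with open("dataset.json") as f:
    records = json.load(f)  # the file must be a JSON array of records

for i, record in enumerate(records):
    missing = required_columns - record.keys()
    if missing:
        print(f"Record {i} is missing columns: {sorted(missing)}")

print(f"Checked {len(records)} records")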
Create a Custom Benchmark#
A benchmark requires:
name: Unique identifier within the workspace
description: Human-readable description of what the benchmark evaluates
metrics: List of metric references in workspace/metric-name format
dataset: Fileset reference in workspace/fileset-name format
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",
    workspace="my-workspace",
)

benchmark = client.evaluation.benchmarks.create(
    name="customer-support-quality",
    description="Evaluates response quality for customer support conversations",
    metrics=[
        "my-workspace/answer-relevancy",
        "my-workspace/response-helpfulness",
    ],
    dataset="my-workspace/support-test-cases",
)
print(f"Created benchmark: {benchmark.name}")
List Benchmarks#
List all benchmarks in a workspace:
benchmarks = client.evaluation.benchmarks.list()

for benchmark in benchmarks.data:
    print(f"{benchmark.name}: {benchmark.description}")
Retrieve a Benchmark#
Get details of a specific benchmark:
benchmark = client.evaluation.benchmarks.retrieve(
    name="customer-support-quality",
)
print(f"Benchmark: {benchmark.name}")
print(f"Description: {benchmark.description}")
print(f"Metrics: {benchmark.metrics}")
print(f"Dataset: {benchmark.dataset}")
Delete a Benchmark#
Remove a benchmark when it’s no longer needed:
client.evaluation.benchmarks.delete(
    name="customer-support-quality",
)
Run Benchmark Evaluation Jobs#
Once you’ve created a benchmark, run evaluation jobs against it. There are two job types:
| Job Type | Use When | Dataset Contains |
|---|---|---|
| Offline | You have pre-generated model outputs to evaluate | Input, output, and reference columns |
| Online | You want to generate and evaluate responses in one job | Input and reference columns (model generates output) |
Choose offline evaluation for:
Evaluating pre-generated model outputs (e.g., from batch inference)
Comparing multiple model versions on the same test set
Choose online evaluation for:
Testing a model endpoint with live inference
End-to-end evaluation that generates and scores outputs in a single job
Offline Evaluation#
Offline evaluation assesses pre-existing model outputs stored in your dataset. Use this when you have already generated responses and want to evaluate their quality.
from nemo_microservices.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
    ),
)

print(f"Job created: {job.name}")
print(f"Status: {job.status}")
With Execution Parameters#
Control job execution with optional parameters:
from nemo_microservices.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
        params=EvaluationJobParamsParam(
            parallelism=8,
            limit_samples=100,  # Evaluate only the first 100 samples
        ),
    ),
)
Online Evaluation#
Online evaluation generates model responses at runtime, then evaluates them against your metrics. Use this to evaluate a model’s live performance.
from nemo_microservices.types.evaluation import (
    BenchmarkOnlineJobParam,
    InlineModelParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOnlineJobParam(
        benchmark="my-workspace/customer-support-quality",
        model=InlineModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-8b-instruct",
            api_key_secret="my-workspace/nvidia-api-key",
        ),
        prompt_template="Answer the following customer question:\n\n{{input}}",
    ),
)
Job Management#
After creating a job, see Benchmark Job Management to monitor its execution and track progress.
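If you prefer to check from the SDK, you can also poll the job directly. This is a minimal sketch that reuses the get_status() call from the complete example below; the in-progress status values ("pending", "active", "created") are assumed from that example.
import time

# Poll until the job leaves its in-progress states; treat any other value as terminal
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while status.status in ("pending", "active", "created"):
    print(f"Waiting... current status: {status.status}")
    time.sleep(10)
    status = client.evaluation.benchmark_jobs.get_status(name=job.name)

print(f"Final status: {status.status}")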
Retrieve Results#
After your job completes, retrieve and analyze results. See Benchmark Results for detailed examples of downloading aggregate scores, row-level scores, and analyzing results with Pandas.
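For a quick look without leaving the SDK, you can download the aggregate scores (the same call used in the complete example below) and flatten them with Pandas. The exact shape of the returned object depends on your metrics, so treat the json_normalize step as a sketch rather than a guaranteed layout.
import pandas as pd

# Download aggregate scores for a completed job
results = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
)

# Flatten into a DataFrame for inspection; assumes `results` is JSON-compatible
df = pd.json_normalize(results)
print(df.head())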
Complete Example#
Here’s a complete workflow for creating a benchmark and running an evaluation.
Note
This example assumes you have already created the metrics (exact-match, f1-score) and uploaded your dataset (qa-test-data). See Manage Metrics for how to create custom metrics.
import json
import time

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

workspace = "my-workspace"

client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",
    workspace=workspace,
)

# 1. Create a custom benchmark
benchmark = client.evaluation.benchmarks.create(
    name="qa-accuracy-benchmark",
    description="Measures answer accuracy for Q&A tasks",
    metrics=[f"{workspace}/exact-match", f"{workspace}/f1-score"],
    dataset=f"{workspace}/qa-test-data",
)
print(f"Created benchmark: {benchmark.name}")

# 2. Run an offline evaluation job
job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark=f"{workspace}/{benchmark.name}",
        params=EvaluationJobParamsParam(parallelism=16),
    ),
)
print(f"Started job: {job.name}")

# 3. Wait for completion
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while status.status in ("pending", "active", "created"):
    print(f"Status: {status.status}")
    time.sleep(10)
    status = client.evaluation.benchmark_jobs.get_status(name=job.name)
print(f"Job completed: {status.status}")

# 4. Get results
if status.status == "completed":
    results = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
        name=job.name,
    )
    print("Results:")
    print(json.dumps(results, indent=2))
Troubleshooting#
Common Errors#
“Benchmark not found”
Verify the benchmark reference format: workspace/benchmark-name
Ensure the benchmark was created in the correct workspace
Check that the benchmark wasn’t deleted
“Metric not found”
Ensure all metrics referenced in the benchmark exist in your workspace
Remember that system metrics (system/...) cannot be used in custom benchmarks
Verify metric names match exactly (case-sensitive)
“Fileset not found”
Verify the dataset fileset was uploaded to the correct workspace
Check the fileset reference format: workspace/fileset-name
Ensure at least one file was uploaded to the fileset
Job status “error”
Check job logs using get_logs() for specific error messages, as in the sketch after this list
Verify your dataset columns match the metric input templates
For online jobs, verify the model endpoint is accessible and the API key is valid
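A minimal sketch of pulling logs for a failed job. The get_logs() method is the one named above; the placement alongside get_status() on benchmark_jobs is an assumption, so adjust the call path to match your SDK version.
# Assumed placement of get_logs(), mirroring get_status(); verify against your SDK
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
if status.status == "error":
    logs = client.evaluation.benchmark_jobs.get_logs(name=job.name)
    print(logs)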
Dataset column mismatch
Ensure your dataset contains all columns referenced by your metrics
For offline jobs: typically input, output, reference
For online jobs: columns referenced in your prompt_template
Debugging Tips#
Start small: Test with limit_samples=10 to quickly identify issues
Check logs: Always review job logs when a job fails
Validate dataset: Ensure dataset.json is valid JSON and columns are consistent across records
Test metrics first: Run individual metric evaluations before combining into a benchmark
For additional help, see Troubleshooting.