Custom Benchmarks#

Custom benchmarks allow you to create reusable evaluation suites tailored to your specific use case. A benchmark combines one or more metrics with a dataset, enabling consistent evaluation across multiple models or pipeline versions.

Note

Custom benchmarks can only include custom metrics that you create in your workspace. System metrics (in the system workspace) cannot be included in custom benchmarks at this time. To use system metrics, see Industry Benchmarks.

Prerequisites#

Before creating a custom benchmark, ensure you have:

  • One or more custom metrics created in your workspace (see Manage Metrics)

  • A dataset uploaded to a fileset in your workspace (see Dataset Requirements below)

  • A configured NeMo Microservices client:

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",  # Required: your deployment URL
    workspace="my-workspace",
)

Tip

Set base_url to your NeMo Evaluator deployment endpoint. For local development, this is typically http://localhost:8000.

Dataset Requirements#

Your dataset must be compatible with all metrics in the benchmark. Each metric defines input templates (like {{output}} and {{reference}}) that map to columns in your dataset.

For offline evaluation, your dataset should contain pre-generated model outputs as a JSON array:

[
  {"input": "What is the capital of France?", "output": "Paris", "reference": "Paris"},
  {"input": "What is 2+2?", "output": "4", "reference": "4"}
]

For online evaluation, your dataset contains inputs that will be sent to the model. The prompt_template you provide must reference columns in your dataset:

[
  {"question": "What is the capital of France?", "expected_answer": "Paris"},
  {"question": "What is 2+2?", "expected_answer": "4"}
]

Note

The dataset file must be named dataset.json and uploaded to your fileset.
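
For example, a minimal script like the following produces a correctly named dataset.json for an offline evaluation. The rows shown are illustrative; the upload step itself goes through your fileset workflow.

import json

# Illustrative rows for an offline evaluation dataset; replace with your own data.
rows = [
    {"input": "What is the capital of France?", "output": "Paris", "reference": "Paris"},
    {"input": "What is 2+2?", "output": "4", "reference": "4"},
]

# The file must be named dataset.json; upload it to your fileset afterward.
with open("dataset.json", "w") as f:
    json.dump(rows, f, indent=2)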

Important

Ensure your dataset columns match both:

  1. The input templates defined in your metrics (e.g., {{output}}, {{reference}})

  2. The prompt_template used in online evaluation jobs (e.g., {{question}})

Create a Custom Benchmark#

A benchmark requires:

  • name: Unique identifier within the workspace

  • description: Human-readable description of what the benchmark evaluates

  • metrics: List of metric references in workspace/metric-name format

  • dataset: Fileset reference in workspace/fileset-name format

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",
    workspace="my-workspace",
)

benchmark = client.evaluation.benchmarks.create(
    name="customer-support-quality",
    description="Evaluates response quality for customer support conversations",
    metrics=[
        "my-workspace/answer-relevancy",
        "my-workspace/response-helpfulness",
    ],
    dataset="my-workspace/support-test-cases",
)

print(f"Created benchmark: {benchmark.name}")

List Benchmarks#

List all benchmarks in a workspace:

benchmarks = client.evaluation.benchmarks.list()

for benchmark in benchmarks.data:
    print(f"{benchmark.name}: {benchmark.description}")

Retrieve a Benchmark#

Get details of a specific benchmark:

benchmark = client.evaluation.benchmarks.retrieve(
    name="customer-support-quality",
)

print(f"Benchmark: {benchmark.name}")
print(f"Description: {benchmark.description}")
print(f"Metrics: {benchmark.metrics}")
print(f"Dataset: {benchmark.dataset}")

Delete a Benchmark#

Remove a benchmark when it’s no longer needed:

client.evaluation.benchmarks.delete(
    name="customer-support-quality",
)

Run Benchmark Evaluation Jobs#

Once you’ve created a benchmark, run evaluation jobs against it. There are two job types:

Job Type | Use When                                                | Dataset Contains
---------|---------------------------------------------------------|-----------------------------------------------------
Offline  | You have pre-generated model outputs to evaluate        | Input, output, and reference columns
Online   | You want to generate and evaluate responses in one job  | Input and reference columns (model generates output)

Choose offline evaluation for:

  • Evaluating pre-generated model outputs (e.g., from batch inference)

  • Comparing multiple model versions on the same test set

Choose online evaluation for:

  • Testing a model endpoint with live inference

  • End-to-end evaluation that generates and scores outputs in a single job

Offline Evaluation#

Offline evaluation assesses pre-existing model outputs stored in your dataset. Use this when you have already generated responses and want to evaluate their quality.

from nemo_microservices.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
    ),
)

print(f"Job created: {job.name}")
print(f"Status: {job.status}")

With Execution Parameters#

Control job execution with optional parameters:

from nemo_microservices.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
        params=EvaluationJobParamsParam(
            parallelism=8,
            limit_samples=100,  # Evaluate only first 100 samples
        ),
    ),
)

Online Evaluation#

Online evaluation generates model responses at runtime, then evaluates them against your metrics. Use this to evaluate a model’s live performance.

from nemo_microservices.types.evaluation import (
    BenchmarkOnlineJobParam,
    InlineModelParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOnlineJobParam(
        benchmark="my-workspace/customer-support-quality",
        model=InlineModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-8b-instruct",
            api_key_secret="my-workspace/nvidia-api-key",
        ),
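        # The {{input}} placeholder must match a column name in the benchmark's dataset.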
        prompt_template="Answer the following customer question:\n\n{{input}}",
    ),
)

Job Management#

After you create a job, see Benchmark Job Management to monitor its progress and manage its execution.
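
From the SDK, you can also do a quick check on a job directly. The sketch below reuses the get_status call from the complete example later on this page; the get_logs accessor path is an assumption based on the get_logs() call referenced under Troubleshooting.

# Quick status check for the job created above.
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
print(f"Status: {status.status}")

# If the job reports an error, pull its logs for details.
# Note: the exact get_logs accessor path may differ in your SDK version.
if status.status == "error":
    logs = client.evaluation.benchmark_jobs.get_logs(name=job.name)
    print(logs)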

Retrieve Results#

After your job completes, retrieve and analyze results. See Benchmark Results for detailed examples of downloading aggregate scores, row-level scores, and analyzing results with Pandas.
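
As a quick reference, the aggregate-scores download used in the complete example below looks like this once the job has completed:

import json

# Download aggregate scores for a completed job
# (the same call appears in the complete example below).
results = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
)
print(json.dumps(results, indent=2))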

Complete Example#

Here’s a complete workflow for creating a benchmark and running an evaluation.

Note

This example assumes you have already created the metrics (exact-match, f1-score) and uploaded your dataset (qa-test-data). See Manage Metrics for how to create custom metrics.

import json
import time

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

workspace = "my-workspace"
client = NeMoMicroservices(
    base_url="http://your-nemo-service:8000",
    workspace=workspace,
)

# 1. Create a custom benchmark
benchmark = client.evaluation.benchmarks.create(
    name="qa-accuracy-benchmark",
    description="Measures answer accuracy for Q&A tasks",
    metrics=[f"{workspace}/exact-match", f"{workspace}/f1-score"],
    dataset=f"{workspace}/qa-test-data",
)
print(f"Created benchmark: {benchmark.name}")

# 2. Run an offline evaluation job
job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark=f"{workspace}/{benchmark.name}",
        params=EvaluationJobParamsParam(parallelism=16),
    ),
)
print(f"Started job: {job.name}")

# 3. Wait for completion
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while status.status in ("pending", "active", "created"):
    print(f"Status: {status.status}")
    time.sleep(10)
    status = client.evaluation.benchmark_jobs.get_status(name=job.name)

print(f"Job completed: {status.status}")

# 4. Get results
if status.status == "completed":
    results = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
        name=job.name,
    )
    print("Results:")
    print(json.dumps(results, indent=2))

Troubleshooting#

Common Errors#

“Benchmark not found”

  • Verify the benchmark reference format: workspace/benchmark-name

  • Ensure the benchmark was created in the correct workspace

  • Check that the benchmark wasn’t deleted

“Metric not found”

  • Ensure all metrics referenced in the benchmark exist in your workspace

  • Remember that system metrics (system/...) cannot be used in custom benchmarks

  • Verify metric names match exactly (case-sensitive)

“Fileset not found”

  • Verify the dataset fileset was uploaded to the correct workspace

  • Check the fileset reference format: workspace/fileset-name

  • Ensure the dataset.json file was uploaded to the fileset

Job status “error”

  • Check job logs using get_logs() for specific error messages

  • Verify your dataset columns match the metric input templates

  • For online jobs, verify the model endpoint is accessible and the API key is valid

Dataset column mismatch

  • Ensure your dataset contains all columns referenced by your metrics

  • For offline jobs: typically input, output, reference

  • For online jobs: columns referenced in your prompt_template

Debugging Tips#

  1. Start small: Test with limit_samples=10 to quickly identify issues (see the sketch after this list)

  2. Check logs: Always review job logs when a job fails

  3. Validate dataset: Ensure your dataset.json is valid JSON and that columns are consistent across rows

  4. Test metrics first: Run individual metric evaluations before combining into a benchmark
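
For example, a smoke-test run like the following (a sketch reusing the benchmark and parameter types shown earlier on this page) surfaces dataset or metric mismatches quickly before you commit to a full evaluation:

from nemo_microservices.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

# Smoke test: evaluate only the first 10 samples to catch configuration
# issues (column mismatches, missing metrics) before a full run.
job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
        params=EvaluationJobParamsParam(limit_samples=10),
    ),
)
print(f"Smoke-test job: {job.name}")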

For additional help, see Troubleshooting.