Download this tutorial as a Jupyter notebook

Run a Benchmark Evaluation#

Learn how to run an evaluation job with an industry benchmark in approximately 15 minutes.

Prerequisites#

  1. Set up the NeMo Microservices Quickstart.

  2. A read access token for your Hugging Face account to access the benchmark dataset.

  3. A build.nvidia.com API key for running inference against a hosted model during evaluation.

Tip: Cleanup cells at the end of the notebook can be uncommented to let you delete resources if needed.

Overview#

Use case: evaluate math capabilities of a large language model.

Objectives#

By the end of this notebook, you will:

  • Discover industry benchmarks.

  • Use an industry benchmark to run an evaluation job.

  • Evaluate a model on grade school math word problems with the GSM8K dataset.

  • View evaluation results.

  • Download job artifacts.

# Install required packages
%pip install -q nemo-microservices ipywidgets
# Imports
import json
import os
import time

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import EvaluationJobParamsParam, InlineModel, SystemBenchmarkOnlineJobParam
# Set variables needed for the tutorial
workspace = "my-workspace"
nvidia_api_key = os.getenv("NVIDIA_API_KEY", "<your NVIDIA API key>")
hf_token = os.getenv("HF_TOKEN", "<your readonly Hugging Face token>")

# Initialize the SDK client
client = NeMoMicroservices(
    base_url=os.getenv("NMP_BASE_URL"),
    workspace=workspace,
)

Create a workspace to manage your secrets and evaluation jobs.

client.workspaces.create(name=workspace)

Discover Evaluation Benchmarks#

NeMo Microservices provides ready-to-use evaluation benchmarks that are available in the reserved system workspace. They can evaluate models against published datasets on a set of pre-defined metrics.

Discover all the industry benchmarks.

all_industry_benchmarks = client.evaluation.benchmarks.list(workspace="system", page_size=200)
print("Number of available industry benchmarks:", all_industry_benchmarks.pagination.total_results)

print("Example benchmark")
print(all_industry_benchmarks.data[0].model_dump_json(indent=2, exclude_none=True))
Output
Number of available industry benchmarks: 131
Example benchmark
{
  "id": "",
  "entity_id": "",
  "workspace": "system",
  "description": "BFCL v3 simple single-turn function calling. Tests basic function call generation.",
  "labels": {
    "eval_harness": "bfcl",
    "eval_category": "agentic"
  },
  "name": "bfclv3-simple",
  "required_params": [],
  "optional_params": [],
  "supported_job_types": [
    "online"
  ]
}
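
With this many benchmarks available, it can help to get a quick overview by category before drilling into specific benchmarks. The following optional sketch tallies benchmarks by their eval_category label, which appears in the example output above; it assumes the labels field behaves like a plain dictionary, as the JSON dump suggests.

# Optional: tally benchmarks by their eval_category label for a quick overview.
# Assumes each benchmark's `labels` field behaves like a dict (as in the JSON dump above).
from collections import Counter

category_counts = Counter(
    (benchmark.labels or {}).get("eval_category", "uncategorized")
    for benchmark in all_industry_benchmarks.data
)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")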

You can filter industry benchmarks by labels like math or advanced_reasoning. View the benchmark descriptions to choose the one that suits your evaluation needs.

filtered_industry_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "math"},
)
print("Filtered industry benchmarks:", filtered_industry_benchmarks.pagination.total_results)

for benchmark in filtered_industry_benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
Output
Filtered industry benchmarks: 12
math-test-500-nemo: math_test_500 questions, math, using NeMo's alignment template
aime-2025-nemo: AIME 2025 questions, math, using NeMo's alignment template
mgsm-cot: MGSM-CoT: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
gsm8k-cot-instruct: GSM8K-instruct: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
gsm8k: GSM8K: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
mgsm: MGSM: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
aa-aime-2024: AIME 2024 questions, math, using Artificial Analysis's setup.
aa-math-test-500: Open AI math test 500, using Artificial Analysis's setup.
aime-2024: AIME 2024 questions, math
aime-2025: AIME 2025 questions, math
math-test-500: Open AI math test 500
aime-2024-nemo: AIME 2024 questions, math, using NeMo's alignment template

For this tutorial, we will use the gsm8k-cot-instruct variant of the GSM8K benchmark. It evaluates the arithmetic reasoning of large language models on grade school math word problems using zero-shot chain-of-thought prompting.

Inspect the benchmark for details on how to configure the job. You will see that the benchmark supports online evaluations and requires the parameter hf_token. Online evaluation involves live inference calls to a model, whereas an offline evaluation expects a dataset representing pre-generated model outputs.

gsm8k_benchmark = client.evaluation.benchmarks.retrieve(workspace="system", name="gsm8k-cot-instruct")

print(gsm8k_benchmark.model_dump_json(indent=2, exclude_none=True))
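
As an optional programmatic check, you can confirm the retrieved benchmark supports online jobs before configuring one; the field name below matches the JSON dump shown earlier.

# Optional sanity check before configuring the job: the benchmark must support online jobs.
# `supported_job_types` is the field shown in the benchmark JSON above.
assert "online" in gsm8k_benchmark.supported_job_types, "benchmark does not support online jobs"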

Configure Model to Evaluate#

Evaluate any model, such as one hosted on build.nvidia.com. Create a secret for the API key, then configure the model with the hosted model's URL and that secret.

The GSM8K benchmark also requires that the model's tokenizer is available on Hugging Face.

For this tutorial, we will use the nvidia/llama-3.3-nemotron-super-49b-v1 model hosted on build.nvidia.com.

nvidia_api_key_secret = "nvidia-api-key"
client.secrets.create(name=nvidia_api_key_secret, data=nvidia_api_key)

model = InlineModel(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret=nvidia_api_key_secret,
)
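
Optionally, confirm the API key and endpoint work before launching an evaluation. The sketch below sends a single request to the same OpenAI-compatible chat completions URL configured above; it assumes the requests package is available in your environment and is not part of the NeMo Microservices SDK.

# Optional: one quick request to confirm the API key and endpoint work.
# Uses the same OpenAI-compatible chat completions URL and model configured above.
import requests

response = requests.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {nvidia_api_key}"},
    json={
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [{"role": "user", "content": "What is 7 * 8?"}],
        "max_tokens": 32,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])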

Run Evaluation Job#

The GSM8K benchmark requires the hf_token parameter to access the dataset from Hugging Face. Create a secret for your Hugging Face token to be referenced by the job.

gsm8k_benchmark.required_params
Output
[{'name': 'hf_token',
  'type': 'secret',
  'description': 'HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace.'}]
hf_token_secret = "hf-token"
client.secrets.create(name=hf_token_secret, data=hf_token)

For this tutorial, we limit the job to 15 samples from the benchmark dataset. Remove params.limit_samples to run the full evaluation.

Note: Parallelism controls the number of concurrent requests sent to the model during evaluation and can reduce the job runtime. Parallelism is set to 1 for this tutorial because https://integrate.api.nvidia.com has a rate limit of 1.

job = client.evaluation.benchmark_jobs.create(
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k-cot-instruct",
        benchmark_params={
            "hf_token": hf_token_secret,
        },
        params=EvaluationJobParamsParam(limit_samples=15, parallelism=1),
        model=model,
    )
)

Monitor the job by polling its status until it reaches a terminal state.

job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while job_status.status in ("active", "pending", "created"):
    time.sleep(10)
    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
    print("status:", job_status.status, job_status.status_details)
print(job_status.model_dump_json(indent=2, exclude_none=True))
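
Once the loop exits, the job is in a terminal state. As a guard before fetching results, you can fail fast if the job did not finish successfully; the "completed" status string below is an assumption, so adjust it to whatever terminal status your deployment reports.

# Fail fast if the job did not finish successfully.
# "completed" is an assumed terminal status string -- adjust it to match
# what get_status() reports in your deployment.
if job_status.status != "completed":
    raise RuntimeError(
        f"Evaluation job {job.name} ended with status {job_status.status}: {job_status.status_details}"
    )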

View Evaluation Results#

Evaluation results are available once the evaluation job successfully completes.

GSM8K scores fall in the range [0.0, 1.0] and measure the proportion of correct answers.

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
    workspace=workspace,
)
print(json.dumps(aggregate, indent=2))  # Returns a parsed dict with metric statistics
Output
{
  "scores": [
    {
      "name": "exact_match__flexible-extract",
      "value": 1.0,
      "stats": {
        "stderr": 0.0
      }
    },
    {
      "name": "exact_match__strict-match",
      "value": 1.0,
      "stats": {
        "stderr": 0.0
      }
    }
  ]
}
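
For logging or comparing runs, you can flatten the returned structure into a simple metric-name-to-value mapping; this sketch just reshapes the dictionary shown above.

# Reshape the aggregate result shown above into {metric_name: value}.
scores = {entry["name"]: entry["value"] for entry in aggregate["scores"]}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")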

View Job Artifacts#

View the job logs for more insight into the evaluation.

logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)
for log_entry in logs_response.data:
    print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

# Handle pagination
while logs_response.next_page:
    logs_response = client.evaluation.benchmark_jobs.get_logs(
        name=job.name,
        page_cursor=logs_response.next_page
    )
    for log_entry in logs_response.data:
        print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

Download the artifacts that the job produced during evaluation as a tarball.

artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(name=job.name)
artifacts.write_to_file("evaluation_artifacts.tar.gz")
print("Saved artifacts to evaluation_artifacts.tar.gz")

Extract the files from the tarball with the following command; this creates an artifacts directory.

tar -xf evaluation_artifacts.tar.gz
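
If you prefer to stay in the notebook, Python's standard-library tarfile module can extract the same archive.

# Alternative: extract the tarball from Python using the standard library.
import tarfile

with tarfile.open("evaluation_artifacts.tar.gz") as archive:
    # Extract into the current directory, producing the artifacts directory described above.
    archive.extractall(path=".")
print("Extracted the evaluation artifacts")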

Next Steps#

Scale Up:

  • Run the full evaluation by omitting the limit_samples job parameter. The full evaluation can take 5-10 hours.

Apply to Your Domain:

  • Search through available benchmarks and run an evaluation job with another industry benchmark.

  • Evaluate another model hosted on https://build.nvidia.com or another service, or host your own NIM (see the sketch below).
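
As a starting point for evaluating your own deployment, the same InlineModel shape used earlier can point at any OpenAI-compatible chat completions endpoint. The URL, model name, and secret below are placeholders for a self-hosted NIM, not values from this tutorial; replace them with your own.

# Hypothetical example: point the evaluation at a self-hosted NIM.
# The URL, model name, and secret are placeholders -- replace them with your own.
self_hosted_model = InlineModel(
    url="http://my-nim-host:8000/v1/chat/completions",  # placeholder endpoint
    name="meta/llama-3.1-8b-instruct",                  # placeholder model name
    api_key_secret=nvidia_api_key_secret,               # placeholder secret; use whatever auth your endpoint needs
)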

Learn More:

Cleanup#

Uncomment cleanup cells as needed to delete resources.

# # Delete evaluation jobs (PERMANENT)
# print("Deleting evaluation jobs...")
# for job in client.evaluation.benchmark_jobs.list().data:
#     client.evaluation.benchmark_jobs.delete(job.name)
#     print(f"Deleted evaluation job {job.name}")
# # Delete secrets
# print("Deleting secrets...")
# for secret in client.secrets.list().data:
#     client.secrets.delete(secret.name)
#     print(f"Deleted secret {secret.name}")