Download this tutorial as a Jupyter notebook

Run a Benchmark Evaluation#

Learn how to run an evaluation job with an industry benchmark in approximately 15 minutes.

Prerequisites#

  1. Set up the NeMo Microservices Quickstart.

  2. A read access token for your Hugging Face account to access the benchmark dataset.

  3. A build.nvidia.com API key for running inference against a hosted model during evaluation.

Tip: Cleanup cells at the end of the notebook can be uncommented to let you delete resources if needed.

Overview#

Use case: evaluate math capabilities of a large language model.

Objectives#

By the end of this notebook, you will:

  • Discover industry benchmarks.

  • Use an industry benchmark to run an evaluation job.

  • Evaluate a model on grade school math word problems with the GSM8K dataset.

  • View evaluation results.

  • Download job artifacts.

# Install required packages
%pip install -q nemo-microservices ipywidgets
# Imports
import json
import os
import time

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import EvaluationJobParamsParam, InlineModel, SystemBenchmarkOnlineJobParam
# Set variables needed for the tutorial
workspace = "my-workspace"
nvidia_api_key = os.getenv("NVIDIA_API_KEY", "<your NVIDIA API key>")
hf_token = os.getenv("HF_TOKEN", "<your readonly Hugging Face token>")

# Initialize the SDK client
client = NeMoMicroservices(
    base_url=os.getenv("NMP_BASE_URL"),
    workspace=workspace,
)

Create a workspace to manage your secrets and evaluation jobs.

client.workspaces.create(name=workspace)

Discover Evaluation Benchmarks#

NeMo Microservices provides ready-to-use evaluation benchmarks that are available in the reserved system workspace. They can evaluate models against published datasets on a set of pre-defined metrics.

Discover all the industry benchmarks.

all_industry_benchmarks = client.evaluation.benchmarks.list(workspace="system", page_size=200)
print("Number of available industry benchmarks:", all_industry_benchmarks.pagination.total_results)

print("Example benchmark")
print(all_industry_benchmarks.data[0].model_dump_json(indent=2, exclude_none=True))
Output
Number of available industry benchmarks: 131
Example benchmark
{
  "id": "",
  "entity_id": "",
  "workspace": "system",
  "description": "BFCL v3 simple single-turn function calling. Tests basic function call generation.",
  "labels": {
    "eval_harness": "bfcl",
    "eval_category": "agentic"
  },
  "name": "bfclv3-simple",
  "required_params": [],
  "optional_params": [],
  "supported_job_types": [
    "online"
  ]
}
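
With this many benchmarks available, it can help to get a quick overview by category before drilling into specific benchmarks. The following optional sketch tallies benchmarks by their eval_category label, which appears in the example output above; it assumes the labels field behaves like a plain dictionary, as the JSON dump suggests.

# Optional: tally benchmarks by their eval_category label for a quick overview.
# Assumes each benchmark's `labels` field behaves like a dict (as in the JSON dump above).
from collections import Counter

category_counts = Counter(
    (benchmark.labels or {}).get("eval_category", "uncategorized")
    for benchmark in all_industry_benchmarks.data
)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")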

You can filter industry benchmarks by labels like math or advanced_reasoning. View the benchmark descriptions to choose the one that suits your evaluation needs.

filtered_industry_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "math"},
)
print("Filtered industry benchmarks:", filtered_industry_benchmarks.pagination.total_results)

for benchmark in filtered_industry_benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
Output
Filtered industry benchmarks: 12
math-test-500-nemo: math_test_500 questions, math, using NeMo's alignment template
aime-2025-nemo: AIME 2025 questions, math, using NeMo's alignment template
mgsm-cot: MGSM-CoT: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
gsm8k-cot-instruct: GSM8K-instruct: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.
gsm8k: GSM8K: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.
mgsm: MGSM: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.
aa-aime-2024: AIME 2024 questions, math, using Artificial Analysis's setup.
aa-math-test-500: Open AI math test 500, using Artificial Analysis's setup.
aime-2024: AIME 2024 questions, math
aime-2025: AIME 2025 questions, math
math-test-500: Open AI math test 500
aime-2024-nemo: AIME 2024 questions, math, using NeMo's alignment template

For this tutorial, we will use the gsm8k-cot-instruct variant of the GSM8K benchmark. It evaluates the arithmetic reasoning of large language models on grade school math word problems using zero-shot chain-of-thought prompting.

Inspect the benchmark for details on how to configure the job. You will see that the benchmark supports online evaluations and requires the parameter hf_token. Online evaluation involves live inference calls to a model, whereas an offline evaluation expects a dataset representing pre-generated model outputs.

gsm8k_benchmark = client.evaluation.benchmarks.retrieve(workspace="system", name="gsm8k-cot-instruct")

print(gsm8k_benchmark.model_dump_json(indent=2, exclude_none=True))
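
As an optional programmatic check, you can confirm the retrieved benchmark supports online jobs before configuring one; the field name below matches the JSON dump shown earlier.

# Optional sanity check before configuring the job: the benchmark must support online jobs.
# `supported_job_types` is the field shown in the benchmark JSON above.
assert "online" in gsm8k_benchmark.supported_job_types, "benchmark does not support online jobs"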

Configure Model to Evaluate#

Evaluate any model, such as one hosted on build.nvidia.com. Create a secret for the API key, then configure the model with the hosted model's URL and that secret.

The GSM8K benchmark also requires that the model's tokenizer is available on Hugging Face.

For this tutorial, we will use the nvidia/llama-3.3-nemotron-super-49b-v1 model hosted on build.nvidia.com.

nvidia_api_key_secret = "nvidia-api-key"
client.secrets.create(name=nvidia_api_key_secret, data=nvidia_api_key)

model = InlineModel(
    url="https://integrate.api.nvidia.com/v1/chat/completions",
    name="nvidia/llama-3.3-nemotron-super-49b-v1",
    api_key_secret=nvidia_api_key_secret,
)
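
Optionally, confirm the API key and endpoint work before launching an evaluation. The sketch below sends a single request to the same OpenAI-compatible chat completions URL configured above; it assumes the requests package is available in your environment and is not part of the NeMo Microservices SDK.

# Optional: one quick request to confirm the API key and endpoint work.
# Uses the same OpenAI-compatible chat completions URL and model configured above.
import requests

response = requests.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {nvidia_api_key}"},
    json={
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [{"role": "user", "content": "What is 7 * 8?"}],
        "max_tokens": 32,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])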

Run Evaluation Job#

The GSM8K benchmark requires the hf_token parameter to access the dataset from Hugging Face. Create a secret for your Hugging Face token to be referenced by the job.

gsm8k_benchmark.required_params
Output
[{'name': 'hf_token',
  'type': 'secret',
  'description': 'HuggingFace token for accessing datasets and tokenizers. Required for tasks that fetch from HuggingFace.'}]
hf_token_secret = "hf-token"
client.secrets.create(name=hf_token_secret, data=hf_token)

For this tutorial, we limit the job to 15 samples from the benchmark dataset. Remove params.limit_samples to run the full evaluation.

Note: Parallelism controls the number of concurrent requests sent to the model during evaluation and can reduce the job runtime. Parallelism is set to 1 for this tutorial because https://integrate.api.nvidia.com has a rate limit of 1.

job = client.evaluation.benchmark_jobs.create(
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k-cot-instruct",
        benchmark_params={
            "hf_token": hf_token_secret,
        },
        params=EvaluationJobParamsParam(limit_samples=15, parallelism=1),
        model=model,
    )
)

Monitor the job by polling its status until it reaches a terminal state.

job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while job_status.status in ("active", "pending", "created"):
    time.sleep(10)
    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
    print("status:", job_status.status, job_status.status_details)
print(job_status.model_dump_json(indent=2, exclude_none=True))
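
Once the loop exits, the job is in a terminal state. As a guard before fetching results, you can fail fast if the job did not finish successfully; the "completed" status string below is an assumption, so adjust it to whatever terminal status your deployment reports.

# Fail fast if the job did not finish successfully.
# "completed" is an assumed terminal status string -- adjust it to match
# what get_status() reports in your deployment.
if job_status.status != "completed":
    raise RuntimeError(
        f"Evaluation job {job.name} ended with status {job_status.status}: {job_status.status_details}"
    )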

View Evaluation Results#

Evaluation results are available once the evaluation job successfully completes.

GSM8K scores fall in the range [0.0, 1.0] and measure the proportion of correct answers.

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
    workspace=workspace,
)
print(json.dumps(aggregate, indent=2))  # Returns a parsed dict with metric statistics
Output
{
  "scores": [
    {
      "name": "exact_match__flexible-extract",
      "value": 1.0,
      "stats": {
        "stderr": 0.0
      }
    },
    {
      "name": "exact_match__strict-match",
      "value": 1.0,
      "stats": {
        "stderr": 0.0
      }
    }
  ]
}
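
For logging or comparing runs, you can flatten the returned structure into a simple metric-name-to-value mapping; this sketch just reshapes the dictionary shown above.

# Reshape the aggregate result shown above into {metric_name: value}.
scores = {entry["name"]: entry["value"] for entry in aggregate["scores"]}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")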

View Job Artifacts#

View the job logs for more insight into the evaluation.

logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)
for log_entry in logs_response.data:
    print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

# Handle pagination
while logs_response.next_page:
    logs_response = client.evaluation.benchmark_jobs.get_logs(
        name=job.name,
        page_cursor=logs_response.next_page
    )
    for log_entry in logs_response.data:
        print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

Download the artifacts that the job produced during evaluation as a tarball.

artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(name=job.name)
artifacts.write_to_file("evaluation_artifacts.tar.gz")
print("Saved artifacts to evaluation_artifacts.tar.gz")

Extract the files from the tarball with the following command; this creates an artifacts directory.

tar -xf evaluation_artifacts.tar.gz
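
If you prefer to stay in the notebook, Python's standard-library tarfile module can extract the same archive.

# Alternative: extract the tarball from Python using the standard library.
import tarfile

with tarfile.open("evaluation_artifacts.tar.gz") as archive:
    # Extract into the current directory, producing the artifacts directory described above.
    archive.extractall(path=".")
print("Extracted the evaluation artifacts")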

Next Steps#

Scale Up:

  • Run the full evaluation by omitting the limit_samples job parameter. The full evaluation can take 5-10 hours.

Apply to Your Domain:

  • Search through available benchmarks and run an evaluation job with another industry benchmark.

  • Evaluate another model hosted on https://build.nvidia.com or another service, or host your own NIM (see the sketch below).
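
As a starting point for evaluating your own deployment, the same InlineModel shape used earlier can point at any OpenAI-compatible chat completions endpoint. The URL, model name, and secret below are placeholders for a self-hosted NIM, not values from this tutorial; replace them with your own.

# Hypothetical example: point the evaluation at a self-hosted NIM.
# The URL, model name, and secret are placeholders -- replace them with your own.
self_hosted_model = InlineModel(
    url="http://my-nim-host:8000/v1/chat/completions",  # placeholder endpoint
    name="meta/llama-3.1-8b-instruct",                  # placeholder model name
    api_key_secret=nvidia_api_key_secret,               # placeholder secret; use whatever auth your endpoint needs
)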

Learn More:

Cleanup#

Uncomment cleanup cells as needed to delete resources.

# # Delete evaluation jobs (PERMANENT)
# print("Deleting evaluation jobs...")
# for job in client.evaluation.benchmark_jobs.list().data:
#     client.evaluation.benchmark_jobs.delete(job.name)
#     print(f"Deleted evaluation job {job.name}")
# # Delete secrets
# print("Deleting secrets...")
# for secret in client.secrets.list().data:
#     client.secrets.delete(secret.name)
#     print(f"Deleted secret {secret.name}")