Industry Benchmarks#

Evaluate with Published Datasets#

NeMo Microservices provides a streamlined API for evaluating large language models against publicly available datasets, offering over 130 industry benchmarks that you can run as evaluation jobs.

Benchmarks provide standardized methods for comparing model performance across different capabilities. They are widely used in the research community and offer reliable, reproducible metrics for model assessment.

Use the Run a Benchmark Evaluation tutorial to gain a deeper understanding of how to use an industry benchmark and manage an evaluation job.

  • Standard Datasets: Most benchmarks include predefined datasets widely used in research.

  • Reproducible Metrics: Use established methodologies to calculate metrics.

  • Community Standards: You can compare results across different models and research groups.

Discover Industry Benchmarks#

Discover the industry benchmarks available for your evaluation jobs in the system workspace. List all industry benchmarks or filter them by category label.

Note

The system workspace is a reserved workspace in NeMo Microservices that contains ready-to-use industry benchmarks with published datasets and metrics.

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices()

all_system_benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(all_system_benchmarks)

# Filter by evaluation category label
filtered_system_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_category]": "advanced_reasoning"},
)
print(filtered_system_benchmarks)

| Category Label | Description |
|---|---|
| agentic | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| code | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| content_safety | Evaluate model safety risks, including vulnerability to generating harmful, biased, or misleading content. |
| instruction_following | Evaluate the ability to follow explicit formatting and structural instructions. |
| language_understanding | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| math | Evaluate mathematical reasoning abilities. |
| question_answering | Evaluate the ability to generate answers to questions. |
| rag | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| retrieval | Evaluate the quality of document retriever pipelines. |

Choosing a Benchmark Variant#

Many benchmarks offer multiple variants optimized for different model types:

| Variant | Endpoint | Description |
|---|---|---|
| -instruct | /v1/chat/completions | Zero-shot evaluation for instruction-tuned models |
| Base (no suffix) | /v1/completions | Few-shot evaluation for base models (requires tokenizer param) |
| -nemo | /v1/chat/completions | Optimized prompts for NVIDIA NeMo models, often no judge required |
| -cot | /v1/chat/completions | Chain-of-thought prompting for improved reasoning accuracy |
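To illustrate how the variant choice maps onto the job spec, here is a minimal sketch contrasting a chat-based variant with its completions-based base counterpart. The endpoint URLs, model names, and tokenizer ID are placeholders; the per-category examples later on this page follow the same pattern.

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")

# -instruct / -cot variants: zero-shot over /v1/chat/completions, no tokenizer needed
chat_job = client.evaluation.benchmark_jobs.create(
    description="Variant sketch: instruction-tuned model",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-cot",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "<your-instruct-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

# Base variant (no suffix): few-shot over /v1/completions, tokenizer required
base_job = client.evaluation.benchmark_jobs.create(
    description="Variant sketch: base model",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token", "tokenizer": "<your-model-tokenizer>"},
    )
)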

Common Parameters#

| Parameter | Description |
|---|---|
| parallelism | Number of concurrent inference requests. Higher values increase throughput but may hit rate limits. |
| limit_samples | Evaluate only the first N samples. Useful for testing before running full evaluations. |
| hf_token | Reference to a secret containing your Hugging Face token for accessing gated datasets. |
| tokenizer | Hugging Face tokenizer ID, required for completions-based benchmarks. |
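As a quick smoke test before a full run, you might cap the sample count and lower parallelism. The sketch below assumes that limit_samples is accepted by EvaluationJobParams alongside parallelism, as the table above groups them together; verify the exact field placement against the SDK reference before relying on it.

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")

smoke_test = client.evaluation.benchmark_jobs.create(
    description="Smoke test: first 10 samples only",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        # Assumption: limit_samples sits alongside parallelism in EvaluationJobParams
        params=EvaluationJobParams(parallelism=4, limit_samples=10),
        benchmark_params={"hf_token": "hf_token"},
    )
)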

Advanced Reasoning#

Evaluate reasoning capabilities of large language models through complex tasks with datasets like GPQA, BIG-Bench Hard (BBH), or Multistep Soft Reasoning (MuSR).

  • Label: advanced_reasoning

Available Benchmarks#

To get the latest benchmarks available on your system, filter by the advanced_reasoning category label.

| Benchmark | Description | Required Params |
|---|---|---|
| system/gpqa-diamond | GPQA Diamond subset—198 graduate-level science questions | hf_token |
| system/gpqa-extended | GPQA Extended subset—546 questions in biology, physics, chemistry | hf_token |
| system/gpqa-main | GPQA Main subset—448 questions | hf_token |
| system/gpqa-diamond-nemo | GPQA Diamond with NeMo alignment template | hf_token |
| system/gpqa-diamond-cot | GPQA Diamond with chain-of-thought prompting | hf_token |
| system/gpqa | GPQA few-shot evaluation ¹ | hf_token, tokenizer |
| system/bbh-instruct | BIG-Bench Hard—23 challenging reasoning tasks | hf_token |
| system/bbh | BIG-Bench Hard ¹ | hf_token, tokenizer |
| system/musr | MuSR—multistep reasoning through narrative problems ¹ | hf_token, tokenizer |

¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Extended evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-extended",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Main evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-main",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond evaluation with NeMo template",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-nemo",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond chain-of-thought evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-cot",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="GPQA few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="BIG-Bench Hard evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bbh-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="BIG-Bench Hard evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bbh",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="Multistep Soft Reasoning evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/musr",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)

Note

Most benchmarks require a Hugging Face token (hf_token) to access gated datasets. Create this secret before running evaluations:

import os

client.secrets.create(
    workspace=workspace,
    name="hf_token",
    data=os.getenv("HF_TOKEN", "<your Hugging Face token>")
)

Results#

All advanced reasoning benchmarks produce an accuracy score (0.0–1.0) measuring the proportion of correct answers:

  • GPQA: Multiple-choice accuracy (random baseline = 25%)

  • BBH: Exact match accuracy across 23 reasoning tasks

  • MuSR: Accuracy on multistep reasoning narratives

# Get results after job completes
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
)

# Print accuracy
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Instruction Following#

Evaluate a model’s ability to follow explicit formatting and structural instructions such as “include keyword x” or “use format y.”

  • Label: instruction_following

Available Benchmarks#

| Benchmark | Description | Required Params |
|---|---|---|
| system/ifeval | IFEval—500 prompts testing adherence to verifiable instructions | hf_token |

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")

job = client.evaluation.benchmark_jobs.create(
    description="IFEval instruction following evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/ifeval",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Results#

IFEval produces multiple accuracy scores measuring instruction compliance:

  • Prompt-level accuracy: Percentage of prompts where all instructions were followed

  • Instruction-level accuracy: Percentage of individual instructions followed across all prompts

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(name=job.name)
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Language Understanding#

Evaluate knowledge and reasoning across diverse subjects using MMLU (Massive Multitask Language Understanding) benchmarks covering 57 subjects across STEM, humanities, social sciences, and more.

  • Label: language_understanding

Available Benchmarks#

| Benchmark | Description | Required Params |
|---|---|---|
| system/mmlu | MMLU—57 subjects, few-shot evaluation ¹ | hf_token, tokenizer |
| system/mmlu-instruct | MMLU zero-shot with single-letter response format | hf_token |
| system/mmlu-pro | MMLU-Pro—10 answer choices, more rigorous ¹ | hf_token, tokenizer |
| system/mmlu-pro-instruct | MMLU-Pro zero-shot with chat template | hf_token |
| system/mmlu-redux | MMLU-Redux—3,000 re-annotated questions ¹ | hf_token, tokenizer |
| system/mmlu-redux-instruct | MMLU-Redux zero-shot with chat template | hf_token |
| system/wikilingua | WikiLingua—cross-lingual summarization | hf_token |
| system/mmlu-{lang} | Global-MMLU in 30+ languages ² | hf_token |

¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.

² Supported languages: am, ar, bn, cs, de, el, en, es, fa, fil, fr, ha, he, hi, id, ig, it, ja, ko, ky, lt, mg, ms, ne, nl, ny, pl, pt, ro, ru, si, sn, so, sr, sv, sw, te, tr, uk, vi, yo

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")
job = client.evaluation.benchmark_jobs.create(
    description="MMLU zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="MMLU few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Pro zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Pro few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Redux zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-redux-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="Global-MMLU Spanish evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-es",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="WikiLingua cross-lingual summarization",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/wikilingua",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Results#

Language understanding benchmarks produce accuracy scores:

  • MMLU/MMLU-Pro/MMLU-Redux: Multiple-choice accuracy across subjects (random baseline = 25% for MMLU, 10% for MMLU-Pro)

  • WikiLingua: ROUGE scores for summarization quality

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(name=job.name)
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Math & Reasoning#

Evaluate mathematical reasoning abilities from grade school arithmetic to competition-level mathematics.

  • Label: math

Available Benchmarks#

| Benchmark | Description | Required Params |
|---|---|---|
| system/gsm8k | GSM8K—1,319 grade school math problems ¹ | hf_token, tokenizer |
| system/gsm8k-cot-instruct | GSM8K with chain-of-thought zero-shot prompting | hf_token |
| system/mgsm | MGSM—multilingual math (10 languages) ¹ | hf_token, tokenizer |
| system/mgsm-cot | MGSM with chain-of-thought prompting | hf_token |
| system/aime-2024 | AIME 2024—competition math ² | judge |
| system/aime-2025 | AIME 2025—competition math ² | judge |
| system/aime-2024-nemo | AIME 2024 with NeMo template | |
| system/aime-2025-nemo | AIME 2025 with NeMo template | |
| system/math-test-500 | MATH test set (500 problems) ² | judge |
| system/math-test-500-nemo | MATH test with NeMo template | |
| system/aa-aime-2024 | AIME 2024 (Artificial Analysis setup) ² | judge |
| system/aa-math-test-500 | MATH test (Artificial Analysis setup) ² | judge |

¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.

² Judge required: Requires a judge model to evaluate free-form math responses.

Important

For math benchmarks requiring a judge, use a model with strong instruction-following capabilities (70B+ parameters recommended). Smaller models may produce malformed judge outputs.

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")
job = client.evaluation.benchmark_jobs.create(
    description="GSM8K chain-of-thought evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k-cot-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="GSM8K few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="MGSM multilingual math evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mgsm-cot",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

No judge required with NeMo template.

job = client.evaluation.benchmark_jobs.create(
    description="AIME 2025 competition math",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aime-2025-nemo",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Requires a judge model.

job = client.evaluation.benchmark_jobs.create(
    description="AIME 2025 competition math with judge",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aime-2025",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "judge": {
                "model": {
                    "endpoint": "<your-nim-endpoint>/v1",
                    "name": "nvidia/llama-3.3-nemotron-super-49b-v1",
                }
            }
        },
    )
)

No judge required with NeMo template.

job = client.evaluation.benchmark_jobs.create(
    description="MATH test set evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/math-test-500-nemo",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Results#

Math benchmarks produce accuracy scores:

  • GSM8K/MGSM: Exact match accuracy on final numerical answers

  • AIME/MATH: Correctness as judged by the judge model (or exact match for NeMo variants)

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(name=job.name)
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Content Safety#

Evaluate model safety risks, including vulnerability to generating harmful, biased, or misleading content.

  • Label: content_safety

Available Benchmarks#

| Benchmark | Description | Required Params |
|---|---|---|
| system/aegis-v2 | AEGIS 2.0—12 hazard categories using Nemotron Safety Guard | hf_token, judge |
| system/wildguard | WildGuard—privacy, misinformation, harmful language | hf_token, judge |

Important

Safety benchmarks require specific judge models deployed with a /v1/completions endpoint (not chat/completions):

| Benchmark | Required Judge Model | Model ID |
|---|---|---|
| system/aegis-v2 | Llama Nemotron Safety Guard V2 | nvidia/llama-3.1-nemoguard-8b-content-safety |
| system/wildguard | WildGuard | allenai/wildguard |

These benchmarks can take 1–3 hours to complete.

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")

Requires Llama Nemotron Safety Guard V2 judge.

job = client.evaluation.benchmark_jobs.create(
    description="AEGIS-v2 content safety evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aegis-v2",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "judge": {
                "model": {
                    "endpoint": "<your-safety-guard-endpoint>/v1/completions",
                    "name": "nvidia/llama-3.1-nemoguard-8b-content-safety",
                }
            }
        },
    )
)

Requires WildGuard judge.

job = client.evaluation.benchmark_jobs.create(
    description="WildGuard content safety evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/wildguard",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "judge": {
                "model": {
                    "endpoint": "<your-wildguard-endpoint>/v1/completions",
                    "name": "allenai/wildguard",
                }
            }
        },
    )
)

Results#

Safety benchmarks produce category-level safety rates:

  • AEGIS-v2: Safe/unsafe classification across 12 hazard categories (violence, hate, sexual content, etc.)

  • WildGuard: Safe/unsafe classification for privacy, misinformation, harmful language, malicious use

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(name=job.name)
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Troubleshooting Content Safety Benchmarks#

See Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs.

This section covers common issues for the safety harness.

Hugging Face Error#

Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with errors like the following, visit https://huggingface.co/ and log in to request access to the dataset or model.

datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
GatedRepoError: 403 Client Error.

Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.

Incompatible Judge Model#

Using an unsupported judge model results in a job error. The aegis-v2 evaluation requires the Llama Nemotron Safety Guard V2 judge, and the wildguard evaluation requires the allenai/wildguard judge. A KeyError like the following is a typical symptom of using the wrong judge model.

Metrics calculated


        Evaluation Metrics         
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│           ERROR │           5.0 │
└─────────────────┴───────────────┘

...

Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
...
"/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
    return parse_output(output_dir)
  File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
    safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'

Unexpected Reasoning Traces#

Safety evaluations do not support reasoning traces and may result in the job error below.

ERROR    There are  at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.

If the target model outputs reasoning traces such as <think>reasoning context</think>, set job.benchmark_params.judge.reasoning_params.end_token so that only the text after the final thought is evaluated. Also consider setting job.benchmark_params.judge.inference_params.max_tokens to a limit large enough for the model’s chain of thought to conclude with the expected reasoning end token, so the reasoning content can be properly stripped before evaluation.

This error can also occur when the model exceeds its token limit and the entire response is consumed by its thinking. You can drop such responses by setting the job.benchmark_params.judge.reasoning_params.include_if_not_finished parameter to False, as in the example below.

benchmark_params = {
  "judge": {
    "model": "<judge>",
    "inference_params": {
        # Allow enough room for the chain of thought to reach the end token
        "max_tokens": 512
    },
    "reasoning_params": {
        # Strip everything up to and including the reasoning end token
        "end_token": "</think>",
        # Drop responses whose reasoning never finished
        "include_if_not_finished": False
    }
  }
}

Code#

Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs.

  • Label: code

Available Benchmarks#

| Benchmark | Description | Required Params |
|---|---|---|
| system/humaneval | HumanEval—164 Python problems ¹ | |
| system/humaneval-instruct | HumanEval for instruction-tuned models | |
| system/humanevalplus | HumanEval+—80x more test cases ¹ | |
| system/mbpp | MBPP—Python programming problems | |
| system/mbppplus | MBPP+—35x more test cases | |
| system/mbppplus-nemo | MBPP+ with NeMo template | |
| system/multiple-{lang} | MultiPL-E—HumanEval in 20+ languages ¹ | |

¹ Completions-only: Requires /v1/completions endpoint.

MultiPL-E languages: clj (Clojure), cpp (C++), cs (C#), d, elixir, go, hs (Haskell), java, jl (Julia), js (JavaScript), lua, ml (OCaml), php, pl (Perl), r, rb (Ruby), rkt (Racket), rs (Rust), scala, sh (Bash), swift, ts (TypeScript)

These benchmarks can take 1–5 hours to complete.

Examples#

from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoMicroservices(workspace="my-workspace")
job = client.evaluation.benchmark_jobs.create(
    description="HumanEval code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humaneval-instruct",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="HumanEval code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humaneval",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Extended test suite with 80x more test cases.

job = client.evaluation.benchmark_jobs.create(
    description="HumanEval+ code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humanevalplus",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="MBPP+ Python programming",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mbppplus-nemo",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
job = client.evaluation.benchmark_jobs.create(
    description="MBPP+ Python programming",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mbppplus",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="MultiPL-E JavaScript",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/multiple-js",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Requires a /v1/completions endpoint.

job = client.evaluation.benchmark_jobs.create(
    description="MultiPL-E Rust",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/multiple-rs",
        model={"endpoint": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)

Results#

Code benchmarks produce pass@k metrics measuring functional correctness:

  • pass@1: Percentage of problems solved with one attempt

  • pass@10: Percentage of problems solved within 10 attempts (when n_samples > 1)

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(name=job.name)
for score in aggregate["scores"]:
    print(f"{score['name']}: {score['value']:.1%}")

For detailed results analysis, see Benchmark Results.

Tip

Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.

Run Benchmark Job#

Create a workspace if you have not already done so.

workspace = "my-workspace"

client.workspaces.create(name=workspace)

Create an evaluation job with a benchmark, satisfying its required parameters and any optional parameters you need.

Note

For benchmarks that require a Hugging Face token or other API keys for external services, create the secret to be referenced by the job.

Most benchmarks require a Hugging Face token (hf_token) to access gated datasets. Create this secret before running evaluations:

import os

client.secrets.create(
    workspace=workspace,
    name="hf_token",
    data=os.getenv("HF_TOKEN", "<your Hugging Face token>")
)
from nemo_microservices.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

job = client.evaluation.benchmark_jobs.create(
    description="Example running system benchmark to evaluate my model's advanced reasoning capabilities.",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-cot",
        model={"endpoint": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)

Job Management#

After successfully creating a job, navigate to Benchmark Job Management to oversee its execution and monitor its progress.
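If you prefer to check on the job from the SDK, a polling loop along the following lines may work. The retrieve call and status field names below are assumptions based on the patterns shown above, not confirmed SDK API; treat this as a sketch and consult Benchmark Job Management for the supported operations.

import time

# Hypothetical polling loop; method and field names are assumptions, not confirmed SDK API.
while True:
    current = client.evaluation.benchmark_jobs.retrieve(name=job.name)
    print(f"Job {job.name}: {current.status}")
    if current.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)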