{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "(evaluate-run-an-evaluation)=\n", "# Run a Benchmark Evaluation\n", "\n", "Learn how to perform an evaluation job with an industry benchmark in approximately 15 minutes.\n", "\n", "## Prerequisites\n", "\n", "1. Set up the [NeMo Microservices Quickstart](nmp-quickstart).\n", "1. A read [access token](https://huggingface.co/settings/tokens) for your Hugging Face account to access the benchmark dataset.\n", "1. A [build.nvidia.com](https://build.nvidia.com) API key for inference with a hosted model for evaluation.\n", "\n", "> **Tip:** [Cleanup cells](#cleanup) at the end of the notebook can be uncommented to let you delete resources if needed.\n", "\n", "## Overview\n", "\n", "**Use case:** evaluate math capabilities of a large language model.\n", "\n", "## Objectives\n", "\n", "By the end of this notebook, you will:\n", "* Discover industry benchmarks\n", "* Use an industry benchmark to run an evaluation job\n", "* Evaluate a model on solving grade school math word problems with the [GSM8k dataset](https://arxiv.org/abs/2110.14168).\n", "* View evaluation results.\n", "* Download job artifacts." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Install required packages\n", "%pip install -q nemo-microservices ipywidgets" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "source": [ "# Imports\n", "import json\n", "import os\n", "import time\n", "\n", "from nemo_microservices import NeMoMicroservices\n", "from nemo_microservices.types.evaluation import EvaluationJobParamsParam, InlineModel, SystemBenchmarkOnlineJobParam" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "source": [ "# Set variables needed for the tutorial\n", "workspace = \"my-workspace\"\n", "nvidia_api_key = os.getenv(\"NVIDIA_API_KEY\", \"\")\n", "hf_token = os.getenv(\"HF_TOKEN\", \"\")\n", "\n", "# Initialize the SDK client\n", "client = NeMoMicroservices(\n", " base_url=os.getenv(\"NMP_BASE_URL\"),\n", " workspace=workspace,\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a workspace to manage your secrets and evaluation jobs." ] }, { "cell_type": "code", "metadata": {}, "source": [ "client.workspaces.create(name=workspace)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discover Evaluation Benchmarks\n", "\n", "NeMo Microservices provides ready-to-use evaluation benchmarks that are available in the reserved `system` workspace. They can evaluate models against published datasets on a set of pre-defined metrics.\n", "\n", "Discover all the industry benchmarks." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "all_industry_benchmarks = client.evaluation.benchmarks.list(workspace=\"system\", page_size=200)\n", "print(\"Number of available industry benchmarks:\", all_industry_benchmarks.pagination.total_results)\n", "\n", "print(\"Example benchmark\")\n", "print(all_industry_benchmarks.data[0].model_dump_json(indent=2, exclude_none=True))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{dropdown} Output\n", ":icon: code-square\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "Number of available industry benchmarks: 131\n", "Example benchmark\n", "{\n", " \"id\": \"\",\n", " \"entity_id\": \"\",\n", " \"workspace\": \"system\",\n", " \"description\": \"BFCL v3 simple single-turn function calling. Tests basic function call generation.\",\n", " \"labels\": {\n", " \"eval_harness\": \"bfcl\",\n", " \"eval_category\": \"agentic\"\n", " },\n", " \"name\": \"bfclv3-simple\",\n", " \"required_params\": [],\n", " \"optional_params\": [],\n", " \"supported_job_types\": [\n", " \"online\"\n", " ]\n", "}" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "You can filter industry benchmarks by labels like `math` or `advanced_reasoning`. View the benchmark descriptions to choose the one that suits your evaluation needs." ] }, { "cell_type": "code", "metadata": {}, "source": [ "filtered_industry_benchmarks = client.evaluation.benchmarks.list(\n", " workspace=\"system\",\n", " extra_query={\"search[data.labels.eval_category]\": \"math\"},\n", ")\n", "print(\"Filtered industry benchmarks:\", filtered_industry_benchmarks.pagination.total_results)\n", "\n", "for benchmark in filtered_industry_benchmarks:\n", " print(f\"{benchmark.name}: {benchmark.description}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{dropdown} Output\n", ":icon: code-square\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "Filtered industry benchmarks: 12\n", "math-test-500-nemo: math_test_500 questions, math, using NeMo's alignment template\n", "aime-2025-nemo: AIME 2025 questions, math, using NeMo's alignment template\n", "mgsm-cot: MGSM-CoT: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.\n", "gsm8k-cot-instruct: GSM8K-instruct: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems. This variant defaults to chain-of-thought zero-shot evaluation with custom instructions.\n", "gsm8k: GSM8K: The GSM8K benchmark evaluates the arithmetic reasoning of large language models using 1,319 grade school math word problems.\n", "mgsm: MGSM: The Multilingual Grade School Math (MGSM) benchmark evaluates the reasoning abilities of large language models in multilingual settings. 
It consists of 250 grade-school math problems from the GSM8K dataset, translated into ten diverse languages, and tests models using chain-of-thought prompting.\n", "aa-aime-2024: AIME 2024 questions, math, using Artificial Analysis's setup.\n", "aa-math-test-500: Open AI math test 500, using Artificial Analysis's setup.\n", "aime-2024: AIME 2024 questions, math\n", "aime-2025: AIME 2025 questions, math\n", "math-test-500: Open AI math test 500\n", "aime-2024-nemo: AIME 2024 questions, math, using NeMo's alignment template" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "For this tutorial, we will evaluate with the GSM8K benchmark, specifically its `gsm8k-cot-instruct` variant. This benchmark evaluates the arithmetic reasoning of large language models using grade school math word problems.\n", "\n", "Inspect the benchmark for details on how to configure the job. You will see that the benchmark supports online evaluation and requires the parameter `hf_token`. Online evaluation involves live inference calls to a model, whereas an offline evaluation expects a dataset representing pre-generated model outputs." ] }, { "cell_type": "code", "metadata": {}, "source": [ "gsm8k_benchmark = client.evaluation.benchmarks.retrieve(workspace=\"system\", name=\"gsm8k-cot-instruct\")\n", "\n", "print(gsm8k_benchmark.model_dump_json(indent=2, exclude_none=True))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Model to Evaluate\n", "\n", "Evaluate any model, such as one hosted on [build.nvidia.com](https://build.nvidia.com/). Create a secret for the API key and configure the model with the URL and secret of the hosted model.\n", "\n", "The GSM8K benchmark also requires the model's tokenizer, which must be available on Hugging Face.\n", "\n", "For this tutorial, we will use the model `nvidia/llama-3.3-nemotron-super-49b-v1` hosted on [build.nvidia.com](https://build.nvidia.com/)." ] }, { "cell_type": "code", "metadata": {}, "source": [ "nvidia_api_key_secret = \"nvidia-api-key\"\n", "client.secrets.create(name=nvidia_api_key_secret, data=nvidia_api_key)\n", "\n", "model = InlineModel(\n", "    url=\"https://integrate.api.nvidia.com/v1/chat/completions\",\n", "    name=\"nvidia/llama-3.3-nemotron-super-49b-v1\",\n", "    api_key_secret=nvidia_api_key_secret,\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Evaluation Job\n", "\n", "The GSM8K benchmark requires the `hf_token` parameter to access the dataset from Hugging Face. Create a secret for your Hugging Face token to be referenced by the job." ] }, { "cell_type": "code", "metadata": {}, "source": [ "gsm8k_benchmark.required_params" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{dropdown} Output\n", ":icon: code-square\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "[{'name': 'hf_token',\n", "  'type': 'secret',\n", "  'description': 'HuggingFace token for accessing datasets and tokenizers. 
Required for tasks that fetch from HuggingFace.'}]" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::" ] }, { "cell_type": "code", "metadata": {}, "source": [ "hf_token_secret = \"hf-token\"\n", "client.secrets.create(name=hf_token_secret, data=hf_token)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this tutorial, we limit the job to 15 samples from the benchmark dataset. Remove `params.limit_samples` to run the full evaluation.\n", "\n", "> **Note:** Parallelism controls the number of concurrent requests to the model during evaluation and can improve the job runtime. Parallelism is set to 1 for this tutorial because https://integrate.api.nvidia.com has a rate limit of 1 concurrent request." ] }, { "cell_type": "code", "metadata": {}, "source": [ "job = client.evaluation.benchmark_jobs.create(\n", "    spec=SystemBenchmarkOnlineJobParam(\n", "        benchmark=\"system/gsm8k-cot-instruct\",\n", "        benchmark_params={\n", "            \"hf_token\": hf_token_secret,\n", "        },\n", "        params=EvaluationJobParamsParam(limit_samples=15, parallelism=1),\n", "        model=model,\n", "    )\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Monitor the job until it completes." ] }, { "cell_type": "code", "metadata": {}, "source": [ "job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)\n", "while job_status.status in (\"active\", \"pending\", \"created\"):\n", "    time.sleep(10)\n", "    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)\n", "    print(\"status:\", job_status.status, job_status.status_details)\n", "print(job_status.model_dump_json(indent=2, exclude_none=True))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Evaluation Results\n", "\n", "Evaluation results are available once the evaluation job completes successfully.\n", "\n", "GSM8K scores range from 0.0 to 1.0 and measure the proportion of correct answers." ] }, { "cell_type": "code", "metadata": {}, "source": [ "aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(\n", "    name=job.name,\n", "    workspace=workspace,\n", ")\n", "print(json.dumps(aggregate, indent=2)) # Returns a parsed dict with metric statistics" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{dropdown} Output\n", ":icon: code-square\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "{\n", "  \"scores\": [\n", "    {\n", "      \"name\": \"exact_match__flexible-extract\",\n", "      \"value\": 1.0,\n", "      \"stats\": {\n", "        \"stderr\": 0.0\n", "      }\n", "    },\n", "    {\n", "      \"name\": \"exact_match__strict-match\",\n", "      \"value\": 1.0,\n", "      \"stats\": {\n", "        \"stderr\": 0.0\n", "      }\n", "    }\n", "  ]\n", "}" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::"
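] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check on the numbers, you can flatten the aggregate scores into a name-to-value summary. The following is a minimal sketch that assumes `aggregate` is the parsed dictionary with the `scores` structure shown in the output above." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Minimal sketch: summarize the aggregate scores downloaded above.\n", "# Assumes `aggregate` is the parsed dict with the {'scores': [...]} structure shown in the output.\n", "summary = {\n", "    score['name']: (score['value'], score.get('stats', {}).get('stderr'))\n", "    for score in aggregate.get('scores', [])\n", "}\n", "for name, (value, stderr) in summary.items():\n", "    print(f'{name}: value={value}, stderr={stderr}')" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Job Artifacts\n", "\n", "View the job logs for more insight into the evaluation."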
] }, { "cell_type": "code", "metadata": {}, "source": [ "logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)\n", "for log_entry in logs_response.data:\n", " print(f\"[{log_entry.timestamp}] {log_entry.message.strip()}\")\n", "\n", "# Handle pagination\n", "while logs_response.next_page:\n", " logs_response = client.evaluation.benchmark_jobs.get_logs(\n", " name=job.name,\n", " page_cursor=logs_response.next_page\n", " )\n", " for log_entry in logs_response.data:\n", " print(f\"[{log_entry.timestamp}] {log_entry.message.strip()}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download artifacts the job produced during evaluation to a tarball." ] }, { "cell_type": "code", "metadata": {}, "source": [ "artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(name=job.name)\n", "artifacts.write_to_file(\"evaluation_artifacts.tar.gz\")\n", "print(\"Saved artifacts to evaluation_artifacts.tar.gz\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract files from the tarball with the following command and an `artifacts` directory will be created." ] }, { "cell_type": "code", "metadata": { "language": "shell", "vscode": { "languageId": "shellscript" } }, "source": [ "tar -xf evaluation_artifacts.tar.gz" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "**Scale Up:**\n", "- Run the full evaluation by omitting the job parameter `limit_samples`. The full evaluation can take up to 5-10 hours.\n", "\n", "**Apply to Your Domain:**\n", "- Search through available benchmarks and run an evaluation job with another industry benchmark.\n", "- Evaluate another model, hosted on https://build.nvidia.com or another service or host your own NIM.\n", "\n", "**Learn More:**\n", "- [Run Custom Evaluation Metrics](eval-metrics-index)\n", "- [Create Custom Benchmarks](eval-benchmarks-index)\n", "- [Other Evaluation Tutorials](evaluator-tutorials)\n", "\n", "## Cleanup\n", "\n", "Uncomment cleanup cells as needed to delete resources." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# # Delete evaluation jobs (PERMANENT)\n", "# print(\"Deleting evaluation jobs...\")\n", "# for job in client.evaluation.benchmark_jobs.list().data:\n", "# client.evaluation.benchmark_jobs.delete(job.name)\n", "# print(f\"Deleted evaluation job {job.name}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "source": [ "# # Delete secrets\n", "# print(\"Deleting secrets...\")\n", "# for secret in client.secrets.list().data:\n", "# client.secrets.delete(secret.name)\n", "# print(f\"Deleted secret {secret.name}\")" ], "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }