Benchmark Job Management#

Manage your evaluation jobs: monitor progress, fetch logs, and retrieve results and artifacts.

Note

Performance Tuning: You can improve evaluation throughput by setting job.params.parallelism to control the number of concurrent requests. A typical default value is 16, but you may need to adjust it based on your model’s capacity and rate limits.
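The snippet below is a minimal sketch of passing parallelism through a job's params; the configuration dictionary and its other field names are assumptions, so adapt them to how you actually define your benchmark job.

# Minimal sketch (field names other than params.parallelism are assumptions):
# include parallelism in the job's params when you define the evaluation job.
job_config = {
    "params": {
        "parallelism": 16,  # number of concurrent requests sent to the model
    },
    # ... your benchmark, target, and dataset settings
}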

Monitor Job#

Monitor the status and progress of a job in real-time.

import time

# Poll the job until it leaves a non-terminal state.
job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while job_status.status in ("active", "pending", "created"):
    time.sleep(10)
    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
    print("status:", job_status.status, job_status.status_details)
print(job_status)
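Once the loop exits, the job is in a terminal state and you can branch on it. The "completed" status string used below is an assumption; verify it against the terminal statuses your deployment reports.

# Sketch: react to the terminal status. The "completed" status string is an
# assumption; check the actual terminal statuses your service returns.
if job_status.status == "completed":
    print("Evaluation finished successfully.")
else:
    print(f"Job ended with status {job_status.status}: {job_status.status_details}")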

Visit Troubleshooting NeMo Evaluator for help diagnosing job failures.

Fetch Job Logs#

Get JSON logs with pagination. Logs are available while a job is active and after it terminates.

logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)
for log_entry in logs_response.data:
    print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

# Handle pagination
while logs_response.next_page:
    logs_response = client.evaluation.benchmark_jobs.get_logs(
        name=job.name,
        page_cursor=logs_response.next_page
    )
    for log_entry in logs_response.data:
        print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

View Evaluation Results#

Evaluation results are available once the evaluation job successfully completes. Visit Benchmark Results for details on fetching evaluation results to analyze the job’s output.

Aggregate Scores#

Get the overall scores for each metric:

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
)
print(aggregate)  # Returns a parsed dict with metric statistics

The aggregate scores provide statistical summaries including mean, standard deviation, min, max, and percentiles for each metric in the benchmark.
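As a rough illustration, you could walk the parsed dictionary and print a per-metric summary. The key names used below ("metrics", "mean", "stddev") are assumptions about the payload layout; print aggregate first to confirm the exact structure your benchmark returns.

# Sketch: summarize each metric. The "metrics", "mean", and "stddev" keys are
# assumptions about the payload layout; inspect the dict to confirm.
for metric_name, stats in aggregate.get("metrics", {}).items():
    print(f"{metric_name}: mean={stats.get('mean')}, stddev={stats.get('stddev')}")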

Row-Level Scores#

Get per-sample scores for detailed analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name=job.name,
)

for row in row_scores:
    print(row)  # Each row contains item, sample, and metrics

Row-level scores are returned in JSONL format, with each line containing the original dataset row, model output (for online jobs), and scores for each metric.
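As an example, you might collect rows whose score falls below a threshold for manual review. The field names used below ("metrics", "value") and the dict-style access are assumptions about the row layout; print one row first to confirm.

# Sketch: collect low-scoring rows for inspection. The "metrics"/"value"
# fields and dict-style access are assumptions; print a row to confirm.
THRESHOLD = 0.5
low_scoring = []
for row in row_scores:
    for metric_name, metric in row.get("metrics", {}).items():
        if metric.get("value", 1.0) < THRESHOLD:
            low_scoring.append((metric_name, row))
print(f"{len(low_scoring)} metric scores below {THRESHOLD}")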

List Available Results#

List all result files generated by a job:

results_list = client.evaluation.benchmark_jobs.results.list(name=job.name)

for result in results_list.data:
    print(f"Available: {result.name}")

Common result files include the following; the sketch after this list shows how to check which of them a job produced:

  • aggregate-scores.json - Aggregate scores with statistics

  • row-scores.jsonl - Per-row scores in JSONL format

  • artifacts - Tarball of job artifacts
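The sketch below reuses the list() call from above to check which of these files a job produced, assuming result.name matches the file names listed here.

# Sketch: check which common result files this job produced. Assumes
# result.name matches the file names listed above.
available = {result.name for result in results_list.data}
for expected in ("aggregate-scores.json", "row-scores.jsonl", "artifacts"):
    status = "found" if expected in available else "missing"
    print(f"{expected}: {status}")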

Download Job Artifacts#

Download the artifacts that the job produced during evaluation as a tarball.

artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(name=job.name)
artifacts.write_to_file("evaluation_artifacts.tar.gz")
print("Saved artifacts to evaluation_artifacts.tar.gz")

Extract the files from the tarball with the following command; an artifacts directory will be created.

tar -xf evaluation_artifacts.tar.gz
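If you prefer to stay in Python, the standard library's tarfile module performs the same extraction.

import tarfile

# Extract the downloaded tarball into the current directory; the archive's
# artifacts directory is created on extraction.
with tarfile.open("evaluation_artifacts.tar.gz") as archive:
    archive.extractall(path=".")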