Benchmark Job Management#
Manage your evaluation job: monitor its status, fetch its logs, and retrieve its results and artifacts.
Note
Performance Tuning: You can improve evaluation performance by setting job.params.parallelism to control the number of concurrent requests. A typical default value is 16, but you may need to adjust it based on your model's capacity and rate limits.
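As a minimal sketch, the fragment below shows where the parallelism value sits in a job configuration; the surrounding create call and exact payload shape are not covered in this section, so treat everything other than params.parallelism as an assumption.

# Hypothetical job configuration fragment; only params.parallelism is
# described in the note above. Adjust the value to your model's capacity
# and rate limits.
job_config = {
    "params": {
        "parallelism": 16,  # number of concurrent requests sent to the model
    },
}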
Monitor Job#
Monitor the status and progress of a job in real-time.
import time

# Poll the job until it leaves a non-terminal state.
job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while job_status.status in ("active", "pending", "created"):
    time.sleep(10)
    job_status = client.evaluation.benchmark_jobs.get_status(name=job.name)
    print("status:", job_status.status, job_status.status_details)

print(job_status)
Visit Troubleshooting NeMo Evaluator for help troubleshooting job failures.
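Before fetching results, you can add a terminal-state check after the polling loop. The success status name used below ("completed") is an assumption, so confirm it against the statuses your deployment actually reports.

# Hypothetical terminal-state check; "completed" is an assumed success status.
if job_status.status != "completed":
    raise RuntimeError(
        f"Job {job.name} ended in state {job_status.status}: {job_status.status_details}"
    )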
Fetch Job Logs#
Get JSON logs with pagination. Logs are available for an active job and after the job terminates.
logs_response = client.evaluation.benchmark_jobs.get_logs(name=job.name)
for log_entry in logs_response.data:
    print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")

# Handle pagination
while logs_response.next_page:
    logs_response = client.evaluation.benchmark_jobs.get_logs(
        name=job.name,
        page_cursor=logs_response.next_page,
    )
    for log_entry in logs_response.data:
        print(f"[{log_entry.timestamp}] {log_entry.message.strip()}")
View Evaluation Results#
Evaluation results are available once the evaluation job successfully completes. Visit Benchmark Results for details on fetching evaluation results to analyze the job’s output.
Aggregate Scores#
Get the overall scores for each metric:
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name=job.name,
)
print(aggregate)  # Returns a parsed dict with metric statistics
The aggregate scores provide statistical summaries including mean, standard deviation, min, max, and percentiles for each metric in the benchmark.
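For example, a minimal sketch that prints the mean for each metric; the key names inside the parsed dict (a per-metric stats mapping with a "mean" field) are assumptions, so inspect the printed aggregate object for the real layout.

# Hypothetical layout: metric name -> {"mean": ..., "stddev": ..., ...}.
# Adjust the key names after inspecting the actual aggregate dict.
for metric_name, stats in aggregate.items():
    if isinstance(stats, dict) and "mean" in stats:
        print(f"{metric_name}: mean={stats['mean']}")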
Row-Level Scores#
Get per-sample scores for detailed analysis:
row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name=job.name,
)
for row in row_scores:
    print(row)  # Each row contains item, sample, and metrics
Row-level scores are returned in JSONL format, with each line containing the original dataset row, model output (for online jobs), and scores for each metric.
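For deeper analysis, a hedged sketch that flags low-scoring rows is shown below; it assumes each row is parsed as a dict with a "metrics" mapping of metric name to numeric score, which is an assumption based on the description above, so adjust it to the actual JSONL schema.

# Hypothetical: flag rows whose metric value falls below a threshold.
# Assumes each row is a dict with "item" and "metrics" keys.
THRESHOLD = 0.5
for row in row_scores:
    for metric_name, value in row.get("metrics", {}).items():
        if isinstance(value, (int, float)) and value < THRESHOLD:
            print(f"Low score ({metric_name}={value}): {row.get('item')}")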
List Available Results#
List all result files generated by a job:
results_list = client.evaluation.benchmark_jobs.results.list(name=job.name)
for result in results_list.data:
    print(f"Available: {result.name}")
Common result files include:
aggregate-scores.json - Aggregate scores with statistics
row-scores.jsonl - Per-row scores in JSONL format
artifacts - Tarball of job artifacts
Download Job Artifacts#
Download the artifacts the job produced during evaluation as a tarball.
artifacts = client.evaluation.benchmark_jobs.results.artifacts.download(name=job.name)
artifacts.write_to_file("evaluation_artifacts.tar.gz")
print("Saved artifacts to evaluation_artifacts.tar.gz")
Extract the files from the tarball with the following command; an artifacts directory is created.
tar -xf evaluation_artifacts.tar.gz
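If you prefer to stay in Python, the same extraction can be done with the standard tarfile module.

import tarfile

# Extract the downloaded artifacts tarball into the current directory;
# this produces the same artifacts directory as the tar command above.
with tarfile.open("evaluation_artifacts.tar.gz") as tar:
    tar.extractall()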