Benchmark Results#
After your benchmark evaluation job completes, you can retrieve and analyze the results. Results include aggregate scores across all samples and detailed per-row scores for each metric in the benchmark.
Obtain Results#
You can obtain results using the SDK’s results methods.
Get Aggregate Scores#
Retrieve aggregate statistical summaries for all metrics in your evaluation:
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Access metrics
for metric in aggregate["metrics"]:
    metric_ref = metric["metric_ref"]
    scores = metric["aggregation"]["scores"]
    for score in scores:
        print(f"{metric_ref}: {score['name']} = {score['mean']}")
The aggregate scores include the following (see the sketch after this list):
Statistical measures: mean, standard deviation, min, max, sum
Percentiles: p10, p20, …, p90, p100
Histogram: Distribution bins for score values
Count information: Total samples and NaN counts
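Beyond the mean used above, each score entry carries the rest of these statistics. The following is a minimal sketch of reading them; field names other than mean, std_dev, min, max, and count are assumptions here, so inspect one entry's keys to confirm exactly what your deployment returns (for example, the percentile and histogram fields).

# Print additional statistics for each metric. Keys beyond "mean", "std_dev",
# "min", "max", and "count" (e.g. "sum", percentiles, histogram bins) are
# assumptions -- inspect the entry's keys to confirm the exact names.
for metric in aggregate["metrics"]:
    for score in metric["aggregation"]["scores"]:
        print(f"{metric['metric_ref']} / {score['name']}")
        print(f"  mean={score['mean']}, std_dev={score.get('std_dev')}")
        print(f"  min={score.get('min')}, max={score.get('max')}, count={score.get('count')}")
        print(f"  available fields: {sorted(score.keys())}")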
Get Row-Level Scores#
Retrieve per-sample scores for detailed analysis:
row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Iterate through rows (JSONL format)
for row in row_scores:
    item = row["item"]  # Original dataset row
    sample = row["sample"]  # Model output (for online jobs)
    metrics = row["metrics"]  # Scores per metric

    # Access individual metric scores
    for metric_ref, scores in metrics.items():
        for score in scores:
            print(f"Sample metric {metric_ref}: {score['name']} = {score['value']}")
Row-level scores are returned in JSONL format, where each line contains the following fields (see the sketch after this list):
item: The original dataset row
sample: Model output (populated for online evaluation jobs)
metrics: Scores for each metric, keyed by metric reference
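Because each row is a self-contained JSON object, you can also persist the download for offline analysis. A minimal sketch, assuming each row is a JSON-serializable dict as in the loop above (the filename is illustrative):

import json

# Write each row as one JSON line. If the download returns a one-shot iterator
# rather than a list, re-run the download before writing.
with open("row-scores.jsonl", "w") as f:
    for row in row_scores:
        f.write(json.dumps(row) + "\n")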
List Available Results#
List all result files generated by a job:
results_list = client.evaluation.benchmark_jobs.results.list(name="my-job-name")
for result in results_list.data:
    print(f"Available: {result.name}")
Common result files include the following (see the sketch after this list):
aggregate-scores - Aggregate scores with statistics
row-scores - Per-row scores in JSONL format
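You can use the listing to decide which downloads to request. A minimal sketch, assuming the names reported by the job match the identifiers above:

# Download only the result files the job actually produced.
available = {result.name for result in results_list.data}

if "aggregate-scores" in available:
    aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
        name="my-job-name",
    )
if "row-scores" in available:
    row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
        name="my-job-name",
    )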
Parse Results#
You can load and analyze evaluation results using Python.
Using Pandas for Analysis#
Load aggregate scores into a DataFrame using json_normalize:
import pandas as pd
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Flatten nested structure into a clean table
df_agg = pd.json_normalize(
    aggregate["metrics"],
    record_path=["aggregation", "scores"],
    meta=["metric_ref"],
)
# View key statistics
print(df_agg[["metric_ref", "name", "mean", "std_dev", "min", "max", "count"]])
Example output:
                  metric_ref          name  mean  std_dev  min  max  count
0   my-workspace/exact-match   exact-match   0.6     0.49  0.0  1.0      5
1  my-workspace/string-check  string-check   0.8     0.40  0.0  1.0      5
Load row-level scores for detailed analysis:
row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Flatten row scores into a DataFrame
rows = []
for row in row_scores:
    flat_row = {**row["item"]}
    for metric_ref, scores in row["metrics"].items():
        for score in scores:
            flat_row[score["name"]] = score["value"]
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)
Example output:
                input                output  reference  exact-match  string-check
0        What is 2+2?       The answer is 4          4          0.0           1.0
1  Capital of France?  Paris is the capital      Paris          0.0           1.0
2       Color of sky?                  Blue       blue          1.0           0.0
3     Largest planet?               Jupiter    Jupiter          1.0           1.0
4      Water formula?                   H2O        H2O          1.0           1.0
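As a quick sanity check, the column means of the row-level table should line up with the aggregate scores reported earlier (0.6 for exact-match and 0.8 for string-check in this example). The score column names depend on the metrics in your benchmark:

# Recompute per-metric means from the row-level scores
print(df_rows.select_dtypes(include="number").mean())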
Identify low-scoring samples for review:
# Find rows where any score is below threshold
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} samples needing review")