Benchmark Results#

After your benchmark evaluation job completes, you can retrieve and analyze the results. Results include aggregate scores across all samples and detailed per-row scores for each metric in the benchmark.

Obtain Results#

You can retrieve results with the SDK’s results methods, which expose both aggregate and row-level scores.

Get Aggregate Scores#

Retrieve aggregate statistical summaries for all metrics in your evaluation:

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Access metrics
for metric in aggregate["metrics"]:
    metric_ref = metric["metric_ref"]
    scores = metric["aggregation"]["scores"]
    for score in scores:
        print(f"{metric_ref}: {score['name']} = {score['mean']}")

The aggregate scores include the following (see the access sketch after this list):

  • Statistical measures: mean, standard deviation, min, max, sum

  • Percentiles: p10, p20, …, p90, p100

  • Histogram: Distribution bins for score values

  • Count information: Total samples and NaN counts
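As a sketch, you can pull several of these statistics from the same aggregation entries used above. The mean, std_dev, min, max, and count fields appear elsewhere in this guide; treat the exact names of the percentile and histogram fields as assumptions and confirm them against a downloaded payload.

# Sketch: summarize the core statistics for each metric.
# Field names beyond those shown in this guide (e.g., percentile and
# histogram keys) may differ; inspect a downloaded payload to confirm.
for metric in aggregate["metrics"]:
    for score in metric["aggregation"]["scores"]:
        print(
            f"{metric['metric_ref']} / {score['name']}: "
            f"mean={score['mean']}, std_dev={score['std_dev']}, "
            f"min={score['min']}, max={score['max']}, count={score['count']}"
        )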

Get Row-Level Scores#

Retrieve per-sample scores for detailed analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Iterate through rows (JSONL format)
for row in row_scores:
    item = row["item"]  # Original dataset row
    sample = row["sample"]  # Model output (for online jobs)
    metrics = row["metrics"]  # Scores per metric
    
    # Access individual metric scores
    for metric_ref, scores in metrics.items():
        for score in scores:
            print(f"Sample metric {metric_ref}: {score['name']} = {score['value']}")

Row-level scores are returned in JSONL format, where each line contains the following fields (an illustrative record follows the list):

  • item: The original dataset row

  • sample: Model output (populated for online evaluation jobs)

  • metrics: Scores for each metric, keyed by metric reference
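For reference, a single row-level record has roughly the shape sketched below. This is an illustration built from the fields shown above (item, sample, metrics, and per-score name/value); the contents of item depend on your dataset, and whether sample is empty for offline jobs is an assumption here.

# Illustrative shape of one JSONL record (values are examples, not real output)
example_row = {
    "item": {                    # original dataset row (field names depend on your dataset)
        "input": "What is 2+2?",
        "output": "The answer is 4",
        "reference": "4",
    },
    "sample": None,              # model output; populated for online evaluation jobs
    "metrics": {                 # scores keyed by metric reference
        "my-workspace/exact-match": [
            {"name": "exact-match", "value": 0.0},
        ],
    },
}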

List Available Results#

List all result files generated by a job:

results_list = client.evaluation.benchmark_jobs.results.list(name="my-job-name")

for result in results_list.data:
    print(f"Available: {result.name}")

Common result files include the following (see the availability check after this list):

  • aggregate-scores - Aggregate scores with statistics

  • row-scores - Per-row scores in JSONL format
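If you want to guard against downloading a result type that a job did not produce, you can check the listing first. This small sketch uses only the list and download calls shown above; the result name row-scores comes from the list in this section.

# Check which result files exist before downloading them
available = {result.name for result in results_list.data}

if "row-scores" in available:
    row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
        name="my-job-name",
    )
else:
    print("Row-level scores are not available for this job")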

Parse Results#

You can load and analyze evaluation results using Python.

Using Pandas for Analysis#

Load aggregate scores into a DataFrame using json_normalize:

import pandas as pd

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Flatten nested structure into a clean table
df_agg = pd.json_normalize(
    aggregate["metrics"],
    record_path=["aggregation", "scores"],
    meta=["metric_ref"],
)

# View key statistics
print(df_agg[["metric_ref", "name", "mean", "std_dev", "min", "max", "count"]])

Example output:

                  metric_ref          name  mean  std_dev  min  max  count
0   my-workspace/exact-match   exact-match   0.6     0.49  0.0  1.0      5
1  my-workspace/string-check  string-check   0.8     0.40  0.0  1.0      5

Load row-level scores for detailed analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Flatten row scores into a DataFrame
rows = []
for row in row_scores:
    flat_row = {**row["item"]}
    for metric_ref, scores in row["metrics"].items():
        for score in scores:
            flat_row[score["name"]] = score["value"]
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)

Example output:

                input                output reference  exact-match  string-check
0        What is 2+2?       The answer is 4         4          0.0           1.0
1  Capital of France?  Paris is the capital     Paris          0.0           1.0
2       Color of sky?                  Blue      blue          1.0           0.0
3     Largest planet?               Jupiter   Jupiter          1.0           1.0
4      Water formula?                   H2O       H2O          1.0           1.0
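As a quick sanity check, the per-metric means computed from the row-level DataFrame should match the mean values reported in the aggregate scores. The column names exact-match and string-check below come from the example output above; substitute the score names from your own benchmark.

# Recompute per-metric means from row-level scores and compare with the aggregates
score_cols = ["exact-match", "string-check"]  # score names from the example above
print(df_rows[score_cols].mean())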

Identify low-scoring samples for review:

# Find rows where any score falls below a threshold (0.7 here)
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} samples needing review")