Benchmark Results#

After your benchmark evaluation job completes, you can retrieve and analyze the results. Results include aggregate scores across all samples and detailed per-row scores for each metric in the benchmark.

Obtain Results#

You can retrieve results with the SDK’s results methods, which expose both aggregate and row-level scores.

Get Aggregate Scores#

Retrieve aggregate statistical summaries for all metrics in your evaluation:

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Access metrics
for metric in aggregate["metrics"]:
    metric_ref = metric["metric_ref"]
    scores = metric["aggregation"]["scores"]
    for score in scores:
        print(f"{metric_ref}: {score['name']} = {score['mean']}")

The aggregate scores include the following (see the access sketch after this list):

  • Statistical measures: mean, standard deviation, min, max, sum

  • Percentiles: p10, p20, …, p90, p100

  • Histogram: Distribution bins for score values

  • Count information: Total samples and NaN counts
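As a sketch, you can pull several of these statistics from the same aggregation entries used above. The mean, std_dev, min, max, and count fields appear elsewhere in this guide; treat the exact names of the percentile and histogram fields as assumptions and confirm them against a downloaded payload.

# Sketch: summarize the core statistics for each metric.
# Field names beyond those shown in this guide (e.g., percentile and
# histogram keys) may differ; inspect a downloaded payload to confirm.
for metric in aggregate["metrics"]:
    for score in metric["aggregation"]["scores"]:
        print(
            f"{metric['metric_ref']} / {score['name']}: "
            f"mean={score['mean']}, std_dev={score['std_dev']}, "
            f"min={score['min']}, max={score['max']}, count={score['count']}"
        )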

Get Row-Level Scores#

Retrieve per-sample scores for detailed analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Iterate through rows (JSONL format)
for row in row_scores:
    item = row["item"]  # Original dataset row
    sample = row["sample"]  # Model output (for online jobs)
    metrics = row["metrics"]  # Scores per metric
    
    # Access individual metric scores
    for metric_ref, scores in metrics.items():
        for score in scores:
            print(f"Sample metric {metric_ref}: {score['name']} = {score['value']}")

Row-level scores are returned in JSONL format, where each line contains the following fields (an illustrative record follows the list):

  • item: The original dataset row

  • sample: Model output (populated for online evaluation jobs)

  • metrics: Scores for each metric, keyed by metric reference
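For reference, a single row-level record has roughly the shape sketched below. This is an illustration built from the fields shown above (item, sample, metrics, and per-score name/value); the contents of item depend on your dataset, and whether sample is empty for offline jobs is an assumption here.

# Illustrative shape of one JSONL record (values are examples, not real output)
example_row = {
    "item": {                    # original dataset row (field names depend on your dataset)
        "input": "What is 2+2?",
        "output": "The answer is 4",
        "reference": "4",
    },
    "sample": None,              # model output; populated for online evaluation jobs
    "metrics": {                 # scores keyed by metric reference
        "my-workspace/exact-match": [
            {"name": "exact-match", "value": 0.0},
        ],
    },
}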

List Available Results#

List all result files generated by a job:

results_list = client.evaluation.benchmark_jobs.results.list(name="my-job-name")

for result in results_list.data:
    print(f"Available: {result.name}")

Common result files include the following (see the availability check after this list):

  • aggregate-scores - Aggregate scores with statistics

  • row-scores - Per-row scores in JSONL format
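If you want to guard against downloading a result type that a job did not produce, you can check the listing first. This small sketch uses only the list and download calls shown above; the result name row-scores comes from the list in this section.

# Check which result files exist before downloading them
available = {result.name for result in results_list.data}

if "row-scores" in available:
    row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
        name="my-job-name",
    )
else:
    print("Row-level scores are not available for this job")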

Parse Results#

You can load and analyze evaluation results using Python.

Using Pandas for Analysis#

Load aggregate scores into a DataFrame using json_normalize:

import pandas as pd

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    name="my-job-name",
)

# Flatten nested structure into a clean table
df_agg = pd.json_normalize(
    aggregate["metrics"],
    record_path=["aggregation", "scores"],
    meta=["metric_ref"],
)

# View key statistics
print(df_agg[["metric_ref", "name", "mean", "std_dev", "min", "max", "count"]])

Example output:

                  metric_ref          name  mean  std_dev  min  max  count
0   my-workspace/exact-match   exact-match   0.6     0.49  0.0  1.0      5
1  my-workspace/string-check  string-check   0.8     0.40  0.0  1.0      5

Load row-level scores for detailed analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    name="my-job-name",
)

# Flatten row scores into a DataFrame
rows = []
for row in row_scores:
    flat_row = {**row["item"]}
    for metric_ref, scores in row["metrics"].items():
        for score in scores:
            flat_row[score["name"]] = score["value"]
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)

Example output:

                input                output reference  exact-match  string-check
0        What is 2+2?       The answer is 4         4          0.0           1.0
1  Capital of France?  Paris is the capital     Paris          0.0           1.0
2       Color of sky?                  Blue      blue          1.0           0.0
3     Largest planet?               Jupiter   Jupiter          1.0           1.0
4      Water formula?                   H2O       H2O          1.0           1.0
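As a quick sanity check, the per-metric means computed from the row-level DataFrame should match the mean values reported in the aggregate scores. The column names exact-match and string-check below come from the example output above; substitute the score names from your own benchmark.

# Recompute per-metric means from row-level scores and compare with the aggregates
score_cols = ["exact-match", "string-check"]  # score names from the example above
print(df_rows[score_cols].mean())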

Identify low-scoring samples for review:

# Find rows where any score falls below a threshold (0.7 here)
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} samples needing review")