Bring Your Own Metric#

NeMo Microservices offers built-in metrics that can be configured to evaluate on your custom data. You can bring your own metric into the NeMo Microservices ecosystem with remote metrics.

A remote metric integrates with your custom evaluation logic served behind a REST API. You have full control over the evaluation logic that runs and the scores it reports.

Overview#

Remote metrics come in two types:

| Type | Use Case | Payload Structure |
| --- | --- | --- |
| Generic Remote (remote) | Custom endpoints with configurable body/scores | User-defined Jinja template |
| NeMo Agent Toolkit Remote (nemo-agent-toolkit-remote) | NAT evaluator endpoints | Fixed: {evaluator_name, item} |

NeMo Evaluator supports two evaluation modes:

| Mode | Use Case | Dataset Size | Response |
| --- | --- | --- | --- |
| Live Evaluation | Rapid prototyping, testing | Up to 10 rows | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Unlimited | Async (poll for completion) |

Prerequisites#

Before running remote metric evaluations:

  1. Workspace: Have a workspace created.

  2. Remote endpoint: Have your evaluation endpoint running and accessible.

  3. API key (if required): If your endpoint requires authentication, create a secret to store the API key.

  4. Initialize the SDK:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import (
    EvaluateInlineDatasetParam,
    InlineRemoteMetricParam,
    InlineNeMoAgentToolkitRemoteMetricParam,
)

client = NeMoMicroservices(
    base_url=os.getenv("NMP_BASE_URL"),
    workspace="default",
)

Live Evaluation#

Live evaluation provides immediate results for rapid iteration when developing and testing your metrics.

Generic Remote Metric#

Use a generic remote metric when you need full control over the request payload and score extraction:

metric: InlineRemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateInlineDatasetParam = {
    "rows": [
        {"reference": "The capital is Paris", "output": "Paris is the capital"},
        {"reference": "2", "output": "2"},
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

# Access results
for score in result.aggregate_scores:
    print(f"{score.name}: mean={score.mean}, count={score.count}")

Key configuration:

  • body: Jinja template for the request payload. Use {{ item.<column> }} to access dataset columns.

  • scores: List of score definitions, each with a parser object containing a JSONPath expression that extracts the score value from the response.
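
To make these two pieces concrete, here is a minimal sketch of the mechanics, assuming the jinja2 and jsonpath-ng packages are installed; it illustrates how the template and parser behave and is not the service's implementation:

from jinja2 import Template
from jsonpath_ng import parse as jsonpath_parse

# One dataset row, as in the live evaluation example above.
row = {"reference": "The capital is Paris", "output": "Paris is the capital"}

# 1. Each value in the metric's "body" is rendered as a Jinja template with
#    the row exposed as "item"; the result becomes the JSON payload that is
#    POSTed to your endpoint.
body_template = {"reference": "{{ item.reference }}", "response": "{{ item.output }}"}
payload = {key: Template(value).render(item=row) for key, value in body_template.items()}
# payload == {"reference": "The capital is Paris", "response": "Paris is the capital"}

# 2. Each score's JSONPath parser is evaluated against the endpoint's JSON response.
endpoint_response = {"result": {"accuracy": 1.0}}
accuracy = jsonpath_parse("$.result.accuracy").find(endpoint_response)[0].value
# accuracy == 1.0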

NeMo Agent Toolkit Remote Metric#

Use the NAT remote metric type when integrating with NeMo Agent Toolkit evaluators:

metric: InlineNeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://localhost:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

dataset: EvaluateInlineDatasetParam = {
    "rows": [
        {
            "id": "item_1",
            "input_obj": "What is the capital of France?",
            "expected_output_obj": "The capital of France is Paris.",
            "output_obj": "Paris is the capital of France.",
            "trajectory": [],
            "expected_trajectory": [],
            "full_dataset_entry": {},
        }
    ]
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset=dataset,
)

print(f"Score: {result.aggregate_scores[0].mean}")

The NAT metric automatically:

  • Sends payload: {"evaluator_name": "<name>", "item": <row_data>}

  • Extracts score from: $.result.score


Job-Based Evaluation#

For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.

Create a Job with Inline Metric#

from nemo_microservices.types.evaluation import (
    InlineRemoteMetricParam,
    MetricOfflineJobParam,
)

metric: InlineRemoteMetricParam = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",
    "body": {
        "reference": "{{ item.reference }}",
        "response": "{{ item.output }}"
    },
    "scores": [
        {
            "name": "accuracy",
            "parser": {"type": "json", "json_path": "$.result.accuracy"}
        }
    ],
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {"reference": "Paris", "output": "Paris"},
                {"reference": "2", "output": "2"},
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")
from nemo_microservices.types.evaluation import (
    InlineNeMoAgentToolkitRemoteMetricParam,
    MetricOfflineJobParam,
)

metric: InlineNeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "timeout_seconds": 30.0,
    "max_retries": 3,
}

job = client.evaluation.metric_jobs.create(
    spec=MetricOfflineJobParam(
        metric=metric,
        dataset={
            "rows": [
                {
                    "id": "item_1",
                    "input_obj": "What is the capital of France?",
                    "expected_output_obj": "The capital of France is Paris.",
                    "output_obj": "Paris is the capital.",
                    "trajectory": [],
                    "expected_trajectory": [],
                    "full_dataset_entry": {},
                }
            ]
        },
    ),
)

print(f"Job created: {job.name} ({job.id})")

Create a Stored Metric#

You can create a reusable metric and reference it by name in jobs:

# Create the metric
client.evaluation.metrics.create(
    type="remote",
    name="my-remote-metric",
    description="Custom evaluation metric for accuracy scoring",
    url="https://my-evaluation-server.test/evaluate",
    body={"reference": "{{ item.reference }}", "response": "{{ item.output }}"},
    scores=[{"name": "accuracy", "parser": {"type": "json", "json_path": "$.result.accuracy"}}],
)

# Use it in a job by reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
    spec={"metric": "default/my-remote-metric", "dataset": "default/my-dataset-fileset"},
)

Visit Manage Metrics for more information on how to modify or delete a metric.

Monitor Job Progress#

import time

while True:
    job_status = client.evaluation.metric_jobs.retrieve(job.name)
    print(f"Status: {job_status.status}")
    
    if job_status.status in ["completed", "error", "cancelled"]:
        break
    
    time.sleep(5)

Using API Key Secrets#

If your remote endpoint requires authentication, store the API key as a secret:

Create a Secret#

client.secrets.create(
    name="my-remote-api-key",
    data="your-api-key-value"
)

Reference the Secret in Your Metric#

from nemo_microservices.types.evaluation import (
    InlineRemoteMetricParam,
    InlineNeMoAgentToolkitRemoteMetricParam,
)

# Live evaluation with secret
metric: InlineRemoteMetricParam = {
    "type": "remote",
    "url": "https://my-authenticated-endpoint.test/evaluate",
    "body": {"input": "{{ item.input }}"},
    "scores": [{"name": "score", "parser": {"type": "json", "json_path": "$.score"}}],
    "api_key_secret": "my-remote-api-key",
}

result = client.evaluation.metrics.evaluate(
    metric=metric,
    dataset={"rows": [{"input": "test"}]},
)

# Job evaluation with secret
nat_metric: InlineNeMoAgentToolkitRemoteMetricParam = {
    "type": "nemo-agent-toolkit-remote",
    "url": "http://host.docker.internal:8001/evaluate_item",
    "evaluator_name": "similarity_eval",
    "api_key_secret": "my-remote-api-key",
}

job = client.evaluation.metric_jobs.create(
    spec={"metric": nat_metric, "dataset": {"rows": [...]}},
)

The secret is automatically resolved:

  • Live evaluation: Secret is fetched from the platform’s secrets service

  • Job evaluation: Secret is injected as an environment variable into the container

The API key is sent in the Authorization: Bearer <key> header.
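
If you implement the remote endpoint yourself, a minimal FastAPI sketch of checking that header might look like the following; the expected key value and the /evaluate route are placeholders to adapt to your own service:

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

EXPECTED_KEY = "your-api-key-value"  # placeholder; load from your own configuration

@app.post("/evaluate")
async def evaluate(payload: dict, authorization: str = Header(default="")) -> dict:
    # The metric sends the stored secret as "Authorization: Bearer <key>".
    if authorization != f"Bearer {EXPECTED_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API key")
    # ... your evaluation logic here ...
    return {"score": 1.0}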


Endpoint Requirements#

Your remote endpoint must:

  1. Accept POST requests with Content-Type: application/json

  2. Return a JSON response containing the score(s)

Example Endpoint (FastAPI)#

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvaluationRequest(BaseModel):
    reference: str
    response: str

class EvaluationResponse(BaseModel):
    result: dict

@app.post("/evaluate")
async def evaluate(request: EvaluationRequest) -> EvaluationResponse:
    # Your evaluation logic here
    accuracy = 1.0 if request.reference == request.response else 0.0
    return EvaluationResponse(result={"accuracy": accuracy})

NAT Endpoint Format#

NAT endpoints receive:

{
    "evaluator_name": "similarity_eval",
    "item": {
        "id": "item_1",
        "input_obj": "What is the capital of France?",
        "expected_output_obj": "The capital of France is Paris.",
        "output_obj": "Paris is the capital.",
        "trajectory": [],
        "expected_trajectory": [],
        "full_dataset_entry": {}
    }
}

And must return:

{
    "success": true,
    "result": {
        "id": "item_1",
        "score": 0.85,
        "reasoning": {"method": "cosine_similarity"}
    },
    "error": null
}
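
A minimal FastAPI sketch of a NAT-compatible endpoint that satisfies this contract could look like the following; the exact-match scoring is illustrative only:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NATEvalRequest(BaseModel):
    evaluator_name: str
    item: dict

@app.post("/evaluate_item")
async def evaluate_item(request: NATEvalRequest) -> dict:
    item = request.item
    # Toy scoring: exact match between the generated and expected outputs.
    score = 1.0 if item.get("output_obj") == item.get("expected_output_obj") else 0.0
    return {
        "success": True,
        "result": {
            "id": item.get("id"),
            "score": score,
            "reasoning": {"method": "exact_match"},
        },
        "error": None,
    }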

Configuration Options#

Metric Parameters#

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | "remote" or "nemo-agent-toolkit-remote" |
| url | string | Endpoint URL |
| body | dict | (Generic only) Jinja template for request payload |
| scores | list | (Generic only) List of score configuration objects (see below) |
| evaluator_name | string | (NAT only) Name of the NAT evaluator |
| api_key_secret | string | Optional secret name for API key authentication |
| timeout_seconds | float | Request timeout (default: 30.0) |
| max_retries | int | Max retry attempts (default: 3) |
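
For reference, a generic remote metric definition that sets every parameter applicable to it might look like this; the URL and secret name are placeholders:

metric = {
    "type": "remote",
    "url": "https://my-evaluation-server.test/evaluate",  # placeholder endpoint
    "body": {"reference": "{{ item.reference }}", "response": "{{ item.output }}"},
    "scores": [
        {"name": "accuracy", "parser": {"type": "json", "json_path": "$.result.accuracy"}}
    ],
    "api_key_secret": "my-remote-api-key",  # optional; omit if no authentication is needed
    "timeout_seconds": 30.0,
    "max_retries": 3,
}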

Score Configuration (Generic Remote Only)#

Each score object in the scores list supports the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Score identifier (lowercase, numbers, underscores only) |
| parser | object | Yes | Parser configuration for extracting the score value (see below) |
| description | string | No | Human-readable description of the score |
| minimum | float | No | Minimum expected value for the score range (default: None = no bound) |
| maximum | float | No | Maximum expected value for the score range (default: None = no bound) |

Parser configuration:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Parser type, must be "json" |
| json_path | string | Yes | JSONPath expression to extract the score value |

Example with all fields:

"scores": [
    {
        "name": "accuracy",
        "parser": {
            "type": "json",
            "json_path": "$.result.accuracy"
        },
        "description": "Measures response accuracy against reference",
        "minimum": 0.0,
        "maximum": 1.0
    }
]

Job Management#

After creating a job, see Metrics Job Management to monitor its progress and manage its execution.


Limitations#

  1. Network access: For job-based evaluation, endpoints must be accessible from the job container. Use host.docker.internal for local endpoints.

  2. Response format: Scores must be extractable via JSONPath from the response. Ensure your endpoint returns properly structured JSON.

  3. Live evaluation limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.

See also