Bring Your Own Metric#
NeMo Microservices offers built-in metrics that you can configure to evaluate your custom data. With remote metrics, you can also bring your own metric into the NeMo Microservices ecosystem.
A remote metric integrates your own evaluation logic, served behind a REST API, so you retain full control over the evaluation that executes and the scores it reports.
Overview#
Remote metrics support two types:
| Type | Use Case | Payload Structure |
|---|---|---|
| Generic Remote (`remote`) | Custom endpoints with configurable body/scores | User-defined Jinja template |
| NeMo Agent Toolkit Remote (`nemo-agent-toolkit-remote`) | NAT evaluator endpoints | Fixed: `{"evaluator_name": "<name>", "item": <row_data>}` |
NeMo Evaluator supports two evaluation modes:
| Mode | Use Case | Dataset Size | Response |
|---|---|---|---|
| Live Evaluation | Rapid prototyping, testing | Up to 10 rows | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Unlimited | Asynchronous (poll for completion) |
Prerequisites#
Before running remote metric evaluations:
- Workspace: Have a workspace created.
- Remote endpoint: Have your evaluation endpoint running and accessible.
- API key (if required): If your endpoint requires authentication, create a secret to store the API key.
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.types.evaluation import (
EvaluateInlineDatasetParam,
InlineRemoteMetricParam,
InlineNeMoAgentToolkitRemoteMetricParam,
)
client = NeMoMicroservices(
base_url=os.getenv("NMP_BASE_URL"),
workspace="default",
)
Live Evaluation#
Live evaluation provides immediate results for rapid iteration when developing and testing your metrics.
Generic Remote Metric#
Use a generic remote metric when you need full control over the request payload and score extraction:
metric: InlineRemoteMetricParam = {
"type": "remote",
"url": "https://my-evaluation-server.test/evaluate",
"body": {
"reference": "{{ item.reference }}",
"response": "{{ item.output }}"
},
"scores": [
{
"name": "accuracy",
"parser": {"type": "json", "json_path": "$.result.accuracy"}
}
],
"timeout_seconds": 30.0,
"max_retries": 3,
}
dataset: EvaluateInlineDatasetParam = {
"rows": [
{"reference": "The capital is Paris", "output": "Paris is the capital"},
{"reference": "2", "output": "2"},
]
}
result = client.evaluation.metrics.evaluate(
metric=metric,
dataset=dataset,
)
# Access results
for score in result.aggregate_scores:
print(f"{score.name}: mean={score.mean}, count={score.count}")
Key configuration:
- `body`: Jinja template for the request payload. Use `{{ item.<column> }}` to access dataset columns.
- `scores`: List of score definitions, each with a `parser` object containing a JSONPath expression for extracting values from the response (illustrated below).
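To make the template-and-parser flow concrete, the following standalone sketch reproduces the two steps locally with the jinja2 and jsonpath-ng packages. It is an illustration of the behavior described above, not the service's internal implementation:
# Illustration: how a dataset row is rendered through the body template,
# and how a score is extracted from the endpoint's JSON response.
from jinja2 import Template
from jsonpath_ng import parse

row = {"reference": "The capital is Paris", "output": "Paris is the capital"}
body_template = {
    "reference": "{{ item.reference }}",
    "response": "{{ item.output }}",
}

# Render each template value with the row exposed as `item`
payload = {key: Template(value).render(item=row) for key, value in body_template.items()}
# payload == {"reference": "The capital is Paris", "response": "Paris is the capital"}

# Extract the score from a sample endpoint response using the configured JSONPath
endpoint_response = {"result": {"accuracy": 1.0}}
accuracy = parse("$.result.accuracy").find(endpoint_response)[0].value  # 1.0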
NeMo Agent Toolkit Remote Metric#
Use the NAT remote metric type when integrating with NeMo Agent Toolkit evaluators:
metric: InlineNeMoAgentToolkitRemoteMetricParam = {
"type": "nemo-agent-toolkit-remote",
"url": "http://localhost:8001/evaluate_item",
"evaluator_name": "similarity_eval",
"timeout_seconds": 30.0,
"max_retries": 3,
}
dataset: EvaluateInlineDatasetParam = {
"rows": [
{
"id": "item_1",
"input_obj": "What is the capital of France?",
"expected_output_obj": "The capital of France is Paris.",
"output_obj": "Paris is the capital of France.",
"trajectory": [],
"expected_trajectory": [],
"full_dataset_entry": {},
}
]
}
result = client.evaluation.metrics.evaluate(
metric=metric,
dataset=dataset,
)
print(f"Score: {result.aggregate_scores[0].mean}")
The NAT metric automatically:
- Sends the payload: `{"evaluator_name": "<name>", "item": <row_data>}`
- Extracts the score from: `$.result.score`
Job-Based Evaluation#
For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.
Create a Job with Inline Metric#
For a generic remote metric:
from nemo_microservices.types.evaluation import (
InlineRemoteMetricParam,
MetricOfflineJobParam,
)
metric: InlineRemoteMetricParam = {
"type": "remote",
"url": "https://my-evaluation-server.test/evaluate",
"body": {
"reference": "{{ item.reference }}",
"response": "{{ item.output }}"
},
"scores": [
{
"name": "accuracy",
"parser": {"type": "json", "json_path": "$.result.accuracy"}
}
],
"timeout_seconds": 30.0,
"max_retries": 3,
}
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric=metric,
dataset={
"rows": [
{"reference": "Paris", "output": "Paris"},
{"reference": "2", "output": "2"},
]
},
),
)
print(f"Job created: {job.name} ({job.id})")
To create a job with a NAT remote metric instead:
from nemo_microservices.types.evaluation import (
InlineNeMoAgentToolkitRemoteMetricParam,
MetricOfflineJobParam,
)
metric: InlineNeMoAgentToolkitRemoteMetricParam = {
"type": "nemo-agent-toolkit-remote",
"url": "http://host.docker.internal:8001/evaluate_item",
"evaluator_name": "similarity_eval",
"timeout_seconds": 30.0,
"max_retries": 3,
}
job = client.evaluation.metric_jobs.create(
spec=MetricOfflineJobParam(
metric=metric,
dataset={
"rows": [
{
"id": "item_1",
"input_obj": "What is the capital of France?",
"expected_output_obj": "The capital of France is Paris.",
"output_obj": "Paris is the capital.",
"trajectory": [],
"expected_trajectory": [],
"full_dataset_entry": {},
}
]
},
),
)
print(f"Job created: {job.name} ({job.id})")
Create a Stored Metric#
You can create a reusable metric and reference it by name in jobs:
# Create the metric
client.evaluation.metrics.create(
type="remote",
name="my-remote-metric",
description="Custom evaluation metric for accuracy scoring",
url="https://my-evaluation-server.test/evaluate",
body={"reference": "{{ item.reference }}", "response": "{{ item.output }}"},
scores=[{"name": "accuracy", "parser": {"type": "json", "json_path": "$.result.accuracy"}}],
)
# Use it in a job by reference (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
spec={"metric": "default/my-remote-metric", "dataset": "default/my-dataset-fileset"},
)
Visit Manage Metrics for more information on how to modify or delete a metric.
Monitor Job Progress#
import time
while True:
job_status = client.evaluation.metric_jobs.retrieve(job.name)
print(f"Status: {job_status.status}")
if job_status.status in ["completed", "error", "cancelled"]:
break
time.sleep(5)
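For production polling, you may prefer a bounded wait instead of an open-ended loop. The helper below is a sketch that wraps the same retrieve call with an overall timeout; the function name, interval, and timeout values are illustrative:
import time

def wait_for_metric_job(client, job_name, poll_interval=5.0, max_wait=1800.0):
    """Poll a metric job until it reaches a terminal state or max_wait elapses."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        job_status = client.evaluation.metric_jobs.retrieve(job_name)
        if job_status.status in ["completed", "error", "cancelled"]:
            return job_status
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_name} did not complete within {max_wait} seconds")

final_status = wait_for_metric_job(client, job.name)
print(f"Final status: {final_status.status}")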
Using API Key Secrets#
If your remote endpoint requires authentication, store the API key as a secret:
Create a Secret#
client.secrets.create(
name="my-remote-api-key",
data="your-api-key-value"
)
Reference the Secret in Your Metric#
from nemo_microservices.types.evaluation import (
InlineRemoteMetricParam,
InlineNeMoAgentToolkitRemoteMetricParam,
)
# Live evaluation with secret
metric: InlineRemoteMetricParam = {
"type": "remote",
"url": "https://my-authenticated-endpoint.test/evaluate",
"body": {"input": "{{ item.input }}"},
"scores": [{"name": "score", "parser": {"type": "json", "json_path": "$.score"}}],
"api_key_secret": "my-remote-api-key",
}
result = client.evaluation.metrics.evaluate(
metric=metric,
dataset={"rows": [{"input": "test"}]},
)
# Job evaluation with secret
nat_metric: InlineNeMoAgentToolkitRemoteMetricParam = {
"type": "nemo-agent-toolkit-remote",
"url": "http://host.docker.internal:8001/evaluate_item",
"evaluator_name": "similarity_eval",
"api_key_secret": "my-remote-api-key",
}
job = client.evaluation.metric_jobs.create(
spec={"metric": nat_metric, "dataset": {"rows": [...]}},
)
The secret is automatically resolved:
- Live evaluation: The secret is fetched from the platform’s secrets service.
- Job evaluation: The secret is injected as an environment variable into the container.
The API key is sent in the Authorization: Bearer <key> header.
Endpoint Requirements#
Your remote endpoint must:
- Accept `POST` requests with `Content-Type: application/json`
- Return a JSON response containing the score(s)
Example Endpoint (FastAPI)#
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class EvaluationRequest(BaseModel):
reference: str
response: str
class EvaluationResponse(BaseModel):
result: dict
@app.post("/evaluate")
async def evaluate(request: EvaluationRequest) -> EvaluationResponse:
# Your evaluation logic here
accuracy = 1.0 if request.reference == request.response else 0.0
return EvaluationResponse(result={"accuracy": accuracy})
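If you configure api_key_secret, the evaluation service sends the key in the Authorization: Bearer <key> header described earlier. A sketch of validating it in the same FastAPI endpoint follows; how you provision EXPECTED_KEY on the server (here an environment variable) is an assumption, not something the platform prescribes:
import os
from typing import Optional

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
EXPECTED_KEY = os.environ["EVAL_API_KEY"]  # illustrative: supply the same key you stored as the secret

class EvaluationRequest(BaseModel):
    reference: str
    response: str

@app.post("/evaluate")
async def evaluate(request: EvaluationRequest, authorization: Optional[str] = Header(default=None)):
    # The platform sends "Authorization: Bearer <key>" when api_key_secret is set
    if authorization != f"Bearer {EXPECTED_KEY}":
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    accuracy = 1.0 if request.reference == request.response else 0.0
    return {"result": {"accuracy": accuracy}}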
NAT Endpoint Format#
NAT endpoints receive:
{
"evaluator_name": "similarity_eval",
"item": {
"id": "item_1",
"input_obj": "What is the capital of France?",
"expected_output_obj": "The capital of France is Paris.",
"output_obj": "Paris is the capital.",
"trajectory": [],
"expected_trajectory": [],
"full_dataset_entry": {}
}
}
And must return:
{
"success": true,
"result": {
"id": "item_1",
"score": 0.85,
"reasoning": {"method": "cosine_similarity"}
},
"error": null
}
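A minimal FastAPI sketch of a server that accepts this request shape and returns a compliant response is shown below; the exact-match scoring is a placeholder standing in for a real NAT evaluator:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NATEvaluationRequest(BaseModel):
    evaluator_name: str
    item: dict

@app.post("/evaluate_item")
async def evaluate_item(request: NATEvaluationRequest):
    item = request.item
    # Placeholder scoring: exact match between actual and expected outputs
    score = 1.0 if item.get("output_obj") == item.get("expected_output_obj") else 0.0
    return {
        "success": True,
        "result": {
            "id": item.get("id"),
            "score": score,
            "reasoning": {"method": "exact_match"},
        },
        "error": None,
    }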
Configuration Options#
Metric Parameters#
| Parameter | Type | Description |
|---|---|---|
| `type` | string | Metric type: `remote` or `nemo-agent-toolkit-remote` |
| `url` | string | Endpoint URL |
| `body` | dict | (Generic only) Jinja template for request payload |
| `scores` | list | (Generic only) List of score configuration objects (see below) |
| `evaluator_name` | string | (NAT only) Name of the NAT evaluator |
| `api_key_secret` | string | Optional secret name for API key authentication |
| `timeout_seconds` | float | Request timeout in seconds (default: 30.0) |
| `max_retries` | int | Max retry attempts (default: 3) |
Score Configuration (Generic Remote Only)#
Each score object in the scores list supports the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Score identifier (lowercase letters, numbers, underscores only) |
| `parser` | object | Yes | Parser configuration for extracting the score value (see below) |
| `description` | string | No | Human-readable description of the score |
| `minimum` | float | No | Minimum expected value for the score range (default: None = no bound) |
| `maximum` | float | No | Maximum expected value for the score range (default: None = no bound) |
Parser configuration:
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Parser type; must be `json` |
| `json_path` | string | Yes | JSONPath expression to extract the score value |
Example with all fields:
"scores": [
{
"name": "accuracy",
"parser": {
"type": "json",
"json_path": "$.result.accuracy"
},
"description": "Measures response accuracy against reference",
"minimum": 0.0,
"maximum": 1.0
}
]
Job Management#
After creating a job, see Metrics Job Management to monitor its progress and manage its execution.
Limitations#
- Network access: For job-based evaluation, endpoints must be accessible from the job container. Use `host.docker.internal` for local endpoints.
- Response format: Scores must be extractable via JSONPath from the response. Ensure your endpoint returns properly structured JSON.
- Live evaluation limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets (see the sketch after this list).
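If you want a single entry point that respects the 10-row live limit, a small dispatcher can route between the two modes. This sketch reuses only the SDK calls shown earlier on this page; the function name and return behavior are illustrative:
from nemo_microservices.types.evaluation import MetricOfflineJobParam

def evaluate_with_remote_metric(client, metric, rows):
    """Run a live evaluation for small datasets; otherwise submit a job."""
    if len(rows) <= 10:
        # Synchronous: results are returned immediately
        return client.evaluation.metrics.evaluate(metric=metric, dataset={"rows": rows})
    # Asynchronous: poll the returned job for completion (see Monitor Job Progress)
    return client.evaluation.metric_jobs.create(
        spec=MetricOfflineJobParam(metric=metric, dataset={"rows": rows}),
    )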
See also
Evaluation Results - Understanding and downloading results
LLM-as-a-Judge - Use an LLM to evaluate outputs
Agentic Evaluation - Evaluate agent workflows