Evaluate with LLM-as-a-Judge#
Use another LLM to evaluate outputs from your model or dataset with flexible scoring criteria. LLM-as-a-Judge is ideal for evaluating creative, complex, or domain-specific tasks where traditional metrics fall short.
Overview#
LLM-as-a-Judge evaluation works by sending your data to a “judge” LLM that scores responses according to criteria you define. You can evaluate:
Model outputs: Score how well a model responds to prompts
Pre-generated data: Evaluate existing question-answer pairs or conversations
Custom criteria: Define your own scoring rubrics or numerical ranges
NeMo Evaluator supports two evaluation modes:
| Mode | Use Case | Response |
|---|---|---|
| Live Evaluation | Rapid prototyping, developing metrics, testing configurations. Dataset is limited to 10 rows. | Immediate (synchronous) |
| Job Evaluation | Production workloads, full datasets | Async (poll for completion) |
Prerequisites#
Before running LLM-as-a-Judge evaluations:
Workspace: Create a workspace (or reuse an existing one). All resources (metrics, secrets, jobs) are scoped to a workspace.
Judge LLM endpoint: Have access to an LLM that will serve as your judge (e.g., a NIM endpoint or an OpenAI-compatible API).
API key (if required): If your judge endpoint requires authentication, create a secret to store the API key. The secret must be in the same workspace where you run evaluations.
Initialize the SDK:
import os
from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(base_url=os.getenv("NMP_BASE_URL"), workspace="default")
Live Evaluation#
Live evaluation is designed for rapid iteration when developing and refining your evaluation metrics. Use it to quickly test different judge prompts, scoring criteria, and data formats before committing to a full evaluation job. Results return immediately, making it easy to experiment and debug.
Basic Example with Range Scores#
Evaluate responses using numerical range scores (e.g., 1-5 scale):
result = client.evaluation.metrics.evaluate(
dataset={
"rows": [
{
"input": "What is the capital of France?",
"output": "The capital of France is Paris."
},
{
"input": "How do I make coffee?",
"output": "Boil water, add grounds to filter, pour water over grounds, let it drip."
}
]
},
metric={
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "helpfulness",
"description": "How helpful is the response (1=not helpful, 5=extremely helpful)",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "helpfulness"}
},
{
"name": "accuracy",
"description": "How accurate is the response (1=incorrect, 5=completely accurate)",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "accuracy"}
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "You are an expert judge. Rate each response on two dimensions (1-5 scale):\n- helpfulness: How useful is the response?\n- accuracy: How factually correct is the response?\n\nRespond with JSON: {\"helpfulness\": <1-5>, \"accuracy\": <1-5>}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
}
]
}
}
)
# Aggregate statistics across all rows
print(f"Metric: {result.metric}")
for score in result.aggregate_scores:
print(f" {score.name}: mean={score.mean:.2f}, count={score.count}")
# Per-row scores - useful for debugging and understanding individual results
print("\nPer-row scores:")
for row_result in result.row_scores:
print(f" Row {row_result.index}: {row_result.scores}")
The response includes both aggregate scores (statistics across all rows) and row scores (individual scores per row). For live evaluations, row scores are particularly valuable as they let you inspect exactly how the judge scored each input, making it easy to debug your metric configuration.
Example Response
# result.model_dump()
{
"metric": "quality-judge",
"aggregate_scores": [
{
"name": "helpfulness",
"count": 2,
"mean": 4.5,
"min": 4.0,
"max": 5.0
},
{
"name": "accuracy",
"count": 2,
"mean": 4.0,
"min": 3.0,
"max": 5.0
}
],
"row_scores": [
{
"index": 0,
"row": {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
"scores": {"helpfulness": 5, "accuracy": 5}
},
{
"index": 1,
"row": {"input": "How do I make coffee?", "output": "Boil water, add grounds..."},
"scores": {"helpfulness": 4, "accuracy": 3}
}
]
}
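Because live evaluation returns per-row scores, you can surface the rows the judge scored poorly and inspect their inputs directly. The snippet below is a small sketch that assumes numeric scores shaped like the response above; the cutoff of 4 is an arbitrary example value.
# Flag rows where any score falls below a chosen cutoff
LOW_SCORE_THRESHOLD = 4  # arbitrary example cutoff for a 1-5 scale
flagged = [
    row_result
    for row_result in result.row_scores
    if any(value < LOW_SCORE_THRESHOLD for value in row_result.scores.values())
]
for row_result in flagged:
    print(f"Row {row_result.index} scored low: {row_result.scores}")
    print(f"  Input: {row_result.row['input']}")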
Example with Rubric Scores#
Use rubric scores when you want categorical labels with explicit descriptions:
result = client.evaluation.metrics.evaluate(
dataset={
"rows": [
{"input": "Tell me a joke", "output": "Why did the chicken cross the road? To get to the other side!"},
{"input": "Explain quantum physics", "output": "I don't know."}
]
},
metric={
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "quality",
"description": "Overall quality of the response",
"rubric": [
{"label": "poor", "value": 0, "description": "Response is unhelpful or incorrect"},
{"label": "acceptable", "value": 1, "description": "Response is partially correct"},
{"label": "good", "value": 2, "description": "Response is correct and helpful"},
{"label": "excellent", "value": 3, "description": "Response is comprehensive and insightful"}
],
"parser": {"type": "json", "json_path": "quality"}
},
{
"name": "completeness",
"description": "How complete is the answer",
"rubric": [
{"label": "incomplete", "value": 0, "description": "Missing key information"},
{"label": "partial", "value": 1, "description": "Covers main points but lacks detail"},
{"label": "complete", "value": 2, "description": "Fully addresses the question"}
],
"parser": {"type": "json", "json_path": "completeness"}
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "You are an expert judge. Rate each response:\n- quality: poor | acceptable | good | excellent\n- completeness: incomplete | partial | complete\n\nRespond with JSON: {\"quality\": \"<label>\", \"completeness\": \"<label>\"}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}\n\nRate this response."
}
]
}
}
)
Example Response with Rubric Distribution
# Request with additional aggregate fields
result = client.evaluation.metrics.evaluate(
dataset=dataset,
metric=metric,
aggregate_fields=["rubric_distribution", "mode_category"]
)
# result.aggregate_scores[0] for "quality"
{
"name": "quality",
"count": 2,
"mean": 1.5,
"rubric_distribution": [
{"label": "poor", "value": 0, "count": 1},
{"label": "acceptable", "value": 1, "count": 0},
{"label": "good", "value": 2, "count": 0},
{"label": "excellent", "value": 3, "count": 1}
],
"mode_category": "poor"
}
Custom Aggregate Fields#
By default, aggregate scores include count, mean, min, and max. Request additional statistics:
result = client.evaluation.metrics.evaluate(
dataset=dataset,
metric=metric,
aggregate_fields=["std_dev", "variance", "percentiles", "histogram"]
)
# Access extended statistics
for score in result.aggregate_scores:
print(f"{score.name}:")
print(f" Mean: {score.mean:.3f}")
print(f" Std Dev: {score.std_dev:.3f}")
print(f" Variance: {score.variance:.3f}")
if score.percentiles:
print(f" Median (p50): {score.percentiles.p50:.3f}")
print(f" p90: {score.percentiles.p90:.3f}")
Job-Based Evaluation#
For larger datasets or production workloads, use job-based evaluation. Jobs run asynchronously and support datasets of any size.
Create an Evaluation Job#
Evaluate pre-generated outputs stored in a dataset:
job = client.evaluation.metric_jobs.create(
spec={
"metric": {
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "quality",
"description": "Overall quality of the response",
"rubric": [
{"label": "poor", "value": 0, "description": "Response is unhelpful"},
{"label": "good", "value": 1, "description": "Response is helpful"},
{"label": "excellent", "value": 2, "description": "Response is exceptional"}
],
"parser": {"type": "json", "json_path": "quality"}
}
],
"prompt_template": {
"messages": [
{
"role": "system",
"content": "Rate the response quality: poor, good, or excellent.\nRespond with JSON: {\"quality\": \"<label>\"}"
},
{
"role": "user",
"content": "Question: {{input}}\n\nResponse: {{output}}"
}
]
}
},
"dataset": {
"files_url": "hf://datasets/<namespace>/<dataset-name>"
},
"params": {
"parallelism": 16,
"limit_samples": 100 # Optional: limit for testing
}
}
)
print(f"Job created: {job.name} ({job.id})")
Reference a previously created metric by its URN:
# First, create and store the metric
client.evaluation.metrics.create(
name="my-quality-judge",
type="llm-judge",
model={
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
scores=[
{
"name": "quality",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "quality"}
}
],
prompt_template={
"messages": [
{"role": "system", "content": "Rate quality 1-5. Respond: {\"quality\": <1-5>}"},
{"role": "user", "content": "{{input}}\n{{output}}"}
]
}
)
# Then use it in a job by URN (workspace/metric-name)
job = client.evaluation.metric_jobs.create(
spec={
"metric": "default/my-quality-judge",
"dataset": {"files_url": "hf://datasets/<namespace>/<dataset-name>"},
"params": {"parallelism": 16}
}
)
Use inline rows for quick testing before running on full datasets:
job = client.evaluation.metric_jobs.create(
spec={
"metric": {
"type": "llm-judge",
"model": {
"url": "<judge-nim-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "informativeness",
"rubric": [
{"label": "uninformative", "value": 0},
{"label": "informative", "value": 1}
],
"parser": {"type": "json", "json_path": "informativeness"}
}
],
"prompt_template": {
"messages": [
{"role": "system", "content": "Is this response informative? Reply: {\"informativeness\": \"uninformative\" or \"informative\"}"},
{"role": "user", "content": "{{output}}"}
]
}
},
"dataset": {
"rows": [
{"output": "Paris is the capital of France."},
{"output": "I don't know."}
]
}
}
)
Monitor Job Progress#
import time
while True:
job_status = client.evaluation.metric_jobs.get_status(job.name)
print(f"Status: {job_status.status}")
if job_status.status in ["completed", "error", "cancelled"]:
break
time.sleep(5)
Retrieve Results#
# List available results for the job
results_list = client.evaluation.metric_jobs.results.list(job.name)
print(f"Available results: {[r.name for r in results_list.data]}")
# Get a specific result
result = client.evaluation.metric_jobs.results.retrieve(
name="evaluation_results",
job=job.name
)
print(result.model_dump_json(indent=2, exclude_none=True))
Example Job Results
{
"aggregate_scores": [
{
"name": "quality",
"count": 100,
"mean": 1.2,
"min": 0,
"max": 2
}
]
}
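If you want to keep a local copy of retrieved results (for example, to compare runs across different judge prompts), plain file I/O is enough. This is a convenience sketch, not an Evaluator API:
# Write the retrieved result to disk as JSON
with open("evaluation_results.json", "w") as f:
    f.write(result.model_dump_json(indent=2, exclude_none=True))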
Score Configuration#
LLM-as-a-Judge supports two types of scores:
Range Scores#
Use range scores for numerical ratings within a defined minimum and maximum:
{
"name": "relevance",
"description": "How relevant is the response (1=irrelevant, 5=highly relevant)",
"minimum": 1,
"maximum": 5,
"parser": {"type": "json", "json_path": "relevance"}
}
Rubric Scores#
Use rubric scores for categorical evaluations with explicit criteria:
{
"name": "sentiment",
"description": "Sentiment of the response",
"rubric": [
{"label": "negative", "value": -1, "description": "Response has negative tone"},
{"label": "neutral", "value": 0, "description": "Response is neutral"},
{"label": "positive", "value": 1, "description": "Response has positive tone"}
],
"parser": {"type": "json", "json_path": "sentiment"}
}
Tip
Rubric scores use structured outputs by default, which constrains the judge model to output valid JSON. This significantly reduces parsing errors.
Score Parsers#
Configure how scores are extracted from judge responses:
| Parser Type | Use Case | Example Pattern |
|---|---|---|
| json | Judge outputs JSON (recommended) | {"type": "json", "json_path": "quality"} |
| regex | Extract from free-form text | {"type": "regex", "pattern": "QUALITY: (\\w+)"} |
# JSON parser (recommended)
"parser": {"type": "json", "json_path": "quality"}
# Regex parser (for models that don't support structured output)
"parser": {"type": "regex", "pattern": "QUALITY: (\\w+)"}
Custom Judge Prompts#
Customize the judge prompt to match your evaluation criteria. Use Jinja2 templating to access data fields and score definitions.
Template Variables#
| Variable | Description |
|---|---|
| {{input}} | Input field from dataset row |
| {{output}} | Output field from dataset row |
| {{<field_name>}} | Any field from the dataset row |
| {{response}} | Model-generated response (when evaluating a model) |
| scores | Dictionary of score definitions |
Example: Custom Judge Template#
JUDGE_TEMPLATE = """You are an expert evaluator assessing AI assistant responses.
Evaluate the response on these criteria:
{% for score_name, score in scores.items() %}
- {{ score_name }}{% if score.description %}: {{ score.description }}{% endif %}
{% if score.rubric %}
Options: {% for r in score.rubric %}{{ r.label }}{% if not loop.last %}, {% endif %}{% endfor %}
{% endif %}
{% endfor %}
Respond with JSON containing your ratings.
"""
metric = {
"type": "llm-judge",
"model": {
"url": "<judge-url>/v1",
"name": "meta/llama-3.1-70b-instruct",
"format": "nim"
},
"scores": [
{
"name": "clarity",
"description": "How clear and understandable is the response",
"rubric": [
{"label": "confusing", "value": 0, "description": "Hard to understand"},
{"label": "clear", "value": 1, "description": "Easy to understand"},
{"label": "crystal_clear", "value": 2, "description": "Exceptionally well explained"}
],
"parser": {"type": "json", "json_path": "clarity"}
}
],
"prompt_template": {
"messages": [
{"role": "system", "content": JUDGE_TEMPLATE},
{"role": "user", "content": "Question: {{input}}\n\nResponse: {{output}}"}
]
}
}
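Before wiring a template into a metric, it can help to render it locally and inspect the system prompt the judge will see. The sketch below uses the jinja2 package directly and assumes the scores variable is a dictionary keyed by score name, matching the table above; it is a local debugging aid, not an Evaluator API call.
from jinja2 import Template

# Render JUDGE_TEMPLATE (defined above) with a sample score definition
preview = Template(JUDGE_TEMPLATE).render(
    scores={
        "clarity": {
            "description": "How clear and understandable is the response",
            "rubric": [
                {"label": "confusing", "value": 0},
                {"label": "clear", "value": 1},
                {"label": "crystal_clear", "value": 2},
            ],
        }
    }
)
print(preview)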
Managing Secrets for Authenticated Endpoints#
If your judge model endpoint requires an API key, store it as a secret. The secret is automatically resolved from the same workspace as your evaluation.
Create a Secret#
# Create a secret with your API key
client.secrets.create(
name="judge-api-key",
data="your-api-key-here"
)
Reference the Secret in Your Metric#
metric = {
"type": "llm-judge",
"model": {
"url": "https://api.example.com/v1",
"name": "gpt-4",
"format": "openai",
"api_key_secret": "judge-api-key" # Just the secret name
},
# ... scores and prompt_template
}
Inference Parameters#
Control judge model behavior with inference parameters:
"prompt_template": {
"messages": [...],
"temperature": 0.1, # Lower for more consistent scoring
"max_tokens": 1024, # Increase if judge needs more space
"timeout": 30, # Request timeout in seconds
"stop": ["<|end_of_text|>"] # Stop sequences
}
Reasoning Model Configuration#
For reasoning-enabled models (like Nemotron), configure reasoning parameters:
metric = {
"type": "llm-judge",
"model": {
"url": "<nim-url>/v1",
"name": "nvidia/llama-3.3-nemotron-super-49b-v1",
"format": "nim"
},
# ... scores ...
"prompt_template": {
"messages": [...],
"system_prompt": "'detailed thinking on'",
"reasoning_params": {
"end_token": "</think>"
},
"temperature": 0.1,
"max_tokens": 4096
}
}
Limitations#
Judge Model Quality: Evaluation quality depends on the judge model’s ability to follow instructions. Larger models (70B+) typically produce more consistent results.
NaN Scores: If the judge output cannot be parsed, the score is marked as NaN. Common causes:
Insufficient max_tokens (check for "finish_reason": "length" in results)
Judge model not following output format instructions
Use structured outputs or explicit format instructions to reduce NaN rates; a sketch for spotting unparsed rows follows this list.
Structured Output Requirement: Rubric scores require the judge model to support guided decoding. If your judge doesn’t support this, use regex parsers with explicit format instructions.
Live Evaluation Limits: Live evaluations are limited to 10 rows. Use job-based evaluation for larger datasets.
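The sketch below is one way to spot unparsed scores in a live-evaluation result; it assumes scores come back as floats with NaN marking parse failures, as described above.
import math

# Report any row whose scores could not be parsed (NaN values)
for row_result in result.row_scores:
    unparsed = [
        name
        for name, value in row_result.scores.items()
        if isinstance(value, float) and math.isnan(value)
    ]
    if unparsed:
        print(f"Row {row_result.index}: could not parse {unparsed}")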
See also
Evaluation Results - Understanding and downloading results
Agentic Evaluation - Evaluate agent workflows
RAG Evaluation - Evaluate retrieval-augmented generation