Template Evaluation Flow#

Template evaluation provides maximum flexibility for evaluating your models on proprietary, domain-specific, or novel tasks. With template evaluation, you can bring your own datasets, define your own prompts and templates using Jinja2, and select or implement the metrics that matter most for your use case. This approach is ideal when:

  • You want to evaluate on tasks, data, or formats not covered by academic benchmarks.

  • You need to measure model performance using custom or business-specific criteria.

  • You want to experiment with new evaluation methodologies, metrics, or workflows.

  • You need to create custom prompts and templates for specific use cases.

Prerequisites#

Tip

For detailed information on using Jinja2 templates in your template evaluations, including template objects, syntax, and examples, see the Templating Reference.

Example Job Execution#


Template Evaluation Types#

Table 1 Template Evaluation Comparison#

| Evaluation | Use Case | Metrics | Example |
|---|---|---|---|
| Chat/Completion Tasks | Flexible chat/completion evaluation with custom prompts and metrics | BLEU, string-check, custom metrics | Evaluate Q&A, summarization, or custom chat flows |
| Tool-Calling | Evaluate function/tool call accuracy (OpenAI-compatible) | Tool-calling accuracy | Evaluate function-calling or API tasks |


Chat/Completion Tasks#

Custom chat/completion evaluation allows you to assess model performance on flexible conversational or completion-based tasks using your own prompts, templates, and metrics. This is ideal for Q&A, summarization, or any scenario where you want to evaluate how well a model generates responses to user inputs, beyond standard academic benchmarks. You can define the structure of the conversation, specify expected outputs, and use metrics like BLEU or string-check to measure quality.

Example configuration:

{
  "type": "custom",
  "params": {
    "parallelism": 8
  },
  "tasks": {
    "qa": {
      "type": "chat-completion",
      "params": {
        "template": {
          "messages": [
            {"role": "user", "content": "{{item.question}}"},
            {"role": "assistant", "content": "{{item.answer}}"}
          ],
          "max_tokens": 20,
          "temperature": 0.7,
          "top_p": 0.9
        }
      },
      "metrics": {
        "bleu": {
          "type": "bleu",
          "params": {
            "references": ["{{item.reference_answer | trim}}"]
          }
        },
        "rouge": {
          "type": "rouge",
          "params": {
            "ground_truth": "{{item.reference_answer | trim}}"
          }
        },
        "string-check": {
          "type": "string-check",
          "params": {
            "check": [
              "{{item.reference_answer | trim}}",
              "equals",
              "{{output_text | trim}}"
            ]
          }
        },
        "f1": {
            "type": "f1",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        },
        "em": {
            "type": "em",
            "params": {
                "ground_truth": "{{item.reference_answer | trim}}"
            }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}
"question","answer","reference_answer"
"What is the capital of France?","Paris","The answer is Paris"
"What is 2+2?","4","The answer is 4"
"Square root of 256?","16","The answer is 16"

Example results:

{
  "tasks": {
    "qa": {
      "metrics": {
        "bleu": {
          "scores": {
            "sentence": {
              "value": 32.3,
              "stats": {
                "count": 200,
                "sum": 6460.66,
                "mean": 32.3
              }
            },
            "corpus": {
              "value": 14.0
            }
          }
        },
        "rouge": {
          "scores": {
            "rouge_1_score": {
              "value": 0.238671638808714,
              "stats": {
                "count": 10,
                "sum": 2.38671638808714,
                "mean": 0.238671638808714
              }
            },
            "rouge_2_score": {
              "value": 0.14953146173038,
              "stats": {
                "count": 10,
                "sum": 1.4953146173038,
                "mean": 0.14953146173038
              }
            },
            "rouge_3_score": {
              "value": 0.118334587614537,
              "stats": {
                "count": 10,
                "sum": 1.18334587614537,
                "mean": 0.118334587614537
              }
            },
            "rouge_L_score": {
              "value": 0.198059156106409,
              "stats": {
                "count": 10,
                "sum": 1.98059156106409,
                "mean": 0.198059156106409
              }
            }
          }
        },
        "string-check": {
          "scores": {
            "string-check": {
              "value": 0.255,
              "stats": {
                "count": 200,
                "sum": 51.0,
                "mean": 0.255
              }
            }
          }
        },
        "f1": {
          "scores": {
            "f1_score": {
              "value": 0.226293156870275,
              "stats": {
                "count": 10,
                "sum": 2.26293156870275,
                "mean": 0.226293156870275
              }
            }
          }
        },
        "em": {
          "scores": {
            "em_score": {
              "value": 0,
              "stats": {
                "count": 10,
                "sum": 0,
                "mean": 0
              }
            }
          }
        }
      }
    }
  }
}

Tool-Calling#

Evaluate the accuracy of function and tool calls against ground-truth calls. This evaluation supports the OpenAI-compatible function-calling format.

Example configuration:

{
    "type": "custom",
    "tasks": {
        "my-tool-calling-task": {
            "type": "chat-completion",
            "params": {
                "template": {
                    "messages": [
                        {"role": "user", "content": "{{item.messages[0].content}}"}
                    ]
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {
                        "tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"
                    }
                }
            },
            "dataset": {
                "files_url": "hf://datasets/default/<my-dataset>"
            }
        }
    }
}

Example dataset record (JSON):

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [{"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}]}
  ],
  "tool_calls": [
    {"function": {"name": "book_table", "arguments": {"people": 2, "time": "7pm"}}}
  ]
}

Example results:

{
  "tasks": {
    "my-tool-calling-task": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 1.0
            },
            "function_name_and_args_accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Templating for Tasks#

This section explains how to use Jinja2 templates for prompts and tasks in template evaluation jobs.

Available Template Objects#

When rendering templates, two default objects are available:

  • item: Represents the current item from the dataset.

  • sample: Contains data related to the output from the model. The sample.output_text represents the completion text for completion models and the content of the first message for chat models.

The properties on the item object are derived from the dataset’s column names (for CSVs) or keys (for JSONs):

  • All non-alphanumeric characters are replaced with underscores.

  • Column names are converted to lowercase.

  • In case of conflicts, numeric suffixes (_1, _2, and so on) are appended to the property names.

For example, a CSV column named Reference Answer becomes available in templates as {{item.reference_answer}}.

Templates for Chat Models#

Prompt templates are used to structure tasks for evaluating the performance of models, specifically following the NIM/OpenAI format for chat-completion tasks. Templates use the Jinja2 templating syntax. Variables are represented using double-curly brackets, for example, {{item.review}}.

Example Template for Chat-Completion Task#

{
    "messages": [{
        "role": "system",
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, {
        "role": "user",
        "content": "Determine if the following review is positive or negative: {{item.review}}"
    }]
}

Simple Chat Templating#

If your custom data is organized as prompt and ideal_response columns, you can frame it as a single-turn chat.

{ 
    "messages": [{
        "role": "system", 
        "content": "You are an expert in analyzing the sentiment of movie reviews."
    }, { 
        "role": "user", 
        "content": "Determine if the following review is positive or negative: {{item.prompt}}"
    }] 
} 

You can include this template in a call to a /chat/completions endpoint, as shown in the following example:

{
  "spec": {
    "config": {
      "type": "custom",
      "tasks": {
        "qa": {
          "type": "completion",
          "params": {
            "template": {
              "messages": [{
                "role": "system",
                "content": "You are a helpful, respectful and honest assistant. \nExtract from the following context the minimal span word for word that best answers the question.\n."
              }, { 
                "role": "user",
                "content": "Context: {{item.prompt}}"
              }] 
            }
          },
          "metrics": {
            "accuracy": {
              "type": "string-check",
              "params": {
                "check": [
                  "{{sample.output_text}}",
                  "contains",
                  "{{item.ideal_response}}"
                ]
              }
            }
          },
          "dataset": {
            "files_url": "hf://datasets/<my-dataset-namespace>/<my-dataset-name>"
          }
        }
      }
    },
    "target": {
      "type": "model",
      "model": {
        "api_endpoint": {
          "url": "<my-nim-url>/v1/chat/completions",
          "model_id": "<my-model-id>"
        }
      }
    }
  }
}
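
For reference, a minimal dataset for this job could look like the following CSV. This is an illustrative sketch; the prompt and ideal_response column names must match the variables referenced in the template and in the string-check metric:

"prompt","ideal_response"
"The Eiffel Tower is located in Paris. Where is the Eiffel Tower located?","Paris"
"Water boils at 100 degrees Celsius at sea level. At what temperature does water boil?","100 degrees Celsius"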

Messages Data Template#

If your custom data is already formatted as JSON, you can configure your template as follows:

{
    "messages": "{{ item.messages | tojson }}"
}
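
Each dataset item then carries its own messages array. A minimal illustrative item could look like this:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ]
}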

Inference Parameterization#

The template parameter also allows you to customize the inference parameters used when invoking your judge model. For example, to set max_tokens and temperature for the judge, specify appropriate values in your template. When you do this, you must also restructure the template so that your prompt becomes the value of the prompt key:

"template": {
  "prompt": {"messages": [{"role": "system", "content": "Ping!"}]},
  "max_tokens": 1024,
  "temperature": 0.05
}

Note

The default value of max_tokens for judge models is 1024. Set a value appropriate for your judge model based on the expected outputs (for example, if you use structured_output, ensure max_tokens is large enough to accommodate the full JSON output). Incomplete JSON outputs are represented as NaN values.


Metrics#

Template evaluation supports a wide range of metrics for different evaluation scenarios:

Table 2 Template Evaluation Metrics#

| Metric | Description | Range | Use Case | Key Parameters |
|---|---|---|---|---|
| bleu | Computes BLEU; 100 represents a perfect match; higher is better. | 0.0–100.0 | Translation, summarization | references (templated list); optional candidate (else uses output) |
| rouge | Computes ROUGE scores; higher is better. | 0.0–1.0 | Summarization, text generation | ground_truth; optional prediction (else uses output) |
| string-check | Compares generated text to a reference and returns 0 or 1. | 0.0–1.0 | Q&A, classification | check = [left, op, right]; ops: equals, !=, contains, startswith, endswith. Example: ["{{item.reference_answer \| trim}}", "equals", "{{output_text \| trim}}"] |
| f1 | Computes F1 score per item and corpus; higher indicates greater similarity. | 0.0–1.0 | Classification, Q&A | ground_truth; optional prediction (else uses output) |
| em | Exact Match after normalization (case-insensitive, punctuation/articles removed, whitespace normalized). | 0.0–1.0 | Q&A, classification | ground_truth; optional prediction (else uses output) |
| number-check | Parses the last number and compares it to a reference using numeric ops or a tolerance. | 0.0–1.0 | Extraction, math, structured outputs | check = [left, op, right]; ops: ==, !=, >=, >, <=, <. Tolerance form: ["absolute difference", left, right, "epsilon", <number>] |
| tool-calling-accuracy | Evaluates correctness of function/tool calls (names and arguments). | 0.0–1.0 | Function calling evaluation | tool_calls_ground_truth (templated), for example "{{ item.tool_calls \| tojson }}" |
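
Number-check is the only metric in this table that does not appear in a configuration example above, so the following is a minimal sketch of it based on the check syntax in the table. The metric key and the expected_value dataset column are illustrative assumptions; adjust them to your dataset.

        "number-check": {
          "type": "number-check",
          "params": {
            "check": [
              "{{item.expected_value}}",
              "==",
              "{{output_text}}"
            ]
          }
        }

For approximate numeric comparisons, use the tolerance form from the table instead, for example: "check": ["absolute difference", "{{item.expected_value}}", "{{output_text}}", "epsilon", 0.01].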

LLM-as-a-Judge Metric#

LLM-as-a-Judge can be used as a metric for advanced flexibility and can be combined with other custom metrics. For most LLM-as-a-Judge use cases, we recommend the llm-judge evaluation type. The following configuration shows the overall structure:

{
  "type": "custom",
  "tasks": {
    "my-task": {
      "type": "chat-completion",
      "metrics": {
        "my-judge-metric": {
          "type": "llm-judge",
          "params": {
            "model": {
              // judge model configuration
              "api_endpoint": {
                "url": "<nim_url>",
                "model_id": "meta/llama-3.1-70b-instruct",
                "api_key": "<OPTIONAL_API_KEY>"
              }
            },
            "template": {
              // required
            },
            "scores": {
              // required
            }
          }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}

LLM-as-a-Judge Core Concepts#

The judge template consists of the judge prompt, the judge output format, and Jinja templating. Each component is specific to the metric you want to evaluate and your evaluation dataset.

  1. Judge prompt: The judge prompt defines the metric that your target is evaluated against.

    For example, a judge prompt that evaluates the similarity between two items could be:

    Your task is to evaluate the semantic similarity between two responses.

  2. Judge output format: The judge must be guided to produce a consistent format that the score parser can parse.

    Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.

  3. Jinja templating: Use Jinja templating to fill the judge prompt with content from the dataset item and the target sample.

    Target model: The following is an example judge prompt with Jinja templating that compares a column from your dataset to the output text from the target model response. For this example, the dataset contains a column named answer that represents a golden example; output_text is the special template variable for the model response.

    Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{output_text}}

    Target dataset: The following is an example judge prompt with Jinja templating that compares multiple columns of your dataset. For this example, the dataset contains a column named answer, representing a golden example, and a column named response, which can hold an offline model response.

    Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{item.response}}

    You can use Jinja filters to modify the content (a sketch after this list shows how these pieces fit into a full judge template):

    {{ item.answer | trim | lower }}
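
Putting these pieces together, the judge prompt becomes the messages of the judge template. The following is a minimal sketch that mirrors the data task example later on this page:

{
  "messages": [
    {
      "role": "system",
      "content": "Your task is to evaluate the semantic similarity between two responses."
    },
    {
      "role": "user",
      "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.\nEXPECTED: {{item.answer}}\nACTUAL: {{output_text}}"
    }
  ]
}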
    

Task Types#

LLM-as-a-Judge is a metric that can be used with different task types:

  • chat-completion - Generate chat responses from a target model, then evaluate with an LLM judge. Use this for conversational tasks where you want to prompt a model in chat format and then judge the responses.

  • data - Evaluate existing prompt/response pairs directly (no model inference needed). Use this when you already have model outputs and want to judge them.

Choose chat-completion when you need to generate new outputs from a target model first, or choose data when you already have model outputs to evaluate.

Chat-Completion Task

Use the chat-completion task type when you want to generate chat responses from a target model and then judge them.

The task params.template is the template for rendering the inference request for the target model. The template can use jinja templating to render content from the task dataset.

{
  "type": "custom",
  "tasks": {
    "my-chat-task": {
      "type": "chat-completion",
      "params": {
        "template": {
          "messages": [
            {
              "role": "system",
              "content": "You are a helpful assistant."
            },
            {
              "role": "user",
              "content": "{{item.user_message}}"
            }
          ]
        }
      },
      "metrics": {
        "helpfulness": {
          "type": "llm-judge",
          "params": {
            "model": {
              "api_endpoint": {
                "url": "<my-judge-nim-url>",
                "model_id": "<my-judge-model-id>"
              }
            },
            "template": {
              "messages": [
                {
                  "role": "system",
                  "content": "Your task is to evaluate how helpful an assistant's response is."
                },
                {
                  "role": "user",
                  "content": "Rate helpfulness from 1-5. Format: HELPFUL: X\n\nUSER: {{item.user_message}}\nASSISTANT: {{output_text}}"
                }
              ]
            },
            "scores": {
              "helpfulness": {
                "type": "integer",
                "parser": {
                  "type": "regex",
                  "pattern": "HELPFUL: (\\d)"
                }
              }
            }
          }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}

Example dataset for Chat-Completion Task:

{
  "user_message": "What is the capital of France?"
}

Tip

The default value for max_tokens is 4096. You can customize the model generation’s temperature and max_tokens by defining them in your configuration’s tasks.<task>.params.template field:

"template": {
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "{{item.user_message}}"
    }
  ],
  "max_tokens": 8192,
  "temperature": 0.01
}
Completion Task

To evaluate completions generated by a target model instead of chat conversations, use the task type completion and provide task params.template as the prompt string.

{
  "type": "custom",
  "tasks": {
    "my-chat-task": {
      "type": "chat-completion",
      "params": {
        "template": "Answer this question: {{item.question}}\nAnswer:"
      },
      // requires model, template, scores
    }
  }
}

Example dataset for Completion Task:

{
  "question": "What is the capital of France?",
  "expected_answer": "Paris"
}

Data Task

Use the data task type when you have existing prompt/response pairs to evaluate directly.

{
  "type": "custom",
  "tasks": {
    "my-data-task": {
      "type": "data",
      "metrics": {
        "accuracy": {
          "type": "llm-judge",
          "params": {
            "model": {
              "api_endpoint": {
                "url": "<my-judge-nim-url>",
                "model_id": "<my-judge-model-id>"
              }
            },
            "template": {
              "messages": [
                {
                  "role": "system",
                  "content": "Your task is to evaluate the semantic similarity between two responses."
                },
                {
                  "role": "user",
                  "content": "Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10. \n\nRESPONSE 1: {{item.reference_answer}}\n\nRESPONSE 2: {{item.model_output}}.\n\n"
                }
              ]
            },
            "scores": {
              "similarity": {
                "type": "integer",
                "parser": {
                  "type": "regex",
                  "pattern": "SIMILARITY: (\\d+)"
                }
              }
            }
          }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/default/<my-dataset>"
      }
    }
  }
}

Example dataset for Data Task:

{
  "reference_answer": "Paris",
  "model_output": "The capital of France is Paris"
}

Chat-Completion Task Result:

{
  "tasks": {
    "my-chat-task": {
      "metrics": {
        "helpfulness": {
          "scores": {
            "helpfulness": {
              "value": 4,
              "stats": {
                "count": 75,
                "mean": 4.1,
                "min": 2,
                "max": 5
              }
            }
          }
        }
      }
    }
  }
}

Data Task Result:

{
  "tasks": {
    "my-data-task": {
      "metrics": {
        "accuracy": {
          "scores": {
            "similarity": {
              "value": 8,
              "stats": {
                "count": 100,
                "mean": 7.5,
                "min": 3,
                "max": 10
              }
            }
          }
        }
      }
    }
  }
}

Score Parser#

Build a score parser tailored to your judge model and evaluation task. A score must be a numerical value or a boolean.

Supported score types: integer, number, boolean

Regex#

Use a regular expression to parse the score from the judge model output. Build a regex that matches the judge output format specified in your configuration.

For example, the following pattern will match the formatted judge response SIMILARITY: 4 as outlined in the judge prompt:

Your task is to evaluate the semantic similarity between two responses. Respond in the following format SIMILARITY: 4. The similarity should be a score between 0 and 10.

            "scores": {
              "similarity": {
                "type": "integer",
                "parser": {
                  "type": "regex",
                  "pattern": "SIMILARITY: (\\d+)"
                }
              }
            }

Structured Output#

Some models perform better as a judge when configured with structured output.

Use the JSON score parser in conjunction with structured_output to specify the output format of the judge model and the JSON path to read the score from.

  • structured_output accepts a JSON schema.

  • The score parser.json_path must refer to a property defined in the JSON schema.

              "structured_output": {
                "schema": {
                  "type": "object",
                  "properties": {
                    "similarity": { "type": "number" }
                   },
                   "required": ["similarity"],
                   "additionalProperties": false
                }
              },
              "scores": {
                "similarity": {
                  "type": "number",
                  "parser": {
                    "type": "json",
                    "json_path": "similarity"
                  }
                }
              }
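
For orientation, the structured_output and scores fragments above sit inside the llm-judge metric params alongside model and template. The following sketch shows one possible placement, inferred from the sibling keys shown above; verify the exact placement against your Evaluator configuration schema:

"my-judge-metric": {
  "type": "llm-judge",
  "params": {
    "model": {
      // judge model configuration
    },
    "template": {
      // judge prompt messages
    },
    "structured_output": {
      "schema": {
        "type": "object",
        "properties": {
          "similarity": { "type": "number" }
        },
        "required": ["similarity"],
        "additionalProperties": false
      }
    },
    "scores": {
      "similarity": {
        "type": "number",
        "parser": {
          "type": "json",
          "json_path": "similarity"
        }
      }
    }
  }
}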

The JSON score parser leverages NIM structured generation to format the output of the judge model.

Important

When using structured output, set an appropriate max_tokens value to accommodate the expected output. The default max_tokens value is 1024 for judge models. See Evaluate with LLM-as-a-Judge for details.

Important

Evaluator does not support structured output with OpenAI yet.

Important

Structured output and JSON score parser may not work well with reasoning models as the judge.

Custom Dataset Format#

Template evaluation supports custom datasets in various formats:

  • CSV files: Simple tabular data with headers

  • JSON files: Structured data with nested objects

  • JSONL files: Line-delimited JSON objects

The dataset format depends on your specific use case and the template structure you’re using. For detailed examples, see the configuration examples above.
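
For example, a minimal JSONL dataset equivalent to the CSV Q&A example earlier on this page could look like the following (illustrative only; use the field names that your templates and metrics reference):

{"question": "What is the capital of France?", "answer": "Paris", "reference_answer": "The answer is Paris"}
{"question": "What is 2+2?", "answer": "4", "reference_answer": "The answer is 4"}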