About Evaluating#

Evaluation is powered by the NeMo Evaluator microservice, a cloud-native component of NeMo Microservices for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

NeMo Evaluator enables real-time evaluation of your LLM application through APIs, guiding you in refining and optimizing LLMs for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data; this makes evaluation cost-effective and well suited to pre-deployment checks and regression testing.
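For example, a pre-deployment check in a CI pipeline can submit an evaluation job and fail the build when a score regresses. The sketch below is a minimal illustration under stated assumptions: the base URL, the `/v1/evaluation/jobs` routes, and the `score` field in the results payload are placeholders rather than the definitive API; refer to the Evaluator API reference for your deployment's actual endpoints and response schema.

```python
import sys
import time

import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service
THRESHOLD = 0.80                    # minimum acceptable score for this check

# Submit a job that references a previously created target and config
# (field names are assumptions for illustration).
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": "my-model-target", "config": "my-benchmark-config"},
).json()
job_id = job["id"]

# Wait for the job to reach a terminal state.
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}").json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

# Read a score from the results payload (the key is a placeholder) and gate on it.
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results").json()
score = results.get("score", 0.0)
print(f"Evaluation score: {score}")
sys.exit(0 if score >= THRESHOLD else 1)
```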

See also

For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.


Tutorials#

After deploying NeMo Microservices with the Quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.

Run a Benchmark Evaluation

Learn how to run an evaluation with a built-in benchmark.

Run an LLM Judge Eval

Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset. See Evaluate Response Quality with LLM-as-a-Judge.

Understanding the Evaluation Workflow#

Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Microservices. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (running the evaluation).

High-Level Evaluation Process#

At a high level, the evaluation process consists of the following steps, illustrated in the code sketch after the list:

  1. (Optional) Prepare Custom Data: Determine if your evaluation requires a custom dataset.

  2. Run an Evaluation Job: Submit an evaluation metric job or benchmark job.

  3. Retrieve Results: Get your evaluation results to analyze model performance.
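The following is a minimal sketch of these steps against the Evaluator REST API. It is illustrative only: it assumes a deployed service at a placeholder URL plus `/v1/evaluation/targets`, `/v1/evaluation/configs`, and `/v1/evaluation/jobs` routes, and the payload fields (`type`, `name`, `model`, and so on) are assumptions, so check the Evaluator API reference for the exact schema.

```python
import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service

# Step 1 (optional): register a custom dataset first if your evaluation needs one;
# built-in benchmarks can skip this step.

# Step 2: create a target (what to evaluate) -- here, a model served behind an
# OpenAI-compatible endpoint. Field names are assumptions for illustration.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "name": "my-model-target",
        "model": {
            "api_endpoint": {"url": "http://my-model:8000/v1", "model_id": "my-model"}
        },
    },
).json()

# Create a config (how to evaluate) -- here, a built-in benchmark.
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={"type": "gsm8k", "name": "my-benchmark-config"},
).json()

# Submit a job that combines the target and the config.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": target["name"], "config": config["name"]},
).json()
print("Submitted evaluation job:", job.get("id"))

# Step 3: retrieve results after the job completes
# (see "Run Evaluation Jobs" below for monitoring and results retrieval).
```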


Available Evaluations#

Review configurations, data formats, and result examples for each evaluation.

Industry Benchmarks

Standard benchmarks for code generation, safety, reasoning, and tool-calling.

Retrieval

Evaluate document retrieval pipelines on standard or custom datasets. See Retriever Evaluation Metrics.

RAG

Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Metrics.

Agentic

Assess agent-based and multi-step reasoning models, including topic adherence and tool use. See Agentic Evaluation Metrics.

LLM-as-a-Judge

Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges. See Evaluate with LLM-as-a-Judge.

Similarity Metrics

Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.

Run Evaluation Jobs#

Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Microservices orchestrates the evaluation workflow, runs the specified metrics, and stores the results.

Jobs can be monitored in real time and support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
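Below is a minimal sketch of monitoring a job and pulling its artifacts once it finishes, under the same assumptions as the earlier examples: the base URL, the job ID, and the `/results` and `/logs` routes are placeholders for illustration, so see Benchmark Job Management and the Evaluator API reference for the authoritative endpoints.

```python
import time

import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service
JOB_ID = "eval-job-123"             # placeholder ID returned when the job was created

# Poll the job until it reaches a terminal state, printing progress along the way.
while True:
    job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}").json()
    print(f"status={job.get('status')} progress={job.get('progress')}")
    if job.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

# After completion, fetch the detailed results and logs
# (the /results and /logs routes are assumptions for illustration).
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}/results").json()
logs = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}/logs").text

print(results)      # metrics and scores
print(logs[:2000])  # first part of the job log output
```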

Manage Jobs#

Understand how to run evaluation jobs.

Benchmark Jobs

Create a job with an industry benchmark or a custom benchmark. See Evaluation Benchmarks.

Monitor Benchmark Jobs

Monitor the status and progress of a job in real time. See Benchmark Job Management.

Benchmark Job Results

Get the results of your evaluation jobs. See Benchmark Results.

Metric Jobs

Create a metric job. See Evaluation Metrics.

API Reference

View the NeMo Evaluator API reference. See Evaluator API.