About Evaluating#

Evaluation is powered by the NeMo Evaluator microservice, a cloud-native component of NeMo Microservices for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

NeMo Evaluator enables real-time evaluation of your LLM application through APIs, guiding you in refining and optimizing LLMs for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data; this makes evaluation cost-effective and well suited to pre-deployment checks and regression testing.
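For example, a pre-deployment check in a CI pipeline can submit an evaluation job and fail the build when a score regresses. The sketch below is a minimal illustration under stated assumptions: the base URL, the `/v1/evaluation/jobs` routes, and the `score` field in the results payload are placeholders rather than the definitive API; refer to the Evaluator API reference for your deployment's actual endpoints and response schema.

```python
import sys
import time

import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service
THRESHOLD = 0.80                    # minimum acceptable score for this check

# Submit a job that references a previously created target and config
# (field names are assumptions for illustration).
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": "my-model-target", "config": "my-benchmark-config"},
).json()
job_id = job["id"]

# Wait for the job to reach a terminal state.
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}").json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

# Read a score from the results payload (the key is a placeholder) and gate on it.
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results").json()
score = results.get("score", 0.0)
print(f"Evaluation score: {score}")
sys.exit(0 if score >= THRESHOLD else 1)
```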

See also

For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.


Tutorials#

After deploying NeMo Microservices with the Quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides help you evaluate models using different benchmarks and metrics.

Run a Benchmark Evaluation

Learn how to run an evaluation with a built-in benchmark.

Run an LLM Judge Eval

Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset. See Evaluate Response Quality with LLM-as-a-Judge.

Understanding the Evaluation Workflow#

Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Microservices. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (running the evaluation).

High-Level Evaluation Process#

At a high level, the evaluation process consists of the following steps, illustrated in the code sketch after the list:

  1. (Optional) Prepare Custom Data: Determine if your evaluation requires a custom dataset.

  2. Run an Evaluation Job: Submit an evaluation metric job or benchmark job.

  3. Retrieve Results: Get your evaluation results to analyze model performance.
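The following is a minimal sketch of these steps against the Evaluator REST API. It is illustrative only: it assumes a deployed service at a placeholder URL plus `/v1/evaluation/targets`, `/v1/evaluation/configs`, and `/v1/evaluation/jobs` routes, and the payload fields (`type`, `name`, `model`, and so on) are assumptions, so check the Evaluator API reference for the exact schema.

```python
import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service

# Step 1 (optional): register a custom dataset first if your evaluation needs one;
# built-in benchmarks can skip this step.

# Step 2: create a target (what to evaluate) -- here, a model served behind an
# OpenAI-compatible endpoint. Field names are assumptions for illustration.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "name": "my-model-target",
        "model": {
            "api_endpoint": {"url": "http://my-model:8000/v1", "model_id": "my-model"}
        },
    },
).json()

# Create a config (how to evaluate) -- here, a built-in benchmark.
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={"type": "gsm8k", "name": "my-benchmark-config"},
).json()

# Submit a job that combines the target and the config.
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": target["name"], "config": config["name"]},
).json()
print("Submitted evaluation job:", job.get("id"))

# Step 3: retrieve results after the job completes
# (see "Run Evaluation Jobs" below for monitoring and results retrieval).
```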


Available Evaluations#

Review configurations, data formats, and result examples for each evaluation.

Industry Benchmarks

Standard benchmarks for code generation, safety, reasoning, and tool-calling.

Retrieval

Evaluate document retrieval pipelines on standard or custom datasets. See Retriever Evaluation Metrics.

RAG

Evaluate Retrieval Augmented Generation pipelines (retrieval plus generation). See RAG Evaluation Metrics.

Agentic

Assess agent-based and multi-step reasoning models, including topic adherence and tool use. See Agentic Evaluation Metrics.

LLM-as-a-Judge

Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges. See Evaluate with LLM-as-a-Judge.

Similarity Metrics

Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.

Run Evaluation Jobs#

Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Microservices orchestrates the evaluation workflow, runs the specified metrics, and stores the results.

Jobs can be monitored in real time and support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
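Below is a minimal sketch of monitoring a job and pulling its artifacts once it finishes, under the same assumptions as the earlier examples: the base URL, the job ID, and the `/results` and `/logs` routes are placeholders for illustration, so see Benchmark Job Management and the Evaluator API reference for the authoritative endpoints.

```python
import time

import requests

EVALUATOR_URL = "http://nemo.test"  # placeholder URL for a deployed Evaluator service
JOB_ID = "eval-job-123"             # placeholder ID returned when the job was created

# Poll the job until it reaches a terminal state, printing progress along the way.
while True:
    job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}").json()
    print(f"status={job.get('status')} progress={job.get('progress')}")
    if job.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

# After completion, fetch the detailed results and logs
# (the /results and /logs routes are assumptions for illustration).
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}/results").json()
logs = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{JOB_ID}/logs").text

print(results)      # metrics and scores
print(logs[:2000])  # first part of the job log output
```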

Manage Jobs#

Understand how to run evaluation jobs.

Benchmark Jobs

Create a job with an industry benchmark or a custom benchmark. See Evaluation Benchmarks.

Monitor Benchmark Jobs

Monitor the status and progress of a job in real time. See Benchmark Job Management.

Benchmark Job Results

Get the results of your evaluation jobs. See Benchmark Results.

Metric Jobs

Create a metric job. See Evaluation Metrics.

API Reference

View the NeMo Evaluator API reference. See Evaluator API.