About Evaluating#
Evaluation is powered by NeMo Evaluator, a cloud-native microservice within NeMo Microservices for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for over 100 industry benchmarks, LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.
NeMo Evaluator enables real-time evaluation of your LLM application through APIs, helping you refine and optimize models for better performance and real-world applicability. You can automate the NeMo Evaluator APIs within development pipelines, enabling faster iteration without live data, which makes the service cost-effective for pre-deployment checks and regression testing.
See also
For a comprehensive overview of evaluation concepts, capabilities, and how NeMo Evaluator fits into the NeMo ecosystem, refer to Evaluation Concepts.
Tutorials#
After deploying NeMo Microservices by following the Quickstart, use the following tutorials to learn how to accomplish common evaluation tasks. These step-by-step guides show you how to evaluate models using different benchmarks and metrics.
Learn how to run an evaluation with a built-in benchmark.
Learn how to evaluate a fine-tuned model using the LLM Judge metric with a custom dataset.
Understanding the Evaluation Workflow#
Before diving into specific evaluation flows, understand the general workflow for evaluating models with NeMo Microservices. The evaluation process involves creating targets (what to evaluate), configs (how to evaluate), and jobs (which run the evaluation).
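For example, creating a target and a config with the Python `requests` library might look like the following sketch. The base URL, endpoint paths (`/v1/evaluation/targets`, `/v1/evaluation/configs`), and payload fields shown here are illustrative assumptions; refer to the NeMo Evaluator API reference for the exact schema.

```python
import requests

EVALUATOR_URL = "http://localhost:8000"  # assumed Evaluator base URL; adjust for your deployment

# Target: what to evaluate. Here, a model served behind an OpenAI-compatible endpoint.
# All field names below are illustrative; consult the API reference for the exact schema.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "model": {
            "api_endpoint": {
                "url": "http://nim.example.com/v1/completions",  # hypothetical model endpoint
                "model_id": "meta/llama-3.1-8b-instruct",        # hypothetical model ID
            }
        },
    },
).json()

# Config: how to evaluate, that is, which benchmark or metric to run and with what parameters.
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={
        "type": "gsm8k",                   # example built-in benchmark
        "params": {"limit_samples": 100},  # run on a subset for a quick check
    },
).json()

print("target:", target.get("id"), "config:", config.get("id"))
```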
High-Level Evaluation Process#
At a high level, the evaluation process consists of the following steps:
(Optional) Prepare Custom Data: Determine whether your evaluation requires a custom dataset.
Upload your dataset files and manage Filesets (see the upload sketch below).
Run an Evaluation Job: Submit an evaluation metric job or benchmark job.
Retrieve Results: Get your evaluation results to analyze model performance.
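The following sketch illustrates the optional data-preparation step. It assumes your deployment exposes the NeMo Data Store through a Hugging Face Hub compatible endpoint; the URL, namespace, and file names are placeholders, so see the fileset management documentation for the authoritative steps.

```python
from huggingface_hub import HfApi

# Assumed Data Store endpoint and credentials; replace with your deployment's values.
DATASTORE_URL = "http://localhost:3000/v1/hf"
api = HfApi(endpoint=DATASTORE_URL, token="dummy-token")

# Create a dataset repository and upload an evaluation file (all names are placeholders).
repo_id = "my-namespace/my-eval-dataset"
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="eval_samples.jsonl",  # local file with your custom evaluation samples
    path_in_repo="eval_samples.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)
```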
Available Evaluations#
Review configurations, data formats, and result examples for each evaluation.
Standard benchmarks for code generation, safety, reasoning, and tool-calling.
Evaluate document retrieval pipelines on standard or custom datasets.
Evaluate Retrieval-Augmented Generation pipelines (retrieval plus generation).
Assess agent-based and multi-step reasoning models, including topic adherence and tool use.
Use another LLM to evaluate outputs with flexible scoring criteria. Define custom rubrics or numerical ranges (see the sketch below).
Create metrics for text similarity, exact matching, and standard NLP evaluations using Jinja2 templating.
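As an illustration of the LLM-as-a-judge option above, the snippet below sketches what a judge metric definition could look like, expressed as a Python dictionary. Every field name, the rubric text, the template variables, and the judge model are assumptions for illustration only; the supported configuration schema is documented in the evaluation reference pages.

```python
# Illustrative only: field names, template variables, and structure are assumptions,
# not the exact NeMo Evaluator schema.
llm_judge_metric = {
    "type": "llm-judge",
    "judge": {
        "model": "meta/llama-3.1-70b-instruct",  # hypothetical judge model
    },
    # Custom rubric asking the judge to score helpfulness on a 1-5 scale.
    "prompt": (
        "Rate the assistant's answer for helpfulness on a scale of 1 to 5.\n"
        "Question: {{item.question}}\n"
        "Answer: {{sample.output_text}}\n"
        "Reply with only the number."
    ),
    # Parse the judge's reply into a numerical score.
    "scores": {
        "helpfulness": {"type": "int", "parser": {"type": "regex", "pattern": r"([1-5])"}},
    },
}
```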
Run Evaluation Jobs#
Evaluation jobs execute the actual evaluation by combining targets and configs. When you submit a job, NeMo Microservices orchestrates the evaluation workflow, runs the specified metrics, and stores the results.
You can monitor jobs in real time, and jobs support various authentication methods for accessing external services. After a job completes, you can retrieve detailed results, including metrics, logs, and performance data.
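For example, submitting a job, polling its status, and retrieving results with the Python `requests` library might look like the sketch below. The endpoint paths, job payload, and status values are assumptions based on the workflow described above; see the job topics that follow and the API reference for the exact request and response shapes.

```python
import time
import requests

EVALUATOR_URL = "http://localhost:8000"  # assumed Evaluator base URL; adjust for your deployment

# Submit a job that combines a previously created target and config (IDs are placeholders).
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={"target": "eval-target-abc123", "config": "eval-config-def456"},
).json()
job_id = job["id"]

# Poll the job until it reaches a terminal state (status names are assumptions).
while True:
    status = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}").json().get("status")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)

# Retrieve detailed results (metrics, logs, and performance data) once the job finishes.
results = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results").json()
print(results)
```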
Manage Jobs#
Understand how to run evaluation jobs.
Create a job with an industry benchmark or custom benchmark.
Monitor the status and progress of a job in real time.
Get the results of your evaluation jobs.
Create a metric job.
View the NeMo Evaluator API reference.