Evaluation Concepts#

NVIDIA NeMo Evaluator is a cloud-native microservice in the NeMo ecosystem for evaluating large language models (LLMs), RAG pipelines, and AI agents at enterprise scale. The service provides automated workflows for academic benchmarks (including LM Harness, BigCode, BFCL, Safety Harness, and Simple Evals), LLM-as-a-judge scoring, and specialized metrics for RAG and agent systems.

What is NeMo Evaluator?#

NeMo Evaluator enables real-time evaluation of your LLM applications through APIs, helping developers and researchers refine and optimize LLMs for better performance and real-world applicability. The NeMo Evaluator APIs can be automated within development pipelines, enabling faster iteration without the need for live data. The service is cost-effective and well suited to pre-deployment checks and regression testing.

The development of large language models has become pivotal in shaping intelligent applications across many domains. Enterprises today have a large number of LLMs to choose from and need a rigorous, systematic evaluation framework to select the model that best suits their use case.

Evaluation Capabilities#

NVIDIA NeMo Evaluator supports evaluation of:

  • LLMs: Assess model performance using academic benchmarks, custom metrics, and LLM-as-a-Judge approaches

  • Retriever Pipelines: Measure document retrieval quality with metrics like Recall@K and NDCG@K

  • RAG Pipelines: Evaluate complete Retrieval Augmented Generation workflows including faithfulness, answer relevancy, and context precision

  • AI Agents: Test multi-step reasoning, tool use accuracy, topic adherence, and goal completion

For detailed information on evaluation types, refer to About Evaluating.

Architecture and Key Components#

NeMo Evaluator operates as a microservice within the NeMo platform and orchestrates evaluation workflows using several components:

Core Dependencies#

  • NeMo Data Store: Stores evaluation datasets, results, and artifacts

  • NeMo Entity Store: Manages evaluation targets, configs, and dataset metadata

  • PostgreSQL: Stores evaluation job metadata and status information

Optional Dependencies#

  • Milvus: Recommended for production Retriever and RAG pipeline evaluations (provides vector storage and similarity search). A local file-based fallback is available for development.

  • NIM (NVIDIA Inference Microservices): Provides model inference endpoints for evaluations

Workflow Orchestration#

When you submit an evaluation job, NeMo Evaluator:

  1. Validates the target (model/pipeline/dataset to evaluate) and config (evaluation settings)

  2. Retrieves datasets from NeMo Data Store or external sources

  3. Executes the evaluation flow with specified metrics

  4. Stores results and job artifacts back to NeMo Data Store

  5. Updates job status and metadata
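From the client's perspective, this flow reduces to submitting a job that references an existing target and config, then watching its status. The following is a minimal sketch that assumes the Evaluator API is reachable at an `EVALUATOR_URL` base address and exposes a `POST /v1/evaluation/jobs` endpoint; the exact base URL, paths, and payload fields depend on your deployment and API version.

```python
import requests

# Base URL of the Evaluator microservice (assumption: adjust for your deployment).
EVALUATOR_URL = "http://nemo-evaluator:7331"

# Submit a job that pairs an existing target with an existing config.
# The namespace and the target/config names below are placeholders.
response = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "namespace": "default",
        "target": "default/my-llm-target",
        "config": "default/my-benchmark-config",
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()

print(job["id"], job.get("status"))
```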

For deployment information, refer to NeMo Evaluator Deployment Guide.

Key Evaluation Concepts#

Understanding the following concepts is essential to using NeMo Evaluator effectively. The evaluation workflow centers around three core entities: Targets (what to evaluate), Configs (how to evaluate), and Jobs (the execution of an evaluation).
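As an illustration of how the three entities relate, the sketch below creates a model target (the *what*) and a benchmark config (the *how*) that a job can later pair together. Endpoint paths, payload fields, and the benchmark type shown are assumptions and may differ in your deployment.

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumption: service base URL

# Target: what to evaluate -- here, a model served behind an OpenAI-compatible endpoint.
target = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/targets",
    json={
        "type": "model",
        "name": "my-llm-target",
        "namespace": "default",
        "model": {
            "api_endpoint": {
                "url": "http://nim-llm:8000/v1/chat/completions",  # placeholder NIM endpoint
                "model_id": "meta/llama-3.1-8b-instruct",          # placeholder model ID
            }
        },
    },
    timeout=30,
).json()

# Config: how to evaluate -- the benchmark or metric settings to apply.
config = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={
        "type": "gsm8k",                   # placeholder benchmark type
        "name": "my-benchmark-config",
        "namespace": "default",
        "params": {"limit_samples": 100},  # placeholder parameters
    },
    timeout=30,
).json()

# A job (see the previous section) then references these two entities by
# "namespace/name", for example "default/my-llm-target" and "default/my-benchmark-config".
print(target.get("name"), config.get("name"))
```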

Job Lifecycle#

When you submit an evaluation job, it progresses through the following phases:

  1. Validation: Retrieves target and config from NeMo Entity Store, validates compatibility, and runs prechecks

  2. Compilation: Submits the evaluation as a job orchestrated by the Jobs Microservice

  3. Execution: Loads datasets, performs inference (if needed), computes metrics, and tracks progress

  4. Completion: Uploads results to NeMo Data Store and updates final job status

For detailed information on job execution and troubleshooting, refer to Metric Job Management and Benchmark Job Management.

Job Status#

Jobs progress through several states:

  • CREATED: Job created but not yet scheduled

  • PENDING: Job accepted, waiting to start

  • RUNNING: Actively executing evaluation tasks

  • COMPLETED: Successfully finished all tasks

  • FAILED: Encountered an error during execution

  • CANCELLING: Job is being cancelled

  • CANCELLED: Manually stopped by user
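COMPLETED, FAILED, and CANCELLED are terminal states. A minimal polling loop, assuming a `GET /v1/evaluation/jobs/{id}` endpoint whose response carries a `status` field (the path and field name are assumptions), might look like this:

```python
import time

import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumption: service base URL
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(job_id: str, poll_seconds: int = 30) -> dict:
    """Poll a job until it reaches a terminal state and return the final job record."""
    while True:
        job = requests.get(
            f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}", timeout=30
        ).json()
        status = job.get("status", "UNKNOWN")
        print(f"Job {job_id}: {status}")
        if status in TERMINAL_STATES:
            return job
        time.sleep(poll_seconds)
```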

Job Monitoring#

While a job runs, you can:

  • Query job status and progress percentage

  • View task-level status for multi-task evaluations

  • Access real-time logs (v2 API)

  • Track samples processed count
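As a small example of the first two points, the snippet below reads the job record once and prints its status and progress. The `status_details.progress` field and the v2 logs path are assumptions about the API shape, not a documented contract.

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumption: service base URL
job_id = "eval-abc123"                        # placeholder job ID

# Fetch the job record and report coarse progress (field names are illustrative).
job = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}", timeout=30).json()
print(job.get("status"), (job.get("status_details") or {}).get("progress"))

# Fetch execution logs -- the v2 path below is an assumption.
logs = requests.get(f"{EVALUATOR_URL}/v2/evaluation/jobs/{job_id}/logs", timeout=30)
print(logs.text[:2000])
```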

Job Results#

After completion, retrieve:

  • Metrics: Aggregated scores for each task and metric

  • Detailed Results: Per-sample outputs and scores

  • Logs: Execution logs for debugging

  • Metadata: Job configuration, timing, and resource usage
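A sketch of retrieving aggregated metrics after completion, assuming a `GET /v1/evaluation/jobs/{id}/results` endpoint that returns per-task, per-metric scores (both the path and the response shape shown are assumptions):

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator:7331"  # assumption: service base URL
job_id = "eval-abc123"                        # placeholder job ID

# Aggregated results for the job (path and response shape are assumptions).
results = requests.get(
    f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results", timeout=30
).json()

# Walk a nested tasks -> metrics structure and print one aggregate score per pair.
for task_name, task in (results.get("tasks") or {}).items():
    for metric_name, metric in (task.get("metrics") or {}).items():
        print(task_name, metric_name, metric.get("scores"))
```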

NeMo Evaluator Use Cases#

The following table shows common use cases and their corresponding documentation:

| Evaluation Focus | Use Cases | NeMo Evaluator Documentation |
|---|---|---|
| Evaluations | • What are the different evaluation options available?<br>• What evaluations should I run?<br>• I want to make a model scorecard. | |
| Data | • What does the evaluation data look like?<br>• I have my own data; how do I evaluate LLMs against it? | |

Integration with NeMo Microservices#

NeMo Evaluator integrates with other NeMo platform services to provide a complete evaluation solution:

  • NeMo Data Store: Stores datasets, evaluation results, and job artifacts

  • NeMo Entity Store: Manages metadata for datasets, models, targets, and configs

  • NIM: Provides model inference endpoints for LLM evaluations

  • NeMo Customizer: Can trigger evaluations before and after customization workflows

  • NeMo Guardrails: Evaluations can include guardrailed endpoints for safety assessments

The following diagram shows NeMo Evaluator’s interaction with other NeMo Microservices:

Evaluator Interaction with other NeMo Microservices

For information on the complete NeMo platform, refer to Overview of NeMo Microservices.