Download this tutorial as a Jupyter notebook

DPO Customization#

Learn how to use the NeMo Microservices Platform to create a DPO (Direct Preference Optimization) job using a custom dataset.

About#

DPO is an advanced fine-tuning technique for preference-based alignment. If you’re new to fine-tuning, consider starting with LoRA or Full SFT tutorials first.

Direct Preference Optimization (DPO) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the DPO paper.

DPO shares similarities with Full SFT training workflows but differs in a few key ways:

Aspect	SFT (Supervised Fine-Tuning)	DPO (Direct Preference Optimization)
Data Requirements	Labeled instruction-response pairs where the desired output is explicitly provided	Pairwise preference data, where for a given input, one response is explicitly preferred over another
Learning Objective	Directly teaches the model to generate a specific “correct” response	Directly optimizes the model to align with human preferences by maximizing the probability of preferred responses and minimizing rejected ones, without needing an explicit reward model
Alignment Focus	Aligns the model with the specific examples present in its training data	Aligns the model with broader human preferences, which can be more effective for subjective tasks or those without a single “correct” answer
Computational Efficiency	Standard fine-tuning efficiency	More computationally efficient than SFT (especially when compared to full RLHF methods) as it bypasses the need to train a separate reward model

What you can achieve with DPO:

Align with human preferences: Directly optimize your model to produce outputs that align with subjective human preferences without requiring explicit reward modeling
Refine response quality: Improve helpfulness, harmlessness, honesty, and other nuanced qualities that are easier to compare than to define
Control tone and style: Adjust the model’s communication style, verbosity, formality, and other subjective characteristics
Implement safety guardrails: Teach the model to avoid harmful or undesirable responses by training on preferred vs. rejected response pairs
Optimize subjective tasks: Excel at tasks where there are multiple acceptable answers but clear preferences exist (creative writing, dialogue, explanations)

When to choose DPO:

Subjective quality matters: Your task involves style, tone, or other qualities where there’s no single “correct” answer but clear preferences exist
You have preference data: You can collect pairwise comparisons (preferred vs. rejected responses) more easily than perfect labeled examples
Refining existing capabilities: You want to make targeted improvements to an already-trained model without major capability changes
Complex evaluation: Humans find it easier to compare which of two responses is better than to create the ideal response themselves (especially for multi-turn conversations, creative tasks, or nuanced outputs)
Robust behavior changes: You need more reliable behavior modification than prompting can provide, without the complexity of full RLHF
Lower compute than RLHF: You want human preference alignment but with simpler training that doesn’t require reinforcement learning infrastructure

When to choose SFT:

Clear correct answers: Your task has objectively correct outputs (code generation, structured data extraction, following specific formats)
High-quality examples: You have well-labeled input-output pairs that demonstrate exactly what the model should produce
Imitation learning: You want the model to closely mimic a specific style, format, or knowledge base from expert demonstrations
Foundational capabilities: You’re establishing new task-specific capabilities before fine-tuning preferences (SFT is often done before DPO)
Stable, predictable outputs: You need consistent formatting or structure that’s well-defined in your training examples
Traditional NLP tasks: Instruction following, translation, summarization, or classification where gold-standard labels exist

Prerequisites#

Before starting this tutorial, ensure you have:

Completed the Quickstart to install and deploy NeMo Microservices locally
Installed the Python SDK (included with pip install nemo-microservices)
Set up organizational entities (namespaces and projects) if you’re new to the platform

Quick Start#

1. Initialize SDK#

The SDK needs to know your NMP server URL. By default, http://localhost:8080 is used in accordance with the Quickstart guide. If NMP is running at a custom location, you can override the URL by setting the NMP_BASE_URL environment variable:

export NMP_BASE_URL=<YOUR_NMP_BASE_URL>

import os
from nemo_microservices import NeMoMicroservices, ConflictError
from nemo_microservices.types.customization import (
    CustomizationJobInputParam,
    CustomizationTargetParamParam,
    HyperparametersParam,
    DpoConfigParam
)

NMP_BASE_URL = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
sdk = NeMoMicroservices(
    base_url=NMP_BASE_URL,
    workspace="default"
)

2. Prepare Dataset#

Create your data in JSONL format - one JSON object per line. The platform auto-detects your data format. Supported dataset formats are listed below.

Flexible Data Setup:

No validation file? The platform automatically creates a 10% validation split
Multiple files? Upload to training/ or validation/ subdirectories—they’ll be automatically merged
Format detection: Your data format is auto-detected at training time

In this tutorial the following dataset directory structure will be used:

my_dataset
`-- training.jsonl
`-- validation.jsonl

Binary Preference Format#

DPO training requires preference pairs with three fields:

prompt: The input prompt (can be a string or array of message objects)
chosen: The preferred response
rejected: The less preferred response

{"prompt": [{"role": "user", "content": "What is the capital of France?"}], "chosen": "The capital of France is Paris. It is the largest city in France and serves as the country's political, economic, and cultural center.", "rejected": "I think the capital of France might be London or Paris, I'm not entirely sure."}

Tulu3 Preference Dataset Format#

This format contains complete conversation histories for both the chosen (preferred) and rejected responses.

Required fields:

chosen: Full conversation with the preferred response (list of message objects, last must be assistant)
rejected: Full conversation with the rejected response (list of message objects, last must be assistant)

{"chosen": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}], "rejected": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "I'm not sure, but I think it might be London or Paris."}]}

HelpSteer Dataset Format#

This format uses numeric preference scores to indicate which response is better. The context can be either a simple string or an array of message objects.

Required fields:

context: The input context (can be a string or array of message objects)
response1: First response option
response2: Second response option
overall_preference: Preference score where negative values mean response1 is preferred, positive values mean response2 is preferred, and 0 indicates a tie

{"context": "Explain how to use git rebase", "response1": "Git rebase is a command that rewrites commit history by moving or combining commits. Use 'git rebase main' to reapply your branch commits on top of main. This creates a linear history and avoids merge commits.", "response2": "Use git rebase to change commits. Just type git rebase and it will work.", "overall_preference": -2}

3. Create Dataset FileSet and Upload Training Data#

Install huggingface datasets package to download public nvidia/HelpSteer3 dataset if it’s not installed in your Python environment:

pip install datasets

Download nvidia/HelpSteer3 Dataset#

from pathlib import Path
from datasets import load_dataset, Dataset
ds = load_dataset("nvidia/HelpSteer3", "preference")

# Adjust these values to change the size of the training and validation sets
# The larger the datasets, the better the model will perform but longer the training will take
# For the purpose of this tutorial, we'll use a small subset of the dataset
training_size = 3000
validation_size = 300
DATASET_PATH = Path("dpo-dataset").absolute()

# Get training split and verify it's a Dataset (not IterableDataset)
train_dataset = ds["train"]
validation_dataset = ds["validation"]
assert isinstance(train_dataset, Dataset), "Expected Dataset type"
assert isinstance(validation_dataset, Dataset), "Expected Dataset type"

# Select subsets and save to JSONL files
testing_ds = train_dataset.select(range(training_size))
validation_ds = validation_dataset.select(range(validation_size))

# Create directory if it doesn't exist
os.makedirs(DATASET_PATH, exist_ok=True)

# Save subsets to JSONL files
testing_ds.to_json(f"{DATASET_PATH}/training.jsonl")
validation_ds.to_json(f"{DATASET_PATH}/validation.jsonl")

print(f"Saved training.jsonl with {len(testing_ds)} rows")
print(f"Saved validation.jsonl with {len(validation_ds)} rows")

# Create fileset to store DPO training data
DATASET_NAME = "dpo-dataset"

try:
    sdk.filesets.create(
        workspace="default",
        name=DATASET_NAME,
        description="dpo training data"
    )
    print(f"Created fileset: {DATASET_NAME}")
except ConflictError:
    print(f"Fileset '{DATASET_NAME}' already exists, continuing...")

# Upload training data files individually to ensure correct structure
sdk.filesets.fsspec.put(
    lpath=DATASET_PATH,  # Local directory with your JSONL files
    rpath=f"default/{DATASET_NAME}/",
    recursive=True
)

# Validate training data is uploaded correctly
print("Training data:")
print(sdk.filesets.list_files(name=DATASET_NAME, workspace="default").model_dump_json(indent=2))

4. Secrets Setup#

If you plan to use NGC or HuggingFace models, you’ll need to configure authentication:

NGC models (ngc:// URIs): Requires NGC API key
HuggingFace models (hf:// URIs): Requires HF token for gated/private models

Configure these as secrets in your platform. See Managing Secrets for detailed instructions.

Get your credentials to access base models:

NGC API Key (Setup → Generate API Key)
HuggingFace Token (Create token with Read access)

Quick Setup Example#

In this tutorial we are going to work with meta-llama/Llama-3.2-1B-Instruct model from HuggingFace. Ensure that you have sufficient permissions to download the model. If you cannot see the files in the meta-llama/Llama-3.2-1B-Instruct Hugging Face page, request access

HuggingFace Authentication:

For gated models (Llama, Gemma), you must provide a HuggingFace token via the token_secret parameter
Get your token from HuggingFace Settings (requires Read access)
Accept the model’s terms on the HuggingFace model page before using it. Example: meta-llama/Llama-3.2-1B-Instruct
For public models, you can omit the token_secret parameter when creating a fileset for model in the next step

# Export the HF_TOKEN and NGC_API_KEY environment variables if they are not already set
HF_TOKEN = os.getenv("HF_TOKEN")
NGC_API_KEY = os.getenv("NGC_API_KEY")


def create_or_get_secret(name: str, value: str | None, label: str):
    if not value:
        raise ValueError(f"{label} is not set")
    try:
        secret = sdk.secrets.create(
            name=name,
            workspace="default",
            data=value,
        )
        print(f"Created secret: {name}")
        return secret
    except ConflictError:
        print(f"Secret '{name}' already exists, continuing...")
        return sdk.secrets.retrieve(name=name, workspace="default")


# Create HuggingFace token secret
hf_secret = create_or_get_secret("hf-token", HF_TOKEN, "HF_TOKEN")
print("HF_TOKEN secret:")
print(hf_secret.model_dump_json(indent=2))

# Create NGC API key secret
# Uncomment the line below if you have NGC API Key and want to finetune NGC models
# ngc_api_key = create_or_get_secret("ngc-api-key", NGC_API_KEY, "NGC_API_KEY")

5. Create Base Model FileSet#

Create a fileset pointing to meta-llama/Llama-3.2-1B-Instruct model in HuggingFace that we will train with DPO. Model downloading will take place at the DPO finetuning job creation time. This step creates a pointer to the Hugging Face and does not download the model.

Note: for public models, you can omit the token_secret parameter when creating a model fileset.

# Create a fileset pointing to the desired HuggingFace model
from nemo_microservices.types.filesets import HuggingfaceStorageConfigParam

HF_REPO_ID = "meta-llama/Llama-3.2-1B-Instruct"
MODEL_NAME = "llama-3-2-1b-base"

# Ensure you have a HuggingFace token secret created
try:
    base_model = sdk.filesets.create(
        workspace="default",
        name=MODEL_NAME,
        description="Llama 3.2 1B base model from HuggingFace",
        storage=HuggingfaceStorageConfigParam(
            type="huggingface",
            # repo_id is the full model name from Hugging Face
            repo_id=HF_REPO_ID,
            repo_type="model",
            # we use the secret created in the previous step
            token_secret=hf_secret.name
        )
    )
except ConflictError as e:
    print(f"Base model fileset already exists. Skipping creation.")
    base_model = sdk.filesets.retrieve(
        workspace="default",
        name="llama-3-2-1b-base",
    )

print(f"Base model fileset: fileset://default/{base_model.name}")
print("Base model fileset files list:")
print((sdk.filesets.list_files(name="llama-3-2-1b-base", workspace="default")).model_dump_json(indent=2))

6. Create DPO Finetuning Job#

Create a customization job with an inline target referencing the base model and dataset filesets created in previous steps.

Target model_uri Format:

Currently, model_uri must reference a FileSet:

FileSet: fileset://workspace/fileset-name

Support for direct HuggingFace (hf://) and NGC (ngc://) URIs is coming soon. For now, create a fileset and upload your base model from these sources as shown in step 4.

GPU Requirements:

1B models: 1 GPU (24GB+ VRAM)
3B models: 1-2 GPUs
8B models: 2-4 GPUs
70B models: 8+ GPUs

Adjust num_gpus_per_node and tensor_parallel_size based on your model size.

import uuid
job_suffix = uuid.uuid4().hex[:4]

JOB_NAME = f"my-dpo-job-{job_suffix}"

job = sdk.customization.jobs.create(
    name=JOB_NAME,
    workspace="default",
    spec=CustomizationJobInputParam(
        target=CustomizationTargetParamParam(
            workspace="default",
            model_uri=f"fileset://default/{base_model.name}"
        ),
        dataset=f"fileset://default/{DATASET_NAME}",
        hyperparameters=HyperparametersParam(
            training_type="dpo",
            finetuning_type="all_weights",
            epochs=1,
            batch_size=16,
            learning_rate=0.00005,
            max_seq_length=4096,
            dpo=DpoConfigParam(
                ref_policy_kl_penalty=0.1
            ),
            # GPU and parallelism settings
            num_gpus_per_node=1,
            num_nodes=1,
            tensor_parallel_size=1,
            pipeline_parallel_size=1,
            micro_batch_size=1,
        )
    )
)

print(f"Job ID: {job.name}")
print(f"Output model: {job.spec.output_model}")

7. Track Training Progress#

import time
from IPython.display import clear_output

# Poll job status every 10 seconds until completed
while True:
    status = sdk.audit.jobs.get_status(
        name=job.name,
        workspace="default"
    )
    
    clear_output(wait=True)
    print(f"Job Status: {status.status}")

    # Extract training progress from nested steps structure
    step: int | None = None
    max_steps: int | None = None
    training_phase: str | None = None

    for job_step in status.steps or []:
        if job_step.name == "customization-training-job":
            for task in job_step.tasks or []:
                task_details = task.status_details or {}
                step = task_details.get("step")
                max_steps = task_details.get("max_steps")
                training_phase = task_details.get("phase")
                break
            break

    if step is not None and max_steps is not None:
        progress_pct = (step / max_steps) * 100
        print(f"Training Progress: Step {step}/{max_steps} ({progress_pct:.1f}%)")
        if training_phase:
            print(f"Training Phase: {training_phase}")
    else:
        print("Training step not started yet or progress info not available")
    
    # Exit loop when job is completed (or failed/cancelled)
    if status.status in ("completed", "failed", "cancelled"):
        print(f"\nJob finished with status: {status.status}")
        break
    
    time.sleep(10)

Interpreting DPO Training Metrics:

DPO training produces several key metrics:

Metric	Description	What to Look For
loss	Total training loss (preference_loss + sft_loss)	Should decrease over training
preference_loss	Core DPO loss measuring preference learning	Starts near ln(2) ≈ 0.693, should decrease
sft_loss	SFT regularization term (often 0 for pure DPO)	Depends on configuration
accuracy	Fraction of samples where chosen > rejected	Should increase toward 80-95%+
rewards_chosen_mean	Average implicit reward for chosen responses	Should be positive
rewards_rejected_mean	Average implicit reward for rejected responses	Should be negative

Key Indicators:

Reward Margin = rewards_chosen_mean - rewards_rejected_mean
- Should be positive and increasing
- Indicates the model is learning to distinguish preferences
Accuracy Interpretation:
- 50% = random chance (no learning)
- 66-75% = early/moderate learning
- 80%+ = good preference learning
- 95%+ = strong preference alignment

Troubleshooting:

Loss near ln(2) ≈ 0.693: Model is at random chance level, training just starting or not learning
Accuracy stuck at ~50%: Check data quality, increase learning rate, or verify preference labels
Negative reward margin: Model is learning the wrong direction—check chosen/rejected labels
Loss increasing: Learning rate too high or data quality issues

Note: Training metrics measure optimization progress, not final model quality. Always evaluate the deployed model on your specific use case.

8. Deploy Fine-Tuned Model#

Once training completes, deploy using the Deployment Management Service:

# Validate model entity exists
model_entity = sdk.models.retrieve(workspace='default', name=job.spec.output_model)
print(model_entity.model_dump_json(indent=2))

from nemo_microservices.types.inference import NIMDeploymentParam

# Create deployment config
deploy_suffix = uuid.uuid4().hex[:4]
DEPLOYMENT_CONFIG_NAME = f"dpo-model-deployment-cfg-{deploy_suffix}"
DEPLOYMENT_NAME = f"dpo-model-deployment-{deploy_suffix}"

deployment_config = sdk.inference.deployment_configs.create(
    workspace="default",
    name=DEPLOYMENT_CONFIG_NAME,
    nim_deployment=NIMDeploymentParam(
        image_name="nvcr.io/nim/nvidia/llm-nim",
        image_tag="1.13.1",
        gpu=1,
        model_name=job.spec.output_model,  # ModelEntity name from training,
        model_namespace="default",  # Workspace where ModelEntity lives
    )
)

# Deploy model using deployment_config created above
deployment = sdk.inference.deployments.create(
    workspace="default",
    name=DEPLOYMENT_NAME,
    config=deployment_config.name
)


# Check deployment status
deployment_status = sdk.inference.deployments.retrieve(
    name=deployment.name,
    workspace="default"
)

print(f"Deployment name: {deployment.name}")
print(f"Deployment status: {deployment_status.status}")

Monitor status of deployment#

import time
from IPython.display import clear_output

# Poll deployment status every 15 seconds until ready
TIMEOUT_MINUTES = 30
start_time = time.time()
timeout_seconds = TIMEOUT_MINUTES * 60

print(f"Monitoring deployment '{deployment.name}'...")
print(f"Timeout: {TIMEOUT_MINUTES} minutes\n")

while True:
    deployment_status = sdk.inference.deployments.retrieve(
        name=deployment.name,
        workspace="default"
    )
    
    elapsed = time.time() - start_time
    elapsed_min = int(elapsed // 60)
    elapsed_sec = int(elapsed % 60)
    
    clear_output(wait=True)
    print(f"Deployment: {deployment.name}")
    print(f"Status: {deployment_status.status}")
    print(f"Elapsed time: {elapsed_min}m {elapsed_sec}s")
    
    # Check if deployment is ready
    if deployment_status.status == "READY":
        print("\nDeployment is ready!")
        break
    
    # Check for failure states
    if deployment_status.status in ("FAILED", "ERROR", "TERMINATED", "LOST"):
        print(f"\nDeployment failed with status: {deployment_status.status}")
        break
    
    # Check timeout
    if elapsed > timeout_seconds:
        print(f"\nTimeout reached ({TIMEOUT_MINUTES} minutes). Deployment may still be in progress.")
        print("You can continue to check status manually or wait longer.")
        break
    
    time.sleep(15)

The deployment service automatically:

Downloads model weights from the Files service
Provisions storage (PVC) for the weights
Configures and starts the NIM container

Multi-GPU Deployment:

For larger models requiring multiple GPUs, configure parallelism with environment variables:

deployment_config = sdk.inference.deployment_configs.create(
    workspace="default",
    name="sft-model-config-multigpu",
    
    nim_deployment={
        "image_name": "nvcr.io/nim/nvidia/llm-nim",
        "image_tag": "1.13.1",
        "gpu": 2,  # Total GPUs
        "additional_envs": {
            "NIM_TENSOR_PARALLEL_SIZE": "2",  # Tensor parallelism
            "NIM_PIPELINE_PARALLEL_SIZE": "1"  # Pipeline parallelism
        }
    }
)

Single-Node Constraint: Model deployments are limited to a single node. The maximum gpu value depends on the total GPUs available on a single node in your cluster. Multi-node deployments are not supported.

GPU Parallelism#

By default, NIM uses all GPUs for tensor parallelism (TP). You can customize this behavior using the NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE environment variables.

Strategy	Description	Best For
Tensor Parallel (TP)	Splits model layers across GPUs	Lowest latency
Pipeline Parallel (PP)	Splits model depth across GPUs	Highest throughput

Formula: gpu = NIM_TENSOR_PARALLEL_SIZE × NIM_PIPELINE_PARALLEL_SIZE

Example Configurations#

Default (TP=8, PP=1) — Lowest Latency

"gpu": 8
# NIM automatically sets NIM_TENSOR_PARALLEL_SIZE=8

Balanced (TP=4, PP=2)

"gpu": 8,
"additional_envs": {
    "NIM_TENSOR_PARALLEL_SIZE": "4",
    "NIM_PIPELINE_PARALLEL_SIZE": "2"
}

Throughput Optimized (TP=2, PP=4)

"gpu": 8,
"additional_envs": {
    "NIM_TENSOR_PARALLEL_SIZE": "2",
    "NIM_PIPELINE_PARALLEL_SIZE": "4"
}

9. Evaluate Your Model#

After training, evaluate whether your model meets your requirements:

Quick Manual Evaluation#

# Wait for deployment to be ready, then test
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short email to my colleague."}
]

response = sdk.inference.gateway.provider.post(
    "v1/chat/completions",
    name=deployment.name,
    workspace="default",
    body={
        "model": f"default/{job.spec.output_model}",  # Match the model_name from deployment config
        "messages": messages,
        "temperature": 0.7,
        "max_tokens": 256
    }
)

# Display prompt and completion
print("=" * 60)
print("PROMPT")
print("=" * 60)
for msg in messages:
    print(f"[{msg['role'].upper()}]")
    print(msg["content"])
    print()

print("=" * 60)
print("COMPLETION")
print("=" * 60)
print("[ASSISTANT]")
completion = response["choices"][0]["message"]["content"]
print(completion)

Evaluation Best Practices#

Manual Evaluation (Recommended)

Test with real-world examples from your use case
Compare responses to base model and expected outputs
Verify the model exhibits desired behavior changes
Check edge cases and error handling

What to look for:

✅ Model follows your desired output format
✅ Applies domain knowledge correctly
✅ Maintains general language capabilities
✅ Avoids unwanted behaviors or biases
❌ Doesn’t hallucinate facts not in training data
❌ Doesn’t produce repetitive or nonsensical outputs

Hyperparameters#

For detailed information on all available hyperparameters, recommended values, and tuning guidance, see the Hyperparameter Reference.

Troubleshooting#

Job fails during model download:

Verify authentication secrets are configured (see Managing Secrets)
For gated HuggingFace models (Llama, Gemma), accept the license on the model page
Check the model_uri format is correct (fileset://)
Ensure you have accepted the model’s terms of service on HuggingFace
Check job status and logs: sdk.customization.jobs.retrieve(name=job.name, workspace="default")

Job fails with OOM (Out of Memory) error:

First try: Reduce micro_batch_size from 2 to 1
Still OOM: Reduce batch_size from 16 to 8
Still OOM: Reduce max_seq_length from 2048 to 1024 or 512
Last resort: Increase GPU count and use tensor_parallel_size for model sharding

Loss curves not decreasing (underfitting):

Increase training duration: epochs: 5-10 instead of 3
Adjust learning rate: Try 1e-5 to 1e-4
Add warmup: Set warmup_steps to ~10% of total training steps
Check data quality: Verify formatting, remove duplicates, ensure diversity

Training loss decreases but validation loss increases (overfitting):

Reduce epochs: Try epochs: 1-2 instead of 5+
Lower learning rate: Use 2e-5 or 1e-5
Increase dataset size and diversity
Verify train/validation split has no data leakage

Model output quality is poor despite good training metrics:

Training metrics optimize for loss, not your actual task—evaluate on real use cases
Review data quality, format, and diversity—metrics can be misleading with poor data
Try a different base model size or architecture
Adjust learning rate and batch size
Compare to baseline: Test base model to ensure fine-tuning improved performance

Deployment fails:

Verify output model exists: sdk.models.retrieve(name=job.spec.output_model, workspace="default")
Check deployment logs: sdk.inference.deployments.get_logs(name=deployment.name, workspace="default")
Ensure sufficient GPU resources available for model size
Verify NIM image tag 1.13.1 is compatible with your model

Next Steps#

Monitor training metrics in detail
Evaluate your fine-tuned model using the Evaluator service
Learn about LoRA customization for resource-efficient fine-tuning
Explore knowledge distillation to compress larger models