{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "# DPO Customization\n", "\n", "Learn how to use the NeMo Microservices Platform to create a DPO (Direct Preference Optimization) job using a custom dataset.\n", "\n", "## About\n", "\n", "DPO is an advanced fine-tuning technique for preference-based alignment. If you're new to fine-tuning, consider starting with [LoRA](./lora-customization-job) or [Full SFT](./sft-customization-job) tutorials first.\n", "\n", "Direct Preference Optimization (DPO) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the [DPO paper](https://arxiv.org/pdf/2305.18290).\n", "\n", "DPO shares similarities with Full SFT training workflows but differs in a few key ways:\n", "\n", "| Aspect | SFT (Supervised Fine-Tuning) | DPO (Direct Preference Optimization) |\n", "| --- | --- | --- |\n", "| Data Requirements | Labeled instruction-response pairs where the desired output is explicitly provided | Pairwise preference data, where for a given input, one response is explicitly preferred over another |\n", "| Learning Objective | Directly teaches the model to generate a specific \"correct\" response | Directly optimizes the model to align with human preferences by maximizing the probability of preferred responses and minimizing rejected ones, without needing an explicit reward model |\n", "| Alignment Focus | Aligns the model with the specific examples present in its training data | Aligns the model with broader human preferences, which can be more effective for subjective tasks or those without a single \"correct\" answer |\n", "| Computational Efficiency | Standard fine-tuning efficiency | More computationally efficient than SFT (especially when compared to full RLHF methods) as it bypasses the need to train a separate reward model |\n", "\n", "**What you can achieve with DPO:**\n", "- **Align with human preferences**: Directly optimize your model to produce outputs that align with subjective human preferences without requiring explicit reward modeling\n", "- **Refine response quality**: Improve helpfulness, harmlessness, honesty, and other nuanced qualities that are easier to compare than to define\n", "- **Control tone and style**: Adjust the model's communication style, verbosity, formality, and other subjective characteristics\n", "- **Implement safety guardrails**: Teach the model to avoid harmful or undesirable responses by training on preferred vs. rejected response pairs\n", "- **Optimize subjective tasks**: Excel at tasks where there are multiple acceptable answers but clear preferences exist (creative writing, dialogue, explanations)\n", "\n", "**When to choose DPO:**\n", "- **Subjective quality matters**: Your task involves style, tone, or other qualities where there's no single \"correct\" answer but clear preferences exist\n", "- **You have preference data**: You can collect pairwise comparisons (preferred vs. 
rejected responses) more easily than perfect labeled examples\n", "- **Refining existing capabilities**: You want to make targeted improvements to an already-trained model without major capability changes\n", "- **Complex evaluation**: Humans find it easier to compare which of two responses is better than to create the ideal response themselves (especially for multi-turn conversations, creative tasks, or nuanced outputs)\n", "- **Robust behavior changes**: You need more reliable behavior modification than prompting can provide, without the complexity of full RLHF\n", "- **Lower compute than RLHF**: You want human preference alignment but with simpler training that doesn't require reinforcement learning infrastructure\n", "\n", "**When to choose SFT:**\n", "- **Clear correct answers**: Your task has objectively correct outputs (code generation, structured data extraction, following specific formats)\n", "- **High-quality examples**: You have well-labeled input-output pairs that demonstrate exactly what the model should produce\n", "- **Imitation learning**: You want the model to closely mimic a specific style, format, or knowledge base from expert demonstrations\n", "- **Foundational capabilities**: You're establishing new task-specific capabilities before fine-tuning preferences (SFT is often done before DPO)\n", "- **Stable, predictable outputs**: You need consistent formatting or structure that's well-defined in your training examples\n", "- **Traditional NLP tasks**: Instruction following, translation, summarization, or classification where gold-standard labels exist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Before starting this tutorial, ensure you have:\n", "\n", "1. **Completed the [Quickstart](../../get-started/installation.md)** to install and deploy NeMo Microservices locally\n", "2. **Installed the Python SDK** (included with `pip install nemo-microservices`)\n", "3. **Set up organizational entities** (namespaces and projects) if you're new to the platform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Start\n", "\n", "### 1. Initialize SDK\n", "\n", "The SDK needs to know your NMP server URL. By default, `http://localhost:8080` is used in accordance with the [Quickstart](../../get-started/installation.md) guide. If NMP is running at a custom location, you can override the URL by setting the `NMP_BASE_URL` environment variable:\n", "\n", "```sh\n", "export NMP_BASE_URL=\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from nemo_microservices import NeMoMicroservices, ConflictError\n", "from nemo_microservices.types.customization import (\n", " CustomizationJobInputParam,\n", " CustomizationTargetParamParam,\n", " HyperparametersParam,\n", " DpoConfigParam\n", ")\n", "\n", "NMP_BASE_URL = os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n", "sdk = NeMoMicroservices(\n", " base_url=NMP_BASE_URL,\n", " workspace=\"default\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Prepare Dataset\n", "\n", "Create your data in JSONL format - one JSON object per line. The platform auto-detects your data format. 
Supported dataset formats are listed below.\n", "\n", "**Flexible Data Setup:**\n", "- **No validation file?** The platform automatically creates a 10% validation split\n", "- **Multiple files?** Upload to `training/` or `validation/` subdirectories\u2014they'll be automatically merged\n", "- **Format detection:** Your data format is auto-detected at training time\n", "\n", "In this tutorial the following dataset directory structure will be used:\n", "```\n", "my_dataset\n", "`-- training.jsonl\n", "`-- validation.jsonl\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Binary Preference Format\n", "DPO training requires preference pairs with three fields:\n", "- **`prompt`**: The input prompt (can be a string or array of message objects)\n", "- **`chosen`**: The preferred response\n", "- **`rejected`**: The less preferred response" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "language": "json" }, "outputs": [], "source": [ "{\"prompt\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"chosen\": \"The capital of France is Paris. It is the largest city in France and serves as the country's political, economic, and cultural center.\", \"rejected\": \"I think the capital of France might be London or Paris, I'm not entirely sure.\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Tulu3 Preference Dataset Format\n", "This format contains complete conversation histories for both the chosen (preferred) and rejected responses.\n", "\n", "Required fields:\n", "- **`chosen`**: Full conversation with the preferred response (list of message objects, last must be assistant)\n", "- **`rejected`**: Full conversation with the rejected response (list of message objects, last must be assistant)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "{\"chosen\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}, {\"role\": \"assistant\", \"content\": \"The capital of France is Paris.\"}], \"rejected\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}, {\"role\": \"assistant\", \"content\": \"I'm not sure, but I think it might be London or Paris.\"}]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### HelpSteer Dataset Format\n", "This format uses numeric preference scores to indicate which response is better. The context can be either a simple string or an array of message objects.\n", "\n", "Required fields:\n", "- **`context`**: The input context (can be a string or array of message objects)\n", "- **`response1`**: First response option\n", "- **`response2`**: Second response option\n", "- **`overall_preference`**: Preference score where negative values mean response1 is preferred, positive values mean response2 is preferred, and 0 indicates a tie" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "language": "json" }, "outputs": [], "source": [ "{\"context\": \"Explain how to use git rebase\", \"response1\": \"Git rebase is a command that rewrites commit history by moving or combining commits. Use 'git rebase main' to reapply your branch commits on top of main. This creates a linear history and avoids merge commits.\", \"response2\": \"Use git rebase to change commits. Just type git rebase and it will work.\", \"overall_preference\": -2}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
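Create Dataset FileSet and Upload Training Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you bring your own preference data instead of the HelpSteer3 download below, a quick local sanity check can catch malformed lines before you upload anything. This is a minimal sketch, not part of the NeMo Microservices SDK; it assumes the binary preference format (`prompt`/`chosen`/`rejected`) and the hypothetical `my_dataset/` layout shown above:\n", "\n", "```python\n", "import json\n", "\n", "# Minimal local check: every line must be valid JSON with the expected keys\n", "with open(\"my_dataset/training.jsonl\") as f:\n", "    for i, line in enumerate(f, start=1):\n", "        record = json.loads(line)\n", "        missing = {\"prompt\", \"chosen\", \"rejected\"} - record.keys()\n", "        assert not missing, f\"line {i} is missing fields: {missing}\"\n", "```\n", "\n", "Skip this check if you use the HelpSteer3 data prepared in the next cells; that data stays in its own HelpSteer format, which the platform auto-detects at training time." ] },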
{ "cell_type": "markdown", "metadata": {}, "source": [ "Install the Hugging Face `datasets` package, if it is not already installed in your Python environment, to download the public [nvidia/HelpSteer3](https://huggingface.co/datasets/nvidia/HelpSteer3) dataset:\n", "\n", "```sh\n", "pip install datasets\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download nvidia/HelpSteer3 Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from datasets import load_dataset, Dataset\n", "ds = load_dataset(\"nvidia/HelpSteer3\", \"preference\")\n", "\n", "# Adjust these values to change the size of the training and validation sets\n", "# The larger the datasets, the better the model will perform, but the longer training will take\n", "# For the purpose of this tutorial, we'll use a small subset of the dataset\n", "training_size = 3000\n", "validation_size = 300\n", "DATASET_PATH = Path(\"dpo-dataset\").absolute()\n", "\n", "# Get the training and validation splits and verify they are Dataset (not IterableDataset) objects\n", "train_dataset = ds[\"train\"]\n", "validation_dataset = ds[\"validation\"]\n", "assert isinstance(train_dataset, Dataset), \"Expected Dataset type\"\n", "assert isinstance(validation_dataset, Dataset), \"Expected Dataset type\"\n", "\n", "# Select subsets and save to JSONL files\n", "training_ds = train_dataset.select(range(training_size))\n", "validation_ds = validation_dataset.select(range(validation_size))\n", "\n", "# Create directory if it doesn't exist\n", "os.makedirs(DATASET_PATH, exist_ok=True)\n", "\n", "# Save subsets to JSONL files\n", "training_ds.to_json(f\"{DATASET_PATH}/training.jsonl\")\n", "validation_ds.to_json(f\"{DATASET_PATH}/validation.jsonl\")\n", "\n", "print(f\"Saved training.jsonl with {len(training_ds)} rows\")\n", "print(f\"Saved validation.jsonl with {len(validation_ds)} rows\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create fileset to store DPO training data\n", "DATASET_NAME = \"dpo-dataset\"\n", "\n", "try:\n", "    sdk.filesets.create(\n", "        workspace=\"default\",\n", "        name=DATASET_NAME,\n", "        description=\"dpo training data\"\n", "    )\n", "    print(f\"Created fileset: {DATASET_NAME}\")\n", "except ConflictError:\n", "    print(f\"Fileset '{DATASET_NAME}' already exists, continuing...\")\n", "\n", "# Upload training data files individually to ensure correct structure\n", "sdk.filesets.fsspec.put(\n", "    lpath=DATASET_PATH,  # Local directory with your JSONL files\n", "    rpath=f\"default/{DATASET_NAME}/\",\n", "    recursive=True\n", ")\n", "\n", "# Validate training data is uploaded correctly\n", "print(\"Training data:\")\n", "print(sdk.filesets.list_files(name=DATASET_NAME, workspace=\"default\").model_dump_json(indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Secrets Setup\n", "\n", "If you plan to use NGC or HuggingFace models, you'll need to configure authentication:\n", "\n", "- **NGC models** (`ngc://` URIs): Requires NGC API key\n", "- **HuggingFace models** (`hf://` URIs): Requires HF token for gated/private models\n", "\n", "\n", "Configure these as secrets in your platform. 
See [Managing Secrets](../../set-up/manage-secrets.md) for detailed instructions.\n", "\n", "Get your credentials to access base models:\n", "- [NGC API Key](https://ngc.nvidia.com/) (Setup \u2192 Generate API Key)\n", "- [HuggingFace Token](https://huggingface.co/settings/tokens) (Create token with Read access)\n", "\n", "\n", "---\n", "\n", "#### Quick Setup Example\n", "\n", "In this tutorial we are going to work with the [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) model from HuggingFace. Ensure that you have sufficient permissions to download the model. If you cannot see the files on the [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) Hugging Face page, request access.\n", "\n", "**HuggingFace Authentication:**\n", "- For gated models (Llama, Gemma), you must provide a HuggingFace token via the `token_secret` parameter\n", "- Get your token from [HuggingFace Settings](https://huggingface.co/settings/tokens) (requires Read access)\n", "- Accept the model's terms on the HuggingFace model page before using it. Example: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main)\n", "- For public models, you can omit the `token_secret` parameter when creating a fileset for the model in the next step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export the HF_TOKEN and NGC_API_KEY environment variables if they are not already set\n", "HF_TOKEN = os.getenv(\"HF_TOKEN\")\n", "NGC_API_KEY = os.getenv(\"NGC_API_KEY\")\n", "\n", "\n", "def create_or_get_secret(name: str, value: str | None, label: str):\n", "    if not value:\n", "        raise ValueError(f\"{label} is not set\")\n", "    try:\n", "        secret = sdk.secrets.create(\n", "            name=name,\n", "            workspace=\"default\",\n", "            data=value,\n", "        )\n", "        print(f\"Created secret: {name}\")\n", "        return secret\n", "    except ConflictError:\n", "        print(f\"Secret '{name}' already exists, continuing...\")\n", "        return sdk.secrets.retrieve(name=name, workspace=\"default\")\n", "\n", "\n", "# Create HuggingFace token secret\n", "hf_secret = create_or_get_secret(\"hf-token\", HF_TOKEN, \"HF_TOKEN\")\n", "print(\"HF_TOKEN secret:\")\n", "print(hf_secret.model_dump_json(indent=2))\n", "\n", "# Create NGC API key secret\n", "# Uncomment the line below if you have an NGC API key and want to fine-tune NGC models\n", "# ngc_api_key = create_or_get_secret(\"ngc-api-key\", NGC_API_KEY, \"NGC_API_KEY\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Create Base Model FileSet\n", "\n", "Create a fileset pointing to the [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) model on HuggingFace that we will train with DPO. The model download happens when the DPO fine-tuning job is created; this step only creates a pointer to the Hugging Face repository and does not download the model.\n", "\n", "Note: for public models, you can omit the `token_secret` parameter when creating a model fileset."
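, "\n", "For a public (non-gated) model, the same call can simply omit `token_secret`. The sketch below is illustrative only: the fileset name and repo id are hypothetical placeholders, and `HuggingfaceStorageConfigParam` is imported in the next cell:\n", "\n", "```python\n", "# Sketch only: public models do not need a token secret\n", "public_model = sdk.filesets.create(\n", "    workspace=\"default\",\n", "    name=\"my-public-model\",  # hypothetical fileset name\n", "    storage=HuggingfaceStorageConfigParam(\n", "        type=\"huggingface\",\n", "        repo_id=\"<org>/<public-model>\",  # hypothetical public repo id\n", "        repo_type=\"model\",\n", "    ),\n", ")\n", "```"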
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a fileset pointing to the desired HuggingFace model\n", "from nemo_microservices.types.filesets import HuggingfaceStorageConfigParam\n", "\n", "HF_REPO_ID = \"meta-llama/Llama-3.2-1B-Instruct\"\n", "MODEL_NAME = \"llama-3-2-1b-base\"\n", "\n", "# Ensure you have a HuggingFace token secret created\n", "try:\n", " base_model = sdk.filesets.create(\n", " workspace=\"default\",\n", " name=MODEL_NAME,\n", " description=\"Llama 3.2 1B base model from HuggingFace\",\n", " storage=HuggingfaceStorageConfigParam(\n", " type=\"huggingface\",\n", " # repo_id is the full model name from Hugging Face\n", " repo_id=HF_REPO_ID,\n", " repo_type=\"model\",\n", " # we use the secret created in the previous step\n", " token_secret=hf_secret.name\n", " )\n", " )\n", "except ConflictError as e:\n", " print(f\"Base model fileset already exists. Skipping creation.\")\n", " base_model = sdk.filesets.retrieve(\n", " workspace=\"default\",\n", " name=\"llama-3-2-1b-base\",\n", " )\n", "\n", "print(f\"Base model fileset: fileset://default/{base_model.name}\")\n", "print(\"Base model fileset files list:\")\n", "print((sdk.filesets.list_files(name=\"llama-3-2-1b-base\", workspace=\"default\")).model_dump_json(indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Create DPO Finetuning Job\n", "Create a customization job with an inline target referencing the base model and dataset filesets created in previous steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Target `model_uri` Format:**\n", "\n", "Currently, `model_uri` must reference a FileSet:\n", "- **FileSet:** `fileset://workspace/fileset-name`\n", "\n", "Support for direct HuggingFace (`hf://`) and NGC (`ngc://`) URIs is coming soon. For now, create a fileset and upload your base model from these sources as shown in step 4.\n", "\n", "**GPU Requirements:**\n", "- 1B models: 1 GPU (24GB+ VRAM)\n", "- 3B models: 1-2 GPUs \n", "- 8B models: 2-4 GPUs\n", "- 70B models: 8+ GPUs \n", "\n", "Adjust `num_gpus_per_node` and `tensor_parallel_size` based on your model size." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uuid\n", "job_suffix = uuid.uuid4().hex[:4]\n", "\n", "JOB_NAME = f\"my-dpo-job-{job_suffix}\"\n", "\n", "job = sdk.customization.jobs.create(\n", " name=JOB_NAME,\n", " workspace=\"default\",\n", " spec=CustomizationJobInputParam(\n", " target=CustomizationTargetParamParam(\n", " workspace=\"default\",\n", " model_uri=f\"fileset://default/{base_model.name}\"\n", " ),\n", " dataset=f\"fileset://default/{DATASET_NAME}\",\n", " hyperparameters=HyperparametersParam(\n", " training_type=\"dpo\",\n", " finetuning_type=\"all_weights\",\n", " epochs=1,\n", " batch_size=16,\n", " learning_rate=0.00005,\n", " max_seq_length=4096,\n", " dpo=DpoConfigParam(\n", " ref_policy_kl_penalty=0.1\n", " ),\n", " # GPU and parallelism settings\n", " num_gpus_per_node=1,\n", " num_nodes=1,\n", " tensor_parallel_size=1,\n", " pipeline_parallel_size=1,\n", " micro_batch_size=1,\n", " )\n", " )\n", ")\n", "\n", "print(f\"Job ID: {job.name}\")\n", "print(f\"Output model: {job.spec.output_model}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. 
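{ "cell_type": "markdown", "metadata": {}, "source": [ "You can re-fetch the job entity at any point to inspect its configuration and current state. This is a minimal sketch using the same retrieval call referenced in the Troubleshooting section at the end of this tutorial; the exact fields returned depend on your SDK version:\n", "\n", "```python\n", "# Optional: retrieve the job created above and print its full record\n", "job_info = sdk.customization.jobs.retrieve(name=job.name, workspace=\"default\")\n", "print(job_info.model_dump_json(indent=2))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. 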
Track Training Progress" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "from IPython.display import clear_output\n", "\n", "# Poll job status every 10 seconds until completed\n", "while True:\n", " status = sdk.audit.jobs.get_status(\n", " name=job.name,\n", " workspace=\"default\"\n", " )\n", " \n", " clear_output(wait=True)\n", " print(f\"Job Status: {status.status}\")\n", "\n", " # Extract training progress from nested steps structure\n", " step: int | None = None\n", " max_steps: int | None = None\n", " training_phase: str | None = None\n", "\n", " for job_step in status.steps or []:\n", " if job_step.name == \"customization-training-job\":\n", " for task in job_step.tasks or []:\n", " task_details = task.status_details or {}\n", " step = task_details.get(\"step\")\n", " max_steps = task_details.get(\"max_steps\")\n", " training_phase = task_details.get(\"phase\")\n", " break\n", " break\n", "\n", " if step is not None and max_steps is not None:\n", " progress_pct = (step / max_steps) * 100\n", " print(f\"Training Progress: Step {step}/{max_steps} ({progress_pct:.1f}%)\")\n", " if training_phase:\n", " print(f\"Training Phase: {training_phase}\")\n", " else:\n", " print(\"Training step not started yet or progress info not available\")\n", " \n", " # Exit loop when job is completed (or failed/cancelled)\n", " if status.status in (\"completed\", \"failed\", \"cancelled\"):\n", " print(f\"\\nJob finished with status: {status.status}\")\n", " break\n", " \n", " time.sleep(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Interpreting DPO Training Metrics:**\n", "\n", "DPO training produces several key metrics:\n", "\n", "| Metric | Description | What to Look For |\n", "|--------|-------------|------------------|\n", "| **loss** | Total training loss (preference_loss + sft_loss) | Should decrease over training |\n", "| **preference_loss** | Core DPO loss measuring preference learning | Starts near ln(2) \u2248 0.693, should decrease |\n", "| **sft_loss** | SFT regularization term (often 0 for pure DPO) | Depends on configuration |\n", "| **accuracy** | Fraction of samples where chosen > rejected | Should increase toward 80-95%+ |\n", "| **rewards_chosen_mean** | Average implicit reward for chosen responses | Should be positive |\n", "| **rewards_rejected_mean** | Average implicit reward for rejected responses | Should be negative |\n", "\n", "**Key Indicators:**\n", "\n", "- **Reward Margin** = `rewards_chosen_mean - rewards_rejected_mean`\n", " - Should be positive and increasing\n", " - Indicates the model is learning to distinguish preferences\n", "\n", "- **Accuracy Interpretation:**\n", " - 50% = random chance (no learning)\n", " - 66-75% = early/moderate learning\n", " - 80%+ = good preference learning\n", " - 95%+ = strong preference alignment\n", "\n", "**Troubleshooting:**\n", "\n", "- **Loss near ln(2) \u2248 0.693**: Model is at random chance level, training just starting or not learning\n", "- **Accuracy stuck at ~50%**: Check data quality, increase learning rate, or verify preference labels\n", "- **Negative reward margin**: Model is learning the wrong direction\u2014check chosen/rejected labels\n", "- **Loss increasing**: Learning rate too high or data quality issues\n", "\n", "**Note:** Training metrics measure optimization progress, not final model quality. Always evaluate the deployed model on your specific use case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8. 
Deploy Fine-Tuned Model\n", "\n", "Once training completes, deploy using the Deployment Management Service:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Validate model entity exists\n", "model_entity = sdk.models.retrieve(workspace='default', name=job.spec.output_model)\n", "print(model_entity.model_dump_json(indent=2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nemo_microservices.types.inference import NIMDeploymentParam\n", "\n", "# Create deployment config\n", "deploy_suffix = uuid.uuid4().hex[:4]\n", "DEPLOYMENT_CONFIG_NAME = f\"dpo-model-deployment-cfg-{deploy_suffix}\"\n", "DEPLOYMENT_NAME = f\"dpo-model-deployment-{deploy_suffix}\"\n", "\n", "deployment_config = sdk.inference.deployment_configs.create(\n", " workspace=\"default\",\n", " name=DEPLOYMENT_CONFIG_NAME,\n", " nim_deployment=NIMDeploymentParam(\n", " image_name=\"nvcr.io/nim/nvidia/llm-nim\",\n", " image_tag=\"1.13.1\",\n", " gpu=1,\n", " model_name=job.spec.output_model, # ModelEntity name from training,\n", " model_namespace=\"default\", # Workspace where ModelEntity lives\n", " )\n", ")\n", "\n", "# Deploy model using deployment_config created above\n", "deployment = sdk.inference.deployments.create(\n", " workspace=\"default\",\n", " name=DEPLOYMENT_NAME,\n", " config=deployment_config.name\n", ")\n", "\n", "\n", "# Check deployment status\n", "deployment_status = sdk.inference.deployments.retrieve(\n", " name=deployment.name,\n", " workspace=\"default\"\n", ")\n", "\n", "print(f\"Deployment name: {deployment.name}\")\n", "print(f\"Deployment status: {deployment_status.status}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor status of deployment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "from IPython.display import clear_output\n", "\n", "# Poll deployment status every 15 seconds until ready\n", "TIMEOUT_MINUTES = 30\n", "start_time = time.time()\n", "timeout_seconds = TIMEOUT_MINUTES * 60\n", "\n", "print(f\"Monitoring deployment '{deployment.name}'...\")\n", "print(f\"Timeout: {TIMEOUT_MINUTES} minutes\\n\")\n", "\n", "while True:\n", " deployment_status = sdk.inference.deployments.retrieve(\n", " name=deployment.name,\n", " workspace=\"default\"\n", " )\n", " \n", " elapsed = time.time() - start_time\n", " elapsed_min = int(elapsed // 60)\n", " elapsed_sec = int(elapsed % 60)\n", " \n", " clear_output(wait=True)\n", " print(f\"Deployment: {deployment.name}\")\n", " print(f\"Status: {deployment_status.status}\")\n", " print(f\"Elapsed time: {elapsed_min}m {elapsed_sec}s\")\n", " \n", " # Check if deployment is ready\n", " if deployment_status.status == \"READY\":\n", " print(\"\\nDeployment is ready!\")\n", " break\n", " \n", " # Check for failure states\n", " if deployment_status.status in (\"FAILED\", \"ERROR\", \"TERMINATED\", \"LOST\"):\n", " print(f\"\\nDeployment failed with status: {deployment_status.status}\")\n", " break\n", " \n", " # Check timeout\n", " if elapsed > timeout_seconds:\n", " print(f\"\\nTimeout reached ({TIMEOUT_MINUTES} minutes). 
Deployment may still be in progress.\")\n", "        print(\"You can continue to check status manually or wait longer.\")\n", "        break\n", "    \n", "    time.sleep(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The deployment service automatically:\n", "- Downloads model weights from the Files service\n", "- Provisions storage (PVC) for the weights\n", "- Configures and starts the NIM container\n", "\n", "**Multi-GPU Deployment:**\n", "\n", "For larger models requiring multiple GPUs, configure parallelism with environment variables:\n", "\n", "```python\n", "deployment_config = sdk.inference.deployment_configs.create(\n", "    workspace=\"default\",\n", "    name=\"dpo-model-config-multigpu\",\n", "\n", "    nim_deployment={\n", "        \"image_name\": \"nvcr.io/nim/nvidia/llm-nim\",\n", "        \"image_tag\": \"1.13.1\",\n", "        \"gpu\": 2,  # Total GPUs\n", "        \"additional_envs\": {\n", "            \"NIM_TENSOR_PARALLEL_SIZE\": \"2\",  # Tensor parallelism\n", "            \"NIM_PIPELINE_PARALLEL_SIZE\": \"1\"  # Pipeline parallelism\n", "        }\n", "    }\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Single-Node Constraint:** Model deployments are limited to a single node. The maximum `gpu` value depends on the total GPUs available on a single node in your cluster. Multi-node deployments are not supported.\n", "\n", "---\n", "\n", "#### GPU Parallelism\n", "\n", "By default, NIM uses all GPUs for tensor parallelism (TP). You can customize this behavior using the `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` environment variables.\n", "\n", "| Strategy | Description | Best For |\n", "|----------|-------------|----------|\n", "| **Tensor Parallel (TP)** | Splits each layer's weights across GPUs | Lowest latency |\n", "| **Pipeline Parallel (PP)** | Splits the model by layers (depth) across GPUs | Highest throughput |\n", "\n", "**Formula:** `gpu` = `NIM_TENSOR_PARALLEL_SIZE` \u00d7 `NIM_PIPELINE_PARALLEL_SIZE`\n", "\n", "---\n", "\n", "#### Example Configurations\n", "\n", "**Default (TP=8, PP=1) \u2014 Lowest Latency**\n", "```\n", "\"gpu\": 8\n", "# NIM automatically sets NIM_TENSOR_PARALLEL_SIZE=8\n", "```\n", "\n", "**Balanced (TP=4, PP=2)**\n", "```\n", "\"gpu\": 8,\n", "\"additional_envs\": {\n", "    \"NIM_TENSOR_PARALLEL_SIZE\": \"4\",\n", "    \"NIM_PIPELINE_PARALLEL_SIZE\": \"2\"\n", "}\n", "```\n", "\n", "**Throughput Optimized (TP=2, PP=4)**\n", "```\n", "\"gpu\": 8,\n", "\"additional_envs\": {\n", "    \"NIM_TENSOR_PARALLEL_SIZE\": \"2\",\n", "    \"NIM_PIPELINE_PARALLEL_SIZE\": \"4\"\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. 
Evaluate Your Model\n", "\n", "After training, evaluate whether your model meets your requirements:\n", "\n", "#### Quick Manual Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wait for deployment to be ready, then test\n", "messages = [\n", " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n", " {\"role\": \"user\", \"content\": \"Write a short email to my colleague.\"}\n", "]\n", "\n", "response = sdk.inference.gateway.provider.post(\n", " \"v1/chat/completions\",\n", " name=deployment.name,\n", " workspace=\"default\",\n", " body={\n", " \"model\": f\"default/{job.spec.output_model}\", # Match the model_name from deployment config\n", " \"messages\": messages,\n", " \"temperature\": 0.7,\n", " \"max_tokens\": 256\n", " }\n", ")\n", "\n", "# Display prompt and completion\n", "print(\"=\" * 60)\n", "print(\"PROMPT\")\n", "print(\"=\" * 60)\n", "for msg in messages:\n", " print(f\"[{msg['role'].upper()}]\")\n", " print(msg[\"content\"])\n", " print()\n", "\n", "print(\"=\" * 60)\n", "print(\"COMPLETION\")\n", "print(\"=\" * 60)\n", "print(\"[ASSISTANT]\")\n", "completion = response[\"choices\"][0][\"message\"][\"content\"]\n", "print(completion)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluation Best Practices\n", "\n", "**Manual Evaluation** (Recommended)\n", "- Test with real-world examples from your use case\n", "- Compare responses to base model and expected outputs\n", "- Verify the model exhibits desired behavior changes\n", "- Check edge cases and error handling\n", "\n", "**What to look for:**\n", "- \u2705 Model follows your desired output format\n", "- \u2705 Applies domain knowledge correctly\n", "- \u2705 Maintains general language capabilities\n", "- \u2705 Avoids unwanted behaviors or biases\n", "- \u274c Doesn't hallucinate facts not in training data\n", "- \u274c Doesn't produce repetitive or nonsensical outputs\n", "\n", "---\n", "\n", "## Hyperparameters\n", "\n", "For detailed information on all available hyperparameters, recommended values, and tuning guidance, see the [Hyperparameter Reference](../manage-customization-jobs/hyperparameters.md).\n", "\n", "---\n", "\n", "\n", "## Troubleshooting\n", "\n", "**Job fails during model download:**\n", "- Verify authentication secrets are configured (see [Managing Secrets](../../set-up/manage-secrets.md))\n", "- For gated HuggingFace models (Llama, Gemma), accept the license on the model page\n", "- Check the `model_uri` format is correct (`fileset://`)\n", "- Ensure you have accepted the model's terms of service on HuggingFace\n", "- Check job status and logs: `sdk.customization.jobs.retrieve(name=job.name, workspace=\"default\")`\n", "\n", "**Job fails with OOM (Out of Memory) error:**\n", "1. **First try:** Reduce `micro_batch_size` from 2 to 1\n", "2. **Still OOM:** Reduce `batch_size` from 16 to 8\n", "3. **Still OOM:** Reduce `max_seq_length` from 2048 to 1024 or 512\n", "4. 
**Last resort:** Increase GPU count and use `tensor_parallel_size` for model sharding\n", "\n", "**Loss curves not decreasing (underfitting):**\n", "- Increase training duration: `epochs: 5-10` instead of 3\n", "- Adjust learning rate: Try `1e-5` to `1e-4`\n", "- Add warmup: Set `warmup_steps` to ~10% of total training steps\n", "- Check data quality: Verify formatting, remove duplicates, ensure diversity\n", "\n", "**Training loss decreases but validation loss increases (overfitting):**\n", "- Reduce epochs: Try `epochs: 1-2` instead of 5+\n", "- Lower learning rate: Use `2e-5` or `1e-5`\n", "- Increase dataset size and diversity\n", "- Verify train/validation split has no data leakage\n", "\n", "**Model output quality is poor despite good training metrics:**\n", "- Training metrics optimize for loss, not your actual task\u2014evaluate on real use cases\n", "- Review data quality, format, and diversity\u2014metrics can be misleading with poor data\n", "- Try a different base model size or architecture\n", "- Adjust learning rate and batch size\n", "- Compare to baseline: Test base model to ensure fine-tuning improved performance\n", "\n", "**Deployment fails:**\n", "- Verify output model exists: `sdk.models.retrieve(name=job.spec.output_model, workspace=\"default\")`\n", "- Check deployment logs: `sdk.inference.deployments.get_logs(name=deployment.name, workspace=\"default\")`\n", "- Ensure sufficient GPU resources available for model size\n", "- Verify NIM image tag `1.13.1` is compatible with your model\n", "\n", "\n", "## Next Steps\n", "\n", "- [Monitor training metrics](fine-tune-metrics) in detail\n", "- [Evaluate your fine-tuned model](../../evaluator/index) using the Evaluator service\n", "- Learn about [LoRA customization](./lora-customization-job) for resource-efficient fine-tuning\n", "- Explore [knowledge distillation](./distillation-customization-job) to compress larger models" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 4 }