{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "# Full SFT Customization\n", "\n", "Learn how to fine-tune all model weights using supervised fine-tuning (SFT) to customize LLM behavior for your specific tasks.\n", "\n", "## About\n", "\n", "Supervised Fine-Tuning (SFT) customizes model behavior, injects new knowledge, and optimizes performance for specific domains and tasks. Full SFT modifies **all model weights** during training, providing maximum customization flexibility.\n", "\n", "**What you can achieve with SFT:**\n", "\n", "- \ud83c\udfaf **Specialize for domains:** Fine-tune models on legal texts, medical records, or financial data\n", "- \ud83d\udca1 **Inject knowledge:** Add new information not present in the base model\n", "- \ud83d\udcc8 **Improve accuracy:** Optimize for specific tasks like sentiment analysis, summarization, or code generation\n", "\n", "### SFT vs LoRA: Understanding the Trade-offs\n", "\n", "**Full SFT** trains all model parameters (e.g., all 70 billion weights in Llama 70B):\n", "\n", "- \u2705 Maximum model adaptation and knowledge injection\n", "- \u2705 Can fundamentally change model behavior\n", "- \u2705 Best for significant domain shifts or specialized tasks\n", "- \u274c Requires substantial GPU resources (4-8x more than LoRA)\n", "- \u274c Produces full model weights (~140GB for Llama 70B)\n", "- \u274c Longer training time\n", "\n", "**LoRA** trains only ~1% of weights by adding thin matrices to existing weights:\n", "\n", "- \u2705 75-95% less memory required\n", "- \u2705 Faster training (2-4x speedup)\n", "- \u2705 Produces small adapter files (~100-500MB)\n", "- \u2705 Multiple adapters can share one base model\n", "- \u274c Limited adaptation capability compared to full fine-tuning\n", "\n", "**When to choose Full SFT:**\n", "\n", "- Training small models (1B-8B) where resource cost is manageable\n", "- Need fundamental behavior changes (e.g., medical diagnosis, legal reasoning)\n", "- Injecting substantial new knowledge not in the base model\n", "\n", "**When to choose LoRA:** See the [LoRA tutorial](./lora-customization-job) for most use cases, especially with large models (70B+) or limited GPU resources." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Before starting this tutorial, ensure you have:\n", "\n", "1. **Completed the [Quickstart](../../get-started/installation.md)** to install and deploy NeMo Microservices locally\n", "2. **Installed the Python SDK** (included with `pip install nemo-microservices`)\n", "3. **Set up organizational entities** (namespaces and projects) if you're new to the platform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Start\n", "\n", "### 1. Initialize SDK\n", "\n", "The SDK needs to know your NMP server URL. By default, `http://localhost:8080` is used in accordance with the [Quickstart](../../get-started/installation.md) guide. 
If NMP is running at a custom location, you can override the URL by setting the `NMP_BASE_URL` environment variable:\n", "\n", "```sh\n", "export NMP_BASE_URL=\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from nemo_microservices import NeMoMicroservices, ConflictError\n", "\n", "NMP_BASE_URL = os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n", "sdk = NeMoMicroservices(\n", "    base_url=NMP_BASE_URL,\n", "    workspace=\"default\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Prepare Dataset\n", "\n", "Create your data in JSONL format\u2014one JSON object per line. The platform auto-detects your data format. Supported dataset formats are listed below.\n", "\n", "**Flexible Data Setup:**\n", "- **No validation file?** The platform automatically creates a 10% validation split\n", "- **Multiple files?** Upload to `training/` or `validation/` subdirectories\u2014they'll be automatically merged\n", "- **Format detection:** Your data format is auto-detected at training time\n", "\n", "This tutorial uses the following dataset directory structure:\n", "```\n", "sft-dataset\n", "|-- training.jsonl\n", "`-- validation.jsonl\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple Prompt/Completion Format\n", "The simplest format with input prompt and expected completion:\n", "- **`prompt`**: The input prompt for the model\n", "- **`completion`**: The expected output response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```json\n", "{\"prompt\": \"Write an email to confirm our hotel reservation.\", \"completion\": \"Dear Hotel Team, I am writing to confirm our reservation for two guests...\"}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Chat Format (for conversational models)\n", "For multi-turn conversations, use the messages format:\n", "- **`messages`**: List of message objects with `role` and `content` fields\n", "- Roles: `system`, `user`, `assistant`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```json\n", "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is AI?\"}, {\"role\": \"assistant\", \"content\": \"AI is...\"}]}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom Format (specify columns in job)\n", "You can use custom field names and map them during job creation:\n", "- Define your own field names\n", "- Map them to prompt/completion in the job configuration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```json\n", "{\"question\": \"What is 2+2?\", \"answer\": \"4\"}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Create Dataset FileSet and Upload Training Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the Hugging Face `datasets` package, if it is not already in your Python environment, to download the public [rajpurkar/squad](https://huggingface.co/datasets/rajpurkar/squad) dataset:\n", "\n", "```sh\n", "pip install datasets\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download rajpurkar/squad Dataset\n", "\n", "SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer is a segment of text from the corresponding passage."
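, "\n", "\n", "Each SQuAD record provides a `context` passage, a `question`, and one or more reference `answers`. The code below flattens each record into the prompt/completion format from step 2, producing JSONL lines that look roughly like this (placeholder values shown for illustration, not an actual dataset row):\n", "\n", "```json\n", "{\"prompt\": \"Context: <Wikipedia passage> Question: <question about the passage> Answer:\", \"completion\": \"<answer span copied from the passage>\"}\n", "```"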
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from datasets import load_dataset, DatasetDict\n", "import json\n", "\n", "# Load the SQuAD dataset from Hugging Face\n", "print(\"Loading dataset rajpurkar/squad\")\n", "raw_dataset = load_dataset(\"rajpurkar/squad\")\n", "if not isinstance(raw_dataset, DatasetDict):\n", " raise ValueError(\"Dataset does not contain expected splits\")\n", "\n", "print(\"Loaded dataset\")\n", "\n", "# Configuration\n", "VALIDATION_PROPORTION = 0.05\n", "SEED = 1234\n", "\n", "# For the purpose of this tutorial, we'll use a subset of the dataset\n", "# The larger the datasets, the better the model will perform but longer the training will take\n", "training_size = 3000\n", "validation_size = 300\n", "DATASET_PATH = Path(\"sft-dataset\").absolute()\n", "\n", "# Create directory if it doesn't exist\n", "os.makedirs(DATASET_PATH, exist_ok=True)\n", "\n", "# Get the train split and create a validation split from it\n", "train_set = raw_dataset.get('train')\n", "split_dataset = train_set.train_test_split(test_size=VALIDATION_PROPORTION, seed=SEED)\n", "\n", "# Select subsets for the tutorial\n", "train_ds = split_dataset['train'].select(range(min(training_size, len(split_dataset['train']))))\n", "validation_ds = split_dataset['test'].select(range(min(validation_size, len(split_dataset['test']))))\n", "\n", "# Convert SQuAD format to prompt/completion format and save to JSONL\n", "def convert_squad_to_sft_format(example):\n", " \"\"\"Convert SQuAD format to prompt/completion format for SFT training.\"\"\"\n", " prompt = f\"Context: {example['context']} Question: {example['question']} Answer:\"\n", " completion = example[\"answers\"][\"text\"][0] # Take the first answer\n", " return {\"prompt\": prompt, \"completion\": completion}\n", "\n", "# Save training data\n", "with open(f\"{DATASET_PATH}/training.jsonl\", \"w\", encoding=\"utf-8\") as f:\n", " for example in train_ds:\n", " converted = convert_squad_to_sft_format(example)\n", " f.write(json.dumps(converted) + \"\\n\")\n", "\n", "# Save validation data\n", "with open(f\"{DATASET_PATH}/validation.jsonl\", \"w\", encoding=\"utf-8\") as f:\n", " for example in validation_ds:\n", " converted = convert_squad_to_sft_format(example)\n", " f.write(json.dumps(converted) + \"\\n\")\n", "\n", "print(f\"Saved training.jsonl with {len(train_ds)} rows\")\n", "print(f\"Saved validation.jsonl with {len(validation_ds)} rows\")\n", "\n", "# Show a sample from the training data\n", "print(\"\\nSample from training data:\")\n", "with open(f\"{DATASET_PATH}/training.jsonl\", 'r') as f:\n", " first_line = f.readline()\n", " sample = json.loads(first_line)\n", " print(f\"Prompt: {sample['prompt'][:200]}...\")\n", " print(f\"Completion: {sample['completion']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create fileset to store SFT training data\n", "DATASET_NAME = \"sft-dataset\"\n", "\n", "try:\n", " sdk.filesets.create(\n", " workspace=\"default\",\n", " name=DATASET_NAME,\n", " description=\"SFT training data\"\n", " )\n", " print(f\"Created fileset: {DATASET_NAME}\")\n", "except ConflictError:\n", " print(f\"Fileset '{DATASET_NAME}' already exists, continuing...\")\n", "\n", "# Upload training data files individually to ensure correct structure\n", "sdk.filesets.fsspec.put(\n", " lpath=DATASET_PATH, # Local directory with your JSONL files\n", " rpath=f\"default/{DATASET_NAME}/\",\n", " 
recursive=True\n", ")\n", "\n", "# Validate training data is uploaded correctly\n", "print(\"Training data:\")\n", "print(sdk.filesets.list_files(name=DATASET_NAME, workspace=\"default\").model_dump_json(indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Secrets Setup\n", "\n", "If you plan to use NGC or HuggingFace models, you'll need to configure authentication:\n", "\n", "- **NGC models** (`ngc://` URIs): Requires NGC API key\n", "- **HuggingFace models** (`hf://` URIs): Requires HF token for gated/private models\n", "\n", "\n", "Configure these as secrets in your platform. See [Managing Secrets](../../set-up/manage-secrets.md) for detailed instructions.\n", "\n", "Get your credentials to access base models:\n", "- [NGC API Key](https://ngc.nvidia.com/) (Setup \u2192 Generate API Key)\n", "- [HuggingFace Token](https://huggingface.co/settings/tokens) (Create token with Read access)\n", "\n", "\n", "---\n", "\n", "#### Quick Setup Example\n", "\n", "In this tutorial we are going to work with [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) model from HuggingFace. Ensure that you have sufficient permissions to download the model. If you cannot see the files in the [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) Hugging Face page, request access\n", "\n", "**HuggingFace Authentication:**\n", "- For gated models (Llama, Gemma), you must provide a HuggingFace token via the `token_secret` parameter\n", "- Get your token from [HuggingFace Settings](https://huggingface.co/settings/tokens) (requires Read access)\n", "- Accept the model's terms on the HuggingFace model page before using it. Example: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main)\n", "- For public models, you can omit the `token_secret` parameter when creating a fileset for model in the next step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export the HF_TOKEN and NGC_API_KEY environment variables if they are not already set\n", "HF_TOKEN = os.getenv(\"HF_TOKEN\")\n", "NGC_API_KEY = os.getenv(\"NGC_API_KEY\")\n", "\n", "\n", "def create_or_get_secret(name: str, value: str | None, label: str):\n", " if not value:\n", " raise ValueError(f\"{label} is not set\")\n", " try:\n", " secret = sdk.secrets.create(\n", " name=name,\n", " workspace=\"default\",\n", " data=value,\n", " )\n", " print(f\"Created secret: {name}\")\n", " return secret\n", " except ConflictError:\n", " print(f\"Secret '{name}' already exists, continuing...\")\n", " return sdk.secrets.retrieve(name=name, workspace=\"default\")\n", "\n", "\n", "# Create HuggingFace token secret\n", "hf_secret = create_or_get_secret(\"hf-token\", HF_TOKEN, \"HF_TOKEN\")\n", "print(\"HF_TOKEN secret:\")\n", "print(hf_secret.model_dump_json(indent=2))\n", "\n", "# Create NGC API key secret\n", "# Uncomment the line below if you have NGC API Key and want to finetune NGC models\n", "# ngc_api_key = create_or_get_secret(\"ngc-api-key\", NGC_API_KEY, \"NGC_API_KEY\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Create Base Model FileSet\n", "\n", "Create a fileset pointing to [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main) model in HuggingFace that we will train with SFT. Model downloading will take place at the SFT finetuning job creation time. 
This step creates a pointer to the Hugging Face and does not download the model.\n", "\n", "Note: for public models, you can omit the `token_secret` parameter when creating a model fileset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a fileset pointing to the desired HuggingFace model\n", "from nemo_microservices.types.filesets import HuggingfaceStorageConfigParam\n", "\n", "HF_REPO_ID = \"meta-llama/Llama-3.2-1B-Instruct\"\n", "MODEL_NAME = \"llama-3-2-1b-base\"\n", "\n", "# Ensure you have a HuggingFace token secret created\n", "try:\n", " base_model = sdk.filesets.create(\n", " workspace=\"default\",\n", " name=MODEL_NAME,\n", " description=\"Llama 3.2 1B base model from HuggingFace\",\n", " storage=HuggingfaceStorageConfigParam(\n", " type=\"huggingface\",\n", " # repo_id is the full model name from Hugging Face\n", " repo_id=HF_REPO_ID,\n", " repo_type=\"model\",\n", " # we use the secret created in the previous step\n", " token_secret=hf_secret.name\n", " )\n", " )\n", "except ConflictError:\n", " print(f\"Base model fileset already exists. Skipping creation.\")\n", " base_model = sdk.filesets.retrieve(\n", " workspace=\"default\",\n", " name=\"llama-3-2-1b-base\",\n", " )\n", "\n", "print(f\"Base model fileset: fileset://default/{base_model.name}\")\n", "print(\"Base model fileset files list:\")\n", "print((sdk.filesets.list_files(name=\"llama-3-2-1b-base\", workspace=\"default\")).model_dump_json(indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Create SFT Finetuning Job\n", "Create a customization job with an inline target referencing the base model and dataset filesets created in previous steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Target `model_uri` Format:**\n", "\n", "Currently, `model_uri` must reference a FileSet:\n", "- **FileSet:** `fileset://workspace/fileset-name`\n", "\n", "Support for direct HuggingFace (`hf://`) and NGC (`ngc://`) URIs is coming soon. For now, create a fileset and upload your base model from these sources as shown in step 4.\n", "\n", "**GPU Requirements:**\n", "- 1B models: 1 GPU (24GB+ VRAM)\n", "- 3B models: 1-2 GPUs \n", "- 8B models: 2-4 GPUs\n", "- 70B models: 8+ GPUs \n", "\n", "Adjust `num_gpus_per_node` and `tensor_parallel_size` based on your model size." 
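, "\n", "\n", "Before creating the job, you can optionally confirm that both filesets are in place. Below is a minimal pre-flight sketch; it assumes the `MODEL_NAME` and `DATASET_NAME` variables from the earlier cells are still defined in your session:\n", "\n", "```python\n", "# Optional pre-flight check: confirm the base model and dataset filesets are visible\n", "for fileset_name in (MODEL_NAME, DATASET_NAME):\n", "    files = sdk.filesets.list_files(name=fileset_name, workspace=\"default\")\n", "    print(f\"fileset://default/{fileset_name}\")\n", "    print(files.model_dump_json(indent=2))\n", "```"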
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uuid\n", "from nemo_microservices.types.customization import (\n", " CustomizationJobInputParam,\n", " CustomizationTargetParamParam,\n", " HyperparametersParam,\n", ")\n", "\n", "job_suffix = uuid.uuid4().hex[:4]\n", "\n", "JOB_NAME = f\"my-sft-job-{job_suffix}\"\n", "\n", "job = sdk.customization.jobs.create(\n", " name=JOB_NAME,\n", " workspace=\"default\",\n", " spec=CustomizationJobInputParam(\n", " target=CustomizationTargetParamParam(\n", " workspace=\"default\",\n", " model_uri=f\"fileset://default/{base_model.name}\"\n", " ),\n", " dataset=f\"fileset://default/{DATASET_NAME}\",\n", " hyperparameters=HyperparametersParam(\n", " training_type=\"sft\",\n", " finetuning_type=\"all_weights\",\n", " epochs=2,\n", " batch_size=64,\n", " learning_rate=0.00005,\n", " max_seq_length=2048,\n", " # GPU and parallelism settings\n", " num_gpus_per_node=1,\n", " num_nodes=1,\n", " tensor_parallel_size=1,\n", " pipeline_parallel_size=1,\n", " micro_batch_size=1,\n", " )\n", " )\n", ")\n", "\n", "print(f\"Job ID: {job.name}\")\n", "print(f\"Output model: {job.spec.output_model}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Track Training Progress" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "from IPython.display import clear_output\n", "\n", "# Poll job status every 10 seconds until completed\n", "while True:\n", " status = sdk.audit.jobs.get_status(\n", " name=job.name,\n", " workspace=\"default\"\n", " )\n", " \n", " clear_output(wait=True)\n", " print(f\"Job Status: {status.status}\")\n", "\n", " # Extract training progress from nested steps structure\n", " step: int | None = None\n", " max_steps: int | None = None\n", " training_phase: str | None = None\n", "\n", " for job_step in status.steps or []:\n", " if job_step.name == \"customization-training-job\":\n", " for task in job_step.tasks or []:\n", " task_details = task.status_details or {}\n", " step = task_details.get(\"step\")\n", " max_steps = task_details.get(\"max_steps\")\n", " training_phase = task_details.get(\"phase\")\n", " break\n", " break\n", "\n", " if step is not None and max_steps is not None:\n", " progress_pct = (step / max_steps) * 100\n", " print(f\"Training Progress: Step {step}/{max_steps} ({progress_pct:.1f}%)\")\n", " if training_phase:\n", " print(f\"Training Phase: {training_phase}\")\n", " else:\n", " print(\"Training step not started yet or progress info not available\")\n", " \n", " # Exit loop when job is completed (or failed/cancelled)\n", " if status.status in (\"completed\", \"failed\", \"cancelled\"):\n", " print(f\"\\nJob finished with status: {status.status}\")\n", " break\n", " \n", " time.sleep(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Interpreting SFT Training Metrics:**\n", "\n", "Monitor the relationship between training and validation loss curves:\n", "\n", "| Scenario | Interpretation | Action |\n", "|----------|----------------|--------|\n", "| **Both decreasing together** | Model is learning well | Continue training |\n", "| **Training decreases, validation flat/increasing** | Overfitting | Reduce epochs, add data |\n", "| **Both flat/not decreasing** | Underfitting | Increase LR, check data |\n", "| **Sudden spikes** | Training instability | Lower learning rate |\n", "\n", "**Note:** Training metrics measure optimization progress, not final model quality. 
Always evaluate the deployed model on your specific use case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(ft-deploy-full-weight-model)=\n", "\n", "### 8. Deploy Fine-Tuned Model\n", "\n", "Once training completes, deploy using the Deployment Management Service:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Validate model entity exists\n", "model_entity = sdk.models.retrieve(workspace='default', name=job.spec.output_model)\n", "print(model_entity.model_dump_json(indent=2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nemo_microservices.types.inference import NIMDeploymentParam\n", "\n", "# Create deployment config\n", "deploy_suffix = uuid.uuid4().hex[:4]\n", "DEPLOYMENT_CONFIG_NAME = f\"sft-model-deployment-cfg-{deploy_suffix}\"\n", "DEPLOYMENT_NAME = f\"sft-model-deployment-{deploy_suffix}\"\n", "\n", "deployment_config = sdk.inference.deployment_configs.create(\n", " workspace=\"default\",\n", " name=DEPLOYMENT_CONFIG_NAME,\n", " nim_deployment=NIMDeploymentParam(\n", " image_name=\"nvcr.io/nim/nvidia/llm-nim\",\n", " image_tag=\"1.13.1\",\n", " gpu=1,\n", " model_name=job.spec.output_model, # ModelEntity name from training,\n", " model_namespace=\"default\", # Workspace where ModelEntity lives\n", " )\n", ")\n", "\n", "# Deploy model using deployment_config created above\n", "deployment = sdk.inference.deployments.create(\n", " workspace=\"default\",\n", " name=DEPLOYMENT_NAME,\n", " config=deployment_config.name\n", ")\n", "\n", "\n", "# Check deployment status\n", "deployment_status = sdk.inference.deployments.retrieve(\n", " name=deployment.name,\n", " workspace=\"default\"\n", ")\n", "\n", "print(f\"Deployment name: {deployment.name}\")\n", "print(f\"Deployment status: {deployment_status.status}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The deployment service automatically:\n", "- Downloads model weights from the Files service\n", "- Provisions storage (PVC) for the weights\n", "- Configures and starts the NIM container\n", "\n", "**Multi-GPU Deployment:**\n", "\n", "For larger models requiring multiple GPUs, configure parallelism with environment variables:\n", "\n", "```python\n", "deployment_config = sdk.inference.deployment_configs.create(\n", " workspace=\"default\",\n", " name=\"sft-model-config-multigpu\",\n", " \n", " nim_deployment={\n", " \"image_name\": \"nvcr.io/nim/nvidia/llm-nim\",\n", " \"image_tag\": \"1.13.1\",\n", " \"gpu\": 2, # Total GPUs\n", " \"additional_envs\": {\n", " \"NIM_TENSOR_PARALLEL_SIZE\": \"2\", # Tensor parallelism\n", " \"NIM_PIPELINE_PARALLEL_SIZE\": \"1\" # Pipeline parallelism\n", " }\n", " }\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Single-Node Constraint:** Model deployments are limited to a single node. The maximum `gpu` value depends on the total GPUs available on a single node in your cluster. Multi-node deployments are not supported.\n", "\n", "---\n", "\n", "#### GPU Parallelism\n", "\n", "By default, NIM uses all GPUs for tensor parallelism (TP). 
You can customize this behavior using the `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` environment variables.\n", "\n", "| Strategy | Description | Best For |\n", "|----------|-------------|----------|\n", "| **Tensor Parallel (TP)** | Splits model layers across GPUs | Lowest latency |\n", "| **Pipeline Parallel (PP)** | Splits model depth across GPUs | Highest throughput |\n", "\n", "**Formula:** `gpu` = `NIM_TENSOR_PARALLEL_SIZE` \u00d7 `NIM_PIPELINE_PARALLEL_SIZE`\n", "\n", "---\n", "\n", "#### Example Configurations\n", "\n", "**Default (TP=8, PP=1) \u2014 Lowest Latency**\n", "```\n", "\"gpu\": 8\n", "# NIM automatically sets NIM_TENSOR_PARALLEL_SIZE=8\n", "```\n", "\n", "**Balanced (TP=4, PP=2)**\n", "```\n", "\"gpu\": 8,\n", "\"additional_envs\": {\n", " \"NIM_TENSOR_PARALLEL_SIZE\": \"4\",\n", " \"NIM_PIPELINE_PARALLEL_SIZE\": \"2\"\n", "}\n", "```\n", "\n", "**Throughput Optimized (TP=2, PP=4)**\n", "```\n", "\"gpu\": 8,\n", "\"additional_envs\": {\n", " \"NIM_TENSOR_PARALLEL_SIZE\": \"2\",\n", " \"NIM_PIPELINE_PARALLEL_SIZE\": \"4\"\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Track deployment status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "from IPython.display import clear_output\n", "\n", "# Poll deployment status every 15 seconds until ready\n", "TIMEOUT_MINUTES = 30\n", "start_time = time.time()\n", "timeout_seconds = TIMEOUT_MINUTES * 60\n", "\n", "print(f\"Monitoring deployment '{deployment.name}'...\")\n", "print(f\"Timeout: {TIMEOUT_MINUTES} minutes\\n\")\n", "\n", "while True:\n", " deployment_status = sdk.inference.deployments.retrieve(\n", " name=deployment.name,\n", " workspace=\"default\"\n", " )\n", " \n", " elapsed = time.time() - start_time\n", " elapsed_min = int(elapsed // 60)\n", " elapsed_sec = int(elapsed % 60)\n", " \n", " clear_output(wait=True)\n", " print(f\"Deployment: {deployment.name}\")\n", " print(f\"Status: {deployment_status.status}\")\n", " print(f\"Elapsed time: {elapsed_min}m {elapsed_sec}s\")\n", " \n", " # Check if deployment is ready\n", " if deployment_status.status == \"READY\":\n", " print(\"\\nDeployment is ready!\")\n", " break\n", " \n", " # Check for failure states\n", " if deployment_status.status in (\"FAILED\", \"ERROR\", \"TERMINATED\", \"LOST\"):\n", " print(f\"\\nDeployment failed with status: {deployment_status.status}\")\n", " break\n", " \n", " # Check timeout\n", " if elapsed > timeout_seconds:\n", " print(f\"\\nTimeout reached ({TIMEOUT_MINUTES} minutes). Deployment may still be in progress.\")\n", " print(\"You can continue to check status manually or wait longer.\")\n", " break\n", " \n", " time.sleep(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. Evaluate Your Model\n", "\n", "After training, evaluate whether your model meets your requirements:\n", "\n", "#### Quick Manual Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wait for deployment to be ready, then test\n", "# Test the fine-tuned model with a question answering prompt\n", "context = \"The Apollo 11 mission was the first manned mission to land on the Moon. It was launched on July 16, 1969, and Neil Armstrong became the first person to walk on the lunar surface on July 20, 1969. 
Buzz Aldrin joined him shortly after, while Michael Collins remained in lunar orbit.\"\n", "question = \"Who was the first person to walk on the Moon?\"\n", "\n", "messages = [\n", "    {\"role\": \"user\", \"content\": f\"Based on the following context, answer the question.\\n\\nContext: {context}\\n\\nQuestion: {question}\"}\n", "]\n", "\n", "response = sdk.inference.gateway.provider.post(\n", "    \"v1/chat/completions\",\n", "    name=deployment.name,\n", "    workspace=\"default\",\n", "    body={\n", "        \"model\": f\"default/{job.spec.output_model}\",\n", "        \"messages\": messages,\n", "        \"temperature\": 0,\n", "        \"max_tokens\": 128\n", "    }\n", ")\n", "\n", "print(\"=\" * 60)\n", "print(\"MODEL EVALUATION\")\n", "print(\"=\" * 60)\n", "print(f\"Question: {question}\")\n", "print(\"Expected: Neil Armstrong\")\n", "print(f\"Model output: {response['choices'][0]['message']['content']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluation Best Practices\n", "\n", "**Manual Evaluation** (Recommended)\n", "- Test with real-world examples from your use case\n", "- Compare responses to the base model and to expected outputs\n", "- Verify the model exhibits desired behavior changes\n", "- Check edge cases and error handling\n", "\n", "**What to look for** (\u2705 desired behavior, \u274c red flags):\n", "- \u2705 Model follows your desired output format\n", "- \u2705 Applies domain knowledge correctly\n", "- \u2705 Maintains general language capabilities\n", "- \u2705 Avoids unwanted behaviors or biases\n", "- \u274c Hallucinates facts not present in the training data\n", "- \u274c Produces repetitive or nonsensical outputs\n", "\n", "---\n", "\n", "## Hyperparameters\n", "\n", "For detailed information on all available hyperparameters, recommended values, and tuning guidance, see the [Hyperparameter Reference](../manage-customization-jobs/hyperparameters.md).\n", "\n", "---\n", "\n", "\n", "## Troubleshooting\n", "\n", "**Job fails during model download:**\n", "- Verify authentication secrets are configured (see [Managing Secrets](../../set-up/manage-secrets.md))\n", "- For gated HuggingFace models (Llama, Gemma), accept the license on the model page\n", "- Check the `model_uri` format is correct (`fileset://`)\n", "- Ensure you have accepted the model's terms of service on HuggingFace\n", "- Check job status and logs: `sdk.customization.jobs.retrieve(name=job.name, workspace=\"default\")`\n", "\n", "**Job fails with OOM (Out of Memory) error:**\n", "1. **First try:** Set `micro_batch_size` to 1 if it is not already\n", "2. **Still OOM:** Halve `batch_size` (for example, from 64 to 32)\n", "3. **Still OOM:** Reduce `max_seq_length` from 2048 to 1024 or 512\n", "4. 
**Last resort:** Increase GPU count and use `tensor_parallel_size` for model sharding\n", "\n", "**Loss curves not decreasing (underfitting):**\n", "- Increase training duration: `epochs: 5-10` instead of 3\n", "- Adjust learning rate: Try `1e-5` to `1e-4`\n", "- Add warmup: Set `warmup_steps` to ~10% of total training steps\n", "- Check data quality: Verify formatting, remove duplicates, ensure diversity\n", "\n", "**Training loss decreases but validation loss increases (overfitting):**\n", "- Reduce epochs: Try `epochs: 1-2` instead of 5+\n", "- Lower learning rate: Use `2e-5` or `1e-5`\n", "- Increase dataset size and diversity\n", "- Verify train/validation split has no data leakage\n", "\n", "**Model output quality is poor despite good training metrics:**\n", "- Training metrics optimize for loss, not your actual task\u2014evaluate on real use cases\n", "- Review data quality, format, and diversity\u2014metrics can be misleading with poor data\n", "- Try a different base model size or architecture\n", "- Adjust learning rate and batch size\n", "- Compare to baseline: Test base model to ensure fine-tuning improved performance\n", "\n", "**Deployment fails:**\n", "- Verify output model exists: `sdk.models.retrieve(name=job.spec.output_model, workspace=\"default\")`\n", "- Check deployment logs: `sdk.inference.deployments.get_logs(name=deployment.name, workspace=\"default\")`\n", "- Ensure sufficient GPU resources available for model size\n", "- Verify NIM image tag `1.13.1` is compatible with your model\n", "\n", "\n", "## Next Steps\n", "\n", "- [Monitor training metrics](fine-tune-metrics) in detail\n", "- [Evaluate your fine-tuned model](../../evaluator/index) using the Evaluator service\n", "- Learn about [LoRA customization](./lora-customization-job) for resource-efficient fine-tuning\n", "- Explore [knowledge distillation](./distillation-customization-job) to compress larger models" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }