{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- @nemo-nb: process -->\n",
    "<!-- @nemo-nb: download -->\n",
    "\n",
    "# Embedding Model Customization\n",
    "\n",
    "Learn how to fine-tune an embedding model to improve retrieval accuracy for your specific domain.\n",
    "\n",
    "## About\n",
    "\n",
    "Embedding models convert text into dense vector representations that capture semantic meaning. Fine-tuning these models on your domain data significantly improves retrieval accuracy\u2014in RAG pipelines, this means the LLM receives more relevant context and produces better answers.\n",
    "\n",
    "**What you can achieve with embedding fine-tuning:**\n",
    "\n",
    "- \ud83c\udfaf **Domain specialization:** Adapt general embeddings for legal, medical, scientific, or financial content\n",
    "- \ud83d\udcc8 **Improved retrieval:** Achieve 6-10% better recall on domain-specific benchmarks\n",
    "- \ud83d\udd0d **Semantic understanding:** Teach the model your domain's vocabulary and relationships\n",
    "\n",
    "**Recall@5** measures the fraction of relevant documents that appear in the top 5 search results.\n",
    "\n",
    "**About the baseline:** In retrieval benchmarks like SciDocs, the pretrained model achieves ~0.159 Recall@5. After fine-tuning on scientific paper triplets, you can expect 6-10% improvement (~0.17 Recall@5).\n",
    "\n",
    "### Dataset Format for Embedding Models\n",
    "\n",
    "Embedding models require **triplet format** for contrastive learning:\n",
    "\n",
    "```json\n",
    "{\"query\": \"What is machine learning?\", \"pos_doc\": \"Machine learning is a subset of AI...\", \"neg_doc\": [\"Gardening tips for beginners...\"]}\n",
    "```\n",
    "\n",
    "- **`query`**: The search query or question\n",
    "- **`pos_doc`**: A document relevant to the query (positive example)\n",
    "- **`neg_doc`**: List of hard negatives\u2014documents that share some overlap with the query but are not actually relevant (negative example)\n",
    "\n",
    "The model learns to maximize similarity between query and positive document while minimizing similarity with negative documents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before starting this tutorial, ensure you have:\n",
    "\n",
    "1. **Completed the [Quickstart](../../get-started/installation.md)** to install and deploy NeMo Microservices locally\n",
    "2. **Installed the Python SDK** (included with `pip install nemo-microservices`)\n",
    "3. **HuggingFace token** with read access to download the SPECTER dataset (get one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens))\n",
    "4. **NGC API key** to pull NIM container images from nvcr.io (get one at [ngc.nvidia.com](https://ngc.nvidia.com/) \u2192 Setup \u2192 Generate API Key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quick Start\n",
    "\n",
    "### 1. Initialize SDK\n",
    "\n",
    "The SDK needs to know your NMP server URL. By default, `http://localhost:8080` is used in accordance with the [Quickstart](../../get-started/installation.md) guide. If NMP is running at a custom location, you can override the URL by setting the `NMP_BASE_URL` environment variable:\n",
    "\n",
    "```sh\n",
    "export NMP_BASE_URL=<YOUR_NMP_BASE_URL>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from nemo_microservices import NeMoMicroservices, ConflictError\n",
    "\n",
    "NMP_BASE_URL = os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n",
    "sdk = NeMoMicroservices(\n",
    "    base_url=NMP_BASE_URL,\n",
    "    workspace=\"default\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Establish Baseline Performance\n",
    "\n",
    "Before fine-tuning, let's establish baseline performance with the pretrained model. We'll deploy it, run a test query, and see where it struggles. After fine-tuning, we'll compare the results.\n",
    "\n",
    "**Scenario:** Searching scientific papers by meaning, not keywords.\n",
    "\n",
    "**Demo setup:**\n",
    "- **Query:** \"Conditional Random Fields\" (CRFs) - a method for sequence labeling in NLP\n",
    "- **Trap:** \"Random Forests\" shares the word \"random\" but is an unrelated tree-based algorithm\n",
    "- **Goal:** Can the model tell the difference?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "import numpy as np\n",
    "\n",
    "# Demo query and documents for baseline comparison\n",
    "DEMO_QUERY = \"Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data\"\n",
    "\n",
    "DEMO_DOCS = [\n",
    "    \"Bidirectional LSTM-CRF Models for Sequence Tagging\",   # CRF-based paper\n",
    "    \"An Introduction to Conditional Random Fields\",         # CRF tutorial  \n",
    "    \"Random Forests\",                                       # Keyword trap! Unrelated.\n",
    "    \"Neural Architectures for Named Entity Recognition\",    # Related to sequence labeling; may use CRFs\n",
    "    \"Support Vector Machines for Classification\",           # Unrelated ML method\n",
    "]\n",
    "\n",
    "DEMO_LABELS = [\"BiLSTM-CRF\", \"CRF Tutorial\", \"Random Forest\", \"NER\", \"SVM\"]\n",
    "DEMO_RELEVANT = {0, 1, 3}  # Papers actually relevant to CRFs\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n",
    "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_microservices.types.inference import NIMDeploymentParam\n",
    "\n",
    "# NGC API key is required to pull NIM images from nvcr.io\n",
    "NGC_API_KEY = os.environ.get(\"NGC_API_KEY\")\n",
    "if not NGC_API_KEY:\n",
    "    raise ValueError(\"NGC_API_KEY environment variable is required. Get one at https://ngc.nvidia.com/ \u2192 Setup \u2192 Generate API Key\")\n",
    "\n",
    "# Create NGC secret for pulling NIM images\n",
    "NGC_SECRET_NAME = \"ngc-api-key\"\n",
    "try:\n",
    "    sdk.secrets.create(name=NGC_SECRET_NAME, workspace=\"default\", data=NGC_API_KEY)\n",
    "    print(f\"Created secret: {NGC_SECRET_NAME}\")\n",
    "except ConflictError:\n",
    "    print(f\"Secret '{NGC_SECRET_NAME}' already exists, continuing...\")\n",
    "\n",
    "# Deploy base model for baseline comparison\n",
    "BASE_MODEL_HF = \"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "NIM_IMAGE = \"nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "NIM_TAG = \"1.6.0\"\n",
    "\n",
    "baseline_suffix = uuid.uuid4().hex[:4]\n",
    "BASELINE_DEPLOYMENT_CONFIG = f\"baseline-embedding-cfg-{baseline_suffix}\"\n",
    "BASELINE_DEPLOYMENT_NAME = f\"baseline-embedding-{baseline_suffix}\"\n",
    "\n",
    "print(\"Creating baseline deployment config...\")\n",
    "baseline_config = sdk.inference.deployment_configs.create(\n",
    "    workspace=\"default\",\n",
    "    name=BASELINE_DEPLOYMENT_CONFIG,\n",
    "    nim_deployment=NIMDeploymentParam(\n",
    "        image_name=NIM_IMAGE,\n",
    "        image_tag=NIM_TAG,\n",
    "        gpu=1,\n",
    "        image_pull_secret=NGC_SECRET_NAME,\n",
    "    )\n",
    ")\n",
    "\n",
    "print(\"Deploying base model...\")\n",
    "baseline_deployment = sdk.inference.deployments.create(\n",
    "    workspace=\"default\",\n",
    "    name=BASELINE_DEPLOYMENT_NAME,\n",
    "    config=baseline_config.name\n",
    ")\n",
    "print(f\"Baseline deployment: {baseline_deployment.name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from IPython.display import clear_output\n",
    "\n",
    "# Wait for baseline deployment\n",
    "TIMEOUT_MINUTES = 15\n",
    "start_time = time.time()\n",
    "\n",
    "print(f\"Waiting for baseline deployment...\")\n",
    "while True:\n",
    "    status = sdk.inference.deployments.retrieve(\n",
    "        name=BASELINE_DEPLOYMENT_NAME,\n",
    "        workspace=\"default\"\n",
    "    )\n",
    "    \n",
    "    elapsed = time.time() - start_time\n",
    "    elapsed_str = f\"{int(elapsed//60)}m {int(elapsed%60)}s\"\n",
    "    \n",
    "    clear_output(wait=True)\n",
    "    print(f\"Baseline deployment: {status.status} | {elapsed_str}\")\n",
    "    \n",
    "    if status.status == \"READY\":\n",
    "        print(\"Baseline model ready!\")\n",
    "        break\n",
    "    if status.status in (\"FAILED\", \"ERROR\", \"TERMINATED\", \"LOST\"):\n",
    "        raise RuntimeError(f\"Baseline deployment failed: {status.status}\")\n",
    "    if elapsed > TIMEOUT_MINUTES * 60:\n",
    "        raise TimeoutError(\"Baseline deployment timeout\")\n",
    "    \n",
    "    time.sleep(10)\n",
    "\n",
    "# Wait for model initialization\n",
    "time.sleep(30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run baseline ranking with the base model\n",
    "BASE_MODEL_ID = \"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "\n",
    "# Get query embedding\n",
    "query_response = sdk.inference.gateway.provider.post(\n",
    "    \"v1/embeddings\",\n",
    "    name=BASELINE_DEPLOYMENT_NAME,\n",
    "    workspace=\"default\",\n",
    "    body={\n",
    "        \"model\": BASE_MODEL_ID,\n",
    "        \"input\": [DEMO_QUERY],\n",
    "        \"input_type\": \"query\"\n",
    "    }\n",
    ")\n",
    "base_query_emb = query_response[\"data\"][0][\"embedding\"]\n",
    "\n",
    "# Get document embeddings\n",
    "doc_response = sdk.inference.gateway.provider.post(\n",
    "    \"v1/embeddings\",\n",
    "    name=BASELINE_DEPLOYMENT_NAME,\n",
    "    workspace=\"default\",\n",
    "    body={\n",
    "        \"model\": BASE_MODEL_ID,\n",
    "        \"input\": DEMO_DOCS,\n",
    "        \"input_type\": \"passage\"\n",
    "    }\n",
    ")\n",
    "base_doc_embs = [d[\"embedding\"] for d in doc_response[\"data\"]]\n",
    "\n",
    "# Calculate similarities and rank\n",
    "scores = [(i, cosine_similarity(base_query_emb, base_doc_embs[i])) for i in range(len(DEMO_DOCS))]\n",
    "BASELINE_RANKING = sorted(scores, key=lambda x: -x[1])\n",
    "\n",
    "# Display baseline results\n",
    "print(f\"Query: \\\"{DEMO_QUERY}\\\"\\n\")\n",
    "print(\"Base Model Ranking:\")\n",
    "print(\"-\" * 55)\n",
    "for rank, (idx, score) in enumerate(BASELINE_RANKING, 1):\n",
    "    marker = \" <-- relevant\" if idx in DEMO_RELEVANT else \"\"\n",
    "    print(f\"  #{rank}  [{score:.3f}]  {DEMO_LABELS[idx]}{marker}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete baseline deployment to free GPU for training\n",
    "print(\"Deleting baseline deployment to free GPU...\")\n",
    "sdk.inference.deployments.delete(name=BASELINE_DEPLOYMENT_NAME, workspace=\"default\")\n",
    "\n",
    "# Wait for deployment to be fully deleted before deleting config\n",
    "print(\"Waiting for deployment deletion...\")\n",
    "while True:\n",
    "    try:\n",
    "        status = sdk.inference.deployments.retrieve(name=BASELINE_DEPLOYMENT_NAME, workspace=\"default\")\n",
    "        if status.status == \"DELETED\":\n",
    "            break\n",
    "        print(f\"  Status: {status.status}\")\n",
    "        time.sleep(5)\n",
    "    except Exception:\n",
    "        # Deployment no longer exists\n",
    "        break\n",
    "\n",
    "# Now safe to delete the config\n",
    "sdk.inference.deployment_configs.delete(name=BASELINE_DEPLOYMENT_CONFIG, workspace=\"default\")\n",
    "print(\"GPU freed. Now let's fine-tune and see if we can improve these rankings.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Prepare Dataset\n",
    "\n",
    "We'll use the [SPECTER dataset](https://huggingface.co/datasets/embedding-data/SPECTER) from HuggingFace\u2014a collection of scientific paper triplets where papers that cite each other are considered related.\n",
    "\n",
    "**Dataset structure:**\n",
    "- ~684K scientific paper triplets (we'll use 10%)\n",
    "- Each triplet: (query paper, positive/related paper, negative/unrelated paper)\n",
    "- Papers that cite each other are marked as \"related\"\n",
    "\n",
    "In this tutorial the following dataset directory structure will be used:\n",
    "```\n",
    "embedding-dataset\n",
    "`-- training.jsonl\n",
    "`-- validation.jsonl\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages for dataset preparation\n",
    "%pip install -q datasets huggingface_hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Download and Format SPECTER Dataset\n",
    "\n",
    "The SPECTER dataset requires conversion to the triplet format required for fine-tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from datasets import load_dataset\n",
    "import json\n",
    "\n",
    "# HuggingFace token for dataset access\n",
    "HF_TOKEN = os.environ.get(\"HF_TOKEN\")\n",
    "if not HF_TOKEN:\n",
    "    raise ValueError(\"HF_TOKEN environment variable is required. Get one at https://huggingface.co/settings/tokens\")\n",
    "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
    "\n",
    "# Configuration\n",
    "DATASET_SIZE = 3000        # Number of triplets (increase for better results, max ~684K)\n",
    "VALIDATION_SPLIT = 0.05   # 5% held out for validation\n",
    "SEED = 42\n",
    "DATASET_PATH = Path(\"embedding-dataset\").absolute()\n",
    "\n",
    "# Create directory\n",
    "os.makedirs(DATASET_PATH, exist_ok=True)\n",
    "\n",
    "# Download SPECTER dataset\n",
    "print(\"Downloading SPECTER dataset...\")\n",
    "data = load_dataset(\"embedding-data/SPECTER\")[\"train\"].shuffle(seed=SEED).select(range(DATASET_SIZE))\n",
    "\n",
    "# Split into train/validation\n",
    "print(\"Splitting into train/validation...\")\n",
    "splits = data.train_test_split(test_size=VALIDATION_SPLIT, seed=SEED)\n",
    "train_data = splits[\"train\"]\n",
    "validation_data = splits[\"test\"]\n",
    "\n",
    "# Convert to triplet JSONL format\n",
    "print(\"Saving to JSONL...\")\n",
    "for name, dataset in [(\"training\", train_data), (\"validation\", validation_data)]:\n",
    "    with open(f\"{DATASET_PATH}/{name}.jsonl\", \"w\") as f:\n",
    "        for row in dataset:\n",
    "            # SPECTER format: row['set'] = [query, positive, negative]\n",
    "            triplet = {\n",
    "                \"query\": row[\"set\"][0],\n",
    "                \"pos_doc\": row[\"set\"][1],\n",
    "                \"neg_doc\": [row[\"set\"][2]]  # List of negative documents\n",
    "            }\n",
    "            f.write(json.dumps(triplet) + \"\\n\")\n",
    "\n",
    "print(f\"\\nPrepared {len(train_data):,} training, {len(validation_data):,} validation samples\")\n",
    "print(f\"\\nExample triplet:\")\n",
    "print(f\"  Query:    {train_data[0]['set'][0][:100]}...\")\n",
    "print(f\"  Positive: {train_data[0]['set'][1][:100]}...\")\n",
    "print(f\"  Negative: {train_data[0]['set'][2][:100]}...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Create Dataset FileSet and Upload Training Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create fileset to store embedding training data\n",
    "DATASET_NAME = \"embedding-dataset\"\n",
    "\n",
    "try:\n",
    "    sdk.filesets.create(\n",
    "        workspace=\"default\",\n",
    "        name=DATASET_NAME,\n",
    "        description=\"SPECTER embedding training data (scientific paper triplets)\"\n",
    "    )\n",
    "    print(f\"Created fileset: {DATASET_NAME}\")\n",
    "except ConflictError:\n",
    "    print(f\"Fileset '{DATASET_NAME}' already exists, continuing...\")\n",
    "\n",
    "# Upload training data files\n",
    "sdk.filesets.fsspec.put(\n",
    "    lpath=DATASET_PATH,\n",
    "    rpath=f\"default/{DATASET_NAME}/\",\n",
    "    recursive=True\n",
    ")\n",
    "\n",
    "# Validate upload\n",
    "print(\"\\nUploaded files:\")\n",
    "print(sdk.filesets.list_files(name=DATASET_NAME, workspace=\"default\").model_dump_json(indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Secrets Setup\n",
    "\n",
    "Configure authentication for accessing base models:\n",
    "\n",
    "- **NGC models** (`ngc://` URIs): Requires NGC API key\n",
    "- **HuggingFace models** (`hf://` URIs): Requires HF token for gated/private models\n",
    "\n",
    "Get your credentials:\n",
    "- [NGC API Key](https://ngc.nvidia.com/) (Setup \u2192 Generate API Key)\n",
    "- [HuggingFace Token](https://huggingface.co/settings/tokens) (Create token with Read access)\n",
    "\n",
    "---\n",
    "\n",
    "#### Quick Setup Example\n",
    "\n",
    "In this tutorial we'll fine-tune [nvidia/llama-3.2-nv-embedqa-1b-v2](https://huggingface.co/nvidia/llama-3.2-nv-embedqa-1b-v2), NVIDIA's embedding model optimized for question-answering and retrieval tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create secrets for model access\n",
    "# Note: NGC_API_KEY secret was already created in the baseline step (Step 2)\n",
    "HF_TOKEN = os.getenv(\"HF_TOKEN\")\n",
    "\n",
    "\n",
    "def create_or_get_secret(name: str, value: str | None, label: str):\n",
    "    if not value:\n",
    "        raise ValueError(f\"{label} is not set\")\n",
    "    try:\n",
    "        secret = sdk.secrets.create(\n",
    "            name=name,\n",
    "            workspace=\"default\",\n",
    "            data=value,\n",
    "        )\n",
    "        print(f\"Created secret: {name}\")\n",
    "        return secret\n",
    "    except ConflictError:\n",
    "        print(f\"Secret '{name}' already exists, continuing...\")\n",
    "        return sdk.secrets.retrieve(name=name, workspace=\"default\")\n",
    "\n",
    "\n",
    "# Create HuggingFace token secret (for downloading model from HF during training)\n",
    "hf_secret = create_or_get_secret(\"hf-token\", HF_TOKEN, \"HF_TOKEN\")\n",
    "print(f\"HF_TOKEN secret: {hf_secret.name}\")\n",
    "\n",
    "# NGC secret was already created in baseline step\n",
    "print(f\"NGC_API_KEY secret: {NGC_SECRET_NAME} (created in Step 2)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. Create Base Model FileSet\n",
    "\n",
    "Create a fileset pointing to the [nvidia/llama-3.2-nv-embedqa-1b-v2](https://huggingface.co/nvidia/llama-3.2-nv-embedqa-1b-v2) embedding model from HuggingFace. This creates a pointer to HuggingFace\u2014the model is downloaded at training time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_microservices.types.filesets import HuggingfaceStorageConfigParam\n",
    "\n",
    "HF_REPO_ID = \"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "MODEL_NAME = \"nv-embedqa-1b-base\"\n",
    "\n",
    "try:\n",
    "    base_model = sdk.filesets.create(\n",
    "        workspace=\"default\",\n",
    "        name=MODEL_NAME,\n",
    "        description=\"NVIDIA Llama 3.2 NV EmbedQA 1B v2 embedding model\",\n",
    "        storage=HuggingfaceStorageConfigParam(\n",
    "            type=\"huggingface\",\n",
    "            repo_id=HF_REPO_ID,\n",
    "            repo_type=\"model\",\n",
    "            token_secret=hf_secret.name\n",
    "        )\n",
    "    )\n",
    "    print(f\"Created base model fileset: {MODEL_NAME}\")\n",
    "except ConflictError:\n",
    "    print(f\"Base model fileset already exists. Skipping creation.\")\n",
    "    base_model = sdk.filesets.retrieve(\n",
    "        workspace=\"default\",\n",
    "        name=MODEL_NAME,\n",
    "    )\n",
    "\n",
    "print(f\"\\nBase model fileset: fileset://default/{base_model.name}\")\n",
    "print(\"\\nBase model files:\")\n",
    "print(sdk.filesets.list_files(name=MODEL_NAME, workspace=\"default\").model_dump_json(indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8. Create Embedding Fine-tuning Job\n",
    "\n",
    "Create a customization job to fine-tune the embedding model using contrastive learning on the SPECTER dataset.\n",
    "\n",
    "**Key hyperparameters for embedding fine-tuning:**\n",
    "- **`training_type`**: `sft` (supervised fine-tuning)\n",
    "- **`finetuning_type`**: `all_weights` for full fine-tuning, or `lora_merged` for efficient training\n",
    "- **`learning_rate`**: Lower values (1e-6 to 5e-6) work well for embedding models\n",
    "- **`batch_size`**: Larger batches improve contrastive learning (128-256 recommended)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_microservices.types.customization import (\n",
    "    CustomizationJobInputParam,\n",
    "    CustomizationTargetParamParam,\n",
    "    HyperparametersParam,\n",
    ")\n",
    "\n",
    "job_suffix = uuid.uuid4().hex[:4]\n",
    "JOB_NAME = f\"embedding-finetune-job-{job_suffix}\"\n",
    "\n",
    "# Hyperparameters optimized for embedding fine-tuning\n",
    "EPOCHS = 1\n",
    "BATCH_SIZE = 128          # Larger batches help contrastive learning\n",
    "LEARNING_RATE = 5e-6      # Lower LR for embedding models\n",
    "MAX_SEQ_LENGTH = 512      # Typical for embedding models\n",
    "\n",
    "# Note: The 'name' field must contain 'embed' for the customizer to detect this as an embedding model\n",
    "job = sdk.customization.jobs.create(\n",
    "    name=JOB_NAME,\n",
    "    workspace=\"default\",\n",
    "    spec=CustomizationJobInputParam(\n",
    "        target=CustomizationTargetParamParam(\n",
    "            workspace=\"default\",\n",
    "            name=\"nvidia/llama-3.2-nv-embedqa-1b-v2\",  # Must contain 'embed' for embedding model detection\n",
    "            model_uri=f\"fileset://default/{base_model.name}\"\n",
    "        ),\n",
    "        dataset=f\"fileset://default/{DATASET_NAME}\",\n",
    "        hyperparameters=HyperparametersParam(\n",
    "            training_type=\"sft\",\n",
    "            finetuning_type=\"all_weights\",\n",
    "            epochs=EPOCHS,\n",
    "            batch_size=BATCH_SIZE,\n",
    "            learning_rate=LEARNING_RATE,\n",
    "            max_seq_length=MAX_SEQ_LENGTH,\n",
    "            # GPU and parallelism settings\n",
    "            num_gpus_per_node=1,\n",
    "            num_nodes=1,\n",
    "            tensor_parallel_size=1,\n",
    "            pipeline_parallel_size=1,\n",
    "            micro_batch_size=1,\n",
    "        )\n",
    "    )\n",
    ")\n",
    "\n",
    "print(f\"Job ID: {job.name}\")\n",
    "print(f\"Output model: {job.spec.output_model}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9. Track Training Progress"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from IPython.display import clear_output\n",
    "\n",
    "# Poll job status every 10 seconds until completed\n",
    "while True:\n",
    "    status = sdk.audit.jobs.get_status(\n",
    "        name=job.name,\n",
    "        workspace=\"default\"\n",
    "    )\n",
    "    \n",
    "    clear_output(wait=True)\n",
    "    print(f\"Job Status: {status.status}\")\n",
    "\n",
    "    # Extract training progress from nested steps structure\n",
    "    step: int | None = None\n",
    "    max_steps: int | None = None\n",
    "    training_phase: str | None = None\n",
    "\n",
    "    for job_step in status.steps or []:\n",
    "        if job_step.name == \"customization-training-job\":\n",
    "            for task in job_step.tasks or []:\n",
    "                task_details = task.status_details or {}\n",
    "                step = task_details.get(\"step\")\n",
    "                max_steps = task_details.get(\"max_steps\")\n",
    "                training_phase = task_details.get(\"phase\")\n",
    "                break\n",
    "            break\n",
    "\n",
    "    if step is not None and max_steps is not None:\n",
    "        progress_pct = (step / max_steps) * 100\n",
    "        print(f\"Training Progress: Step {step}/{max_steps} ({progress_pct:.1f}%)\")\n",
    "        if training_phase:\n",
    "            print(f\"Training Phase: {training_phase}\")\n",
    "    else:\n",
    "        print(\"Training step not started yet or progress info not available\")\n",
    "    \n",
    "    # Exit loop when job is completed (or failed/cancelled)\n",
    "    if status.status in (\"completed\", \"failed\", \"cancelled\"):\n",
    "        print(f\"\\nJob finished with status: {status.status}\")\n",
    "        break\n",
    "    \n",
    "    time.sleep(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Interpreting Embedding Training Metrics:**\n",
    "\n",
    "Embedding models use contrastive loss\u2014lower values indicate better separation between similar and dissimilar pairs:\n",
    "\n",
    "| Scenario | Interpretation | Action |\n",
    "|----------|----------------|--------|\n",
    "| **Loss steadily decreasing** | Model learning semantic relationships | Continue training |\n",
    "| **Loss plateaus early** | May need more data or epochs | Increase dataset/epochs |\n",
    "| **Loss spikes** | Training instability | Lower learning rate |\n",
    "| **Validation loss increasing** | Overfitting | Reduce epochs, add data |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10. Deploy Fine-Tuned Embedding Model\n",
    "\n",
    "Once training completes, deploy the embedding model using the Deployment Management Service:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validate model entity exists\n",
    "model_entity = sdk.models.retrieve(workspace=\"default\", name=job.spec.output_model)\n",
    "print(model_entity.model_dump_json(indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo_microservices.types.inference import NIMDeploymentParam\n",
    "\n",
    "# Create deployment config for embedding model\n",
    "deploy_suffix = uuid.uuid4().hex[:4]\n",
    "DEPLOYMENT_CONFIG_NAME = f\"embedding-model-deployment-cfg-{deploy_suffix}\"\n",
    "DEPLOYMENT_NAME = f\"embedding-model-deployment-{deploy_suffix}\"\n",
    "\n",
    "# Embedding NIM image\n",
    "NIM_IMAGE = \"nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2\"\n",
    "NIM_TAG = \"1.6.0\"  # Update if using newer NIM release\n",
    "\n",
    "deployment_config = sdk.inference.deployment_configs.create(\n",
    "    workspace=\"default\",\n",
    "    name=DEPLOYMENT_CONFIG_NAME,\n",
    "    nim_deployment=NIMDeploymentParam(\n",
    "        image_name=NIM_IMAGE,\n",
    "        image_tag=NIM_TAG,\n",
    "        gpu=1,\n",
    "        model_name=job.spec.output_model,\n",
    "        model_namespace=\"default\",\n",
    "        image_pull_secret=NGC_SECRET_NAME,  # NGC secret created in baseline step\n",
    "    )\n",
    ")\n",
    "\n",
    "# Deploy model\n",
    "deployment = sdk.inference.deployments.create(\n",
    "    workspace=\"default\",\n",
    "    name=DEPLOYMENT_NAME,\n",
    "    config=deployment_config.name\n",
    ")\n",
    "\n",
    "print(f\"Deployment name: {deployment.name}\")\n",
    "print(f\"Deployment status: {sdk.inference.deployments.retrieve(name=deployment.name, workspace='default').status}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Track Deployment Status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from IPython.display import clear_output\n",
    "\n",
    "# Poll deployment status every 15 seconds until ready\n",
    "TIMEOUT_MINUTES = 30\n",
    "start_time = time.time()\n",
    "timeout_seconds = TIMEOUT_MINUTES * 60\n",
    "\n",
    "print(f\"Monitoring deployment '{deployment.name}'...\")\n",
    "print(f\"Timeout: {TIMEOUT_MINUTES} minutes\\n\")\n",
    "\n",
    "while True:\n",
    "    deployment_status = sdk.inference.deployments.retrieve(\n",
    "        name=deployment.name,\n",
    "        workspace=\"default\"\n",
    "    )\n",
    "    \n",
    "    elapsed = time.time() - start_time\n",
    "    elapsed_min = int(elapsed // 60)\n",
    "    elapsed_sec = int(elapsed % 60)\n",
    "    \n",
    "    clear_output(wait=True)\n",
    "    print(f\"Deployment: {deployment.name}\")\n",
    "    print(f\"Status: {deployment_status.status}\")\n",
    "    print(f\"Elapsed time: {elapsed_min}m {elapsed_sec}s\")\n",
    "    \n",
    "    # Check if deployment is ready\n",
    "    if deployment_status.status == \"READY\":\n",
    "        print(\"\\nDeployment is ready!\")\n",
    "        break\n",
    "    \n",
    "    # Check for failure states\n",
    "    if deployment_status.status in (\"FAILED\", \"ERROR\", \"TERMINATED\", \"LOST\"):\n",
    "        print(f\"\\nDeployment failed with status: {deployment_status.status}\")\n",
    "        break\n",
    "    \n",
    "    # Check timeout\n",
    "    if elapsed > timeout_seconds:\n",
    "        print(f\"\\nTimeout reached ({TIMEOUT_MINUTES} minutes). Deployment may still be in progress.\")\n",
    "        break\n",
    "    \n",
    "    time.sleep(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 11. See the Improvement\n",
    "\n",
    "Now let's run the same query against the fine-tuned model and compare to the baseline we saw earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare: same query, base model vs fine-tuned\n",
    "# Using the same DEMO_QUERY and DEMO_DOCS from the baseline test\n",
    "MODEL_ID = f\"default/{job.spec.output_model}\"\n",
    "\n",
    "# Get query embedding from fine-tuned model\n",
    "query_response = sdk.inference.gateway.provider.post(\n",
    "    \"v1/embeddings\",\n",
    "    name=deployment.name,\n",
    "    workspace=\"default\",\n",
    "    body={\n",
    "        \"model\": MODEL_ID,\n",
    "        \"input\": [DEMO_QUERY],\n",
    "        \"input_type\": \"query\"\n",
    "    }\n",
    ")\n",
    "query_embedding = query_response[\"data\"][0][\"embedding\"]\n",
    "\n",
    "# Get document embeddings from fine-tuned model\n",
    "doc_response = sdk.inference.gateway.provider.post(\n",
    "    \"v1/embeddings\",\n",
    "    name=deployment.name,\n",
    "    workspace=\"default\",\n",
    "    body={\n",
    "        \"model\": MODEL_ID,\n",
    "        \"input\": DEMO_DOCS,\n",
    "        \"input_type\": \"passage\"\n",
    "    }\n",
    ")\n",
    "doc_embeddings = [d[\"embedding\"] for d in doc_response[\"data\"]]\n",
    "\n",
    "# Calculate similarities and rank\n",
    "scores = [(i, cosine_similarity(query_embedding, doc_embeddings[i])) for i in range(len(DEMO_DOCS))]\n",
    "FINETUNED_RANKING = sorted(scores, key=lambda x: -x[1])\n",
    "\n",
    "# Display side-by-side comparison\n",
    "print(f\"Query: \\\"{DEMO_QUERY}\\\"\\n\")\n",
    "print(f\"{'Rank':<6} {'Base Model':<30} {'Fine-tuned Model':<30}\")\n",
    "print(\"-\" * 66)\n",
    "\n",
    "for rank in range(len(DEMO_DOCS)):\n",
    "    b_idx, b_score = BASELINE_RANKING[rank]\n",
    "    f_idx, f_score = FINETUNED_RANKING[rank]\n",
    "    \n",
    "    b_label = f\"{DEMO_LABELS[b_idx]} [{b_score:.3f}]\" + (\" *\" if b_idx in DEMO_RELEVANT else \"\")\n",
    "    f_label = f\"{DEMO_LABELS[f_idx]} [{f_score:.3f}]\" + (\" *\" if f_idx in DEMO_RELEVANT else \"\")\n",
    "    \n",
    "    print(f\"#{rank+1:<5} {b_label:<30} {f_label:<30}\")\n",
    "\n",
    "print(\"\\n* = relevant paper\")\n",
    "print(\"\\nThe fine-tuned model pushes 'Random Forest' down and ranks CRF papers higher.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation Best Practices\n",
    "\n",
    "**Manual Evaluation** (Recommended)\n",
    "- Test with real-world queries from your domain\n",
    "- Compare retrieval rankings before and after fine-tuning\n",
    "- Check that semantically similar items rank higher than keyword matches\n",
    "\n",
    "**What to look for:**\n",
    "- \u2705 Relevant documents consistently rank in top positions\n",
    "- \u2705 Keyword traps (like \"Random Forest\" vs \"Random Fields\") are handled correctly\n",
    "- \u2705 Domain-specific terminology is understood\n",
    "- \u274c Unrelated documents with matching keywords don't rank high\n",
    "\n",
    "**Benchmark Evaluation**\n",
    "\n",
    "For systematic evaluation, use the NeMo Evaluator service with retrieval benchmarks like SciDocs, BEIR, or MTEB. See the [Evaluator documentation](../../evaluator/index) for details.\n",
    "\n",
    "---\n",
    "\n",
    "## Hyperparameters\n",
    "\n",
    "For detailed information on all available hyperparameters, recommended values, and tuning guidance, see the [Hyperparameter Reference](../manage-customization-jobs/hyperparameters.md).\n",
    "\n",
    "**Embedding-Specific Recommendations:**\n",
    "\n",
    "| Parameter | Recommended | Notes |\n",
    "|-----------|-------------|-------|\n",
    "| `learning_rate` | 1e-6 to 5e-6 | Lower than standard SFT |\n",
    "| `batch_size` | 128-256 | Larger batches improve contrastive learning |\n",
    "| `max_seq_length` | 512 | Typical for embedding models |\n",
    "| `epochs` | 1-3 | Start small, increase if needed |\n",
    "\n",
    "---\n",
    "\n",
    "## Troubleshooting\n",
    "\n",
    "**Embeddings don't show improved retrieval:**\n",
    "- Verify dataset quality: triplets should have clear positive/negative distinctions\n",
    "- Use hard negatives: negatives should share some overlap with the query but not be relevant (easy negatives don't teach the model much)\n",
    "- Increase dataset size: 10K+ triplets recommended for meaningful improvement\n",
    "- Try more epochs: embedding models often need multiple passes\n",
    "- Lower learning rate: embedding models are sensitive to LR\n",
    "\n",
    "**Training loss not decreasing:**\n",
    "- Check triplet format: ensure `neg_doc` is a list even for single negatives\n",
    "- Verify hard negative quality: negatives should be challenging but clearly non-relevant\n",
    "- Increase batch size: contrastive learning benefits from larger batches\n",
    "\n",
    "**Deployment fails:**\n",
    "- Ensure you're using the correct NIM image for embedding models\n",
    "- Verify sufficient GPU memory for the model size\n",
    "- Check deployment logs: `sdk.inference.deployments.get_logs(name=deployment.name, workspace=\"default\")`\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "- [Monitor training metrics](fine-tune-metrics) in detail\n",
    "- [Evaluate your model](../../evaluator/index) with retrieval benchmarks\n",
    "- Integrate the fine-tuned embedding model into your RAG pipeline\n",
    "- Scale up training with the full SPECTER dataset (~684K triplets) for better results"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}