
Embedding Fine-tuning with NeMo Microservices#

Fine-tune an embedding model and improve retrieval by 6-10% in ~1 hour.

Prerequisites#

  • Platform GPU: 1 GPU allocated per job (training and deployment run on the NeMo Microservices cluster)

  • HuggingFace token: Required to download the SPECTER dataset. Get one at https://huggingface.co/settings/tokens (read access is sufficient). Set HF_TOKEN env var or enter when prompted.

Tip: Uncomment the cleanup cells at the end of the notebook to delete resources when you no longer need them.

Overview#

Use case: Adapt a general embedding model to find related scientific papers.

Fine-tuning an embedding model on your domain data improves retrieval accuracy. In a Retrieval-Augmented Generation (RAG) pipeline, this means the LLM receives more relevant context, producing better answers. For search applications, users find what they need more often.

This notebook walks through the complete workflow: fine-tune a base embedding model on scientific paper data, deploy it as a production NVIDIA Inference Microservice (NIM), and measure the improvement.

Objectives#

By the end of this notebook, you will:

  • Test the baseline model on a retrieval task

  • Fine-tune nvidia/llama-3.2-nv-embedqa-1b-v2 on 65K scientific paper triplets from the SPECTER dataset

  • Deploy the fine-tuned model as a production-ready NIM inference service

  • Compare before/after retrieval rankings on your original task

  • Measure aggregate improvement on the SciDocs benchmark: Recall@5 improves from 0.159 to ~0.17 (+6-10%)

Recall@5 measures the fraction of relevant documents that appear in the top 5 search results.
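For intuition, here is a minimal sketch of how Recall@5 is computed for a single query (the document IDs are made up for illustration; NeMo Evaluator computes the aggregate score over the full benchmark in Step 6):

# Illustrative only: Recall@5 for one query, with hypothetical document IDs
def recall_at_k(relevant_ids, retrieved_ids, k=5):
    # Fraction of relevant documents that appear in the top-k retrieved results
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# 4 relevant papers, 2 of them retrieved in the top 5 -> Recall@5 = 0.5
print(recall_at_k({"p1", "p2", "p3", "p4"}, ["p2", "x1", "p4", "x2", "x3", "p1"]))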

About the baseline: The 0.159 baseline was measured by running the same SciDocs evaluation on the pretrained model. In Step 6, you can set EVALUATE_BASELINE = True to run this evaluation yourself, provided the base model from Step 0 is still deployed (skip the cell that deletes it).

Note: Time estimates are approximate and depend on cluster configuration and GPU type(s).

# Install required packages
%pip install -q datasets huggingface_hub openai nemo-microservices ipywidgets
# Imports
import json, requests, os
import numpy as np
from time import sleep, time
from datasets import load_dataset
from getpass import getpass
from nemo_microservices import NeMoMicroservices
from huggingface_hub import HfApi
from openai import OpenAI
# Configuration
NDS_URL = "https://datastore.aire.nvidia.com"
NEMO_URL = "https://nmp.aire.nvidia.com"
NIM_URL = "https://nim.aire.nvidia.com"
EVAL_NIM_URL = "http://nemo-nim-proxy:8000"  # Internal URL for evaluation (avoids cluster SSL issues)

# Credentials - prompts if not set via env var
HF_TOKEN = os.environ.get("HF_TOKEN") or getpass("HuggingFace token (https://huggingface.co/settings/tokens): ")
NAMESPACE = os.environ.get("NAMESPACE") or input("Namespace (e.g. yourname_embedding): ")
# Initialize NeMo client
nemo = NeMoMicroservices(base_url=NEMO_URL, inference_base_url=NIM_URL)
print("NeMo client initialized")

Step 0: Identify the Opportunity#

Let’s start with a real-world scenario: searching scientific papers by meaning, not keywords.

We’ll deploy the base embedding model, run a test query, and see where it struggles. Then we’ll fine-tune on scientific paper data and measure the improvement.

# Deploy base model (~2 mins), run baseline ranking, then clean up
BASE_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"
BASE_DEPLOYMENT = f"{NAMESPACE}-baseline"
NIM_IMAGE = "nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2" # NIM container from NGC (https://www.nvidia.com/en-us/gpu-cloud/)
NIM_TAG = "1.6.0" # Update if using newer NIM release

# Check if already deployed, otherwise create
try:
    existing = nemo.deployment.model_deployments.retrieve(deployment_name=BASE_DEPLOYMENT, namespace=NAMESPACE)
    print(f"Base model already deployed (status: {existing.status_details.status})")
except Exception:  # Not deployed yet - create it
    print("Deploying base model...")
    nemo.deployment.model_deployments.create(
        name=BASE_DEPLOYMENT, namespace=NAMESPACE,
        config={"model": BASE_MODEL, "nim_deployment": {"image_name": NIM_IMAGE, "image_tag": NIM_TAG, "gpu": 1, "disable_lora_support": True}})
# Wait for base model to deploy
POLL_INTERVAL = 10
INIT_WAIT = 30

start = time()
while True:
    status = nemo.deployment.model_deployments.retrieve(deployment_name=BASE_DEPLOYMENT, namespace=NAMESPACE)
    if status.status_details.status == 'ready':
        break
    print(f"\rStatus: {status.status_details.status} | {int(time()-start)//60}m {int(time()-start)%60}s", end="")
    sleep(POLL_INTERVAL)
print(f"\nReady | {int(time()-start)//60}m")
sleep(INIT_WAIT)  # Wait for model to initialize
# Demo: searching scientific papers where keyword matching fails
# 
# Query: "Conditional Random Fields" (CRFs) - a method for sequence labeling in NLP.
# Trap: "Random Forests" shares the word "random" but is an unrelated tree-based algorithm.
# Can the model tell the difference?

DEMO_QUERY = "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data"

DEMO_DOCS = [
    "Bidirectional LSTM-CRF Models for Sequence Tagging",   # CRF-based paper
    "An Introduction to Conditional Random Fields",         # CRF tutorial  
    "Random Forests",                                       # Keyword trap! Unrelated.
    "Neural Architectures for Named Entity Recognition",    # Related to sequence labeling; may use CRFs
    "Support Vector Machines for Classification",           # Unrelated ML method
]

DEMO_LABELS = ["BiLSTM-CRF", "CRF Tutorial", "Random Forest", "NER-CRF", "SVM"]
DEMO_RELEVANT = {0, 1, 3}  # Papers actually relevant to CRFs

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Run baseline ranking
base_client = OpenAI(base_url=f"{NIM_URL}/v1", api_key="None")
query_emb = base_client.embeddings.create(input=[DEMO_QUERY], model=BASE_MODEL, extra_body={"input_type": "query"}).data[0].embedding
doc_embs = [base_client.embeddings.create(input=[d], model=BASE_MODEL, extra_body={"input_type": "passage"}).data[0].embedding for d in DEMO_DOCS]
scores = [(i, cosine_similarity(query_emb, doc_embs[i])) for i in range(len(DEMO_DOCS))]
BASELINE_RANKING = sorted(scores, key=lambda x: -x[1])

print(f"Query: \"{DEMO_QUERY}\"\n")
print("Base Model Ranking:")
print("-" * 55)
for rank, (idx, score) in enumerate(BASELINE_RANKING, 1):
    marker = " <-- relevant" if idx in DEMO_RELEVANT else ""
    print(f"  #{rank}  [{score:.3f}]  {DEMO_LABELS[idx]}{marker}")
# Delete base model deployment to free GPU for training
# (Skip this cell if you want to run baseline evaluation in Step 6)
print("Deleting base model deployment...")
nemo.deployment.model_deployments.delete(deployment_name=BASE_DEPLOYMENT, namespace=NAMESPACE)
print("GPU freed. Now let's fine-tune and see if we can improve these rankings.")

Step 1: Prepare Data#

Download 10% of the SPECTER dataset, which contains ~684K (query, positive, negative) scientific paper triplets in total, and format the sample for embedding fine-tuning.

Dataset format: Each triplet teaches the model via contrastive learning to maximize similarity between query and positive document while minimizing similarity between query and negative document.
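For reference, each line of the JSONL files written in the next cell uses the Customizer embedding format: a query string, one positive document, and a list of negative documents (the titles below are made up for illustration):

# Example of a single JSONL record in the Customizer embedding format (illustrative titles)
example = {
    "query": "Conditional Random Fields for sequence labeling",
    "pos_doc": "An Introduction to Conditional Random Fields",
    "neg_doc": ["Random Forests"],
}
print(json.dumps(example))  # json is imported at the top of the notebook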

# Download and prepare training data
DATASET_SIZE = 68400      # 10% of full dataset (684K triplets) - increase for better results
VALIDATION_SPLIT = 0.05   # 5% held out for validation

print("Downloading SPECTER dataset...")
os.environ["HF_TOKEN"] = HF_TOKEN
data = load_dataset("embedding-data/SPECTER")['train'].shuffle(seed=42).select(range(DATASET_SIZE))

print("Splitting into train/validation...")
splits = data.train_test_split(test_size=VALIDATION_SPLIT, seed=42)
train_data = splits['train']
validation_data = splits['test']

# Save as JSONL (required format for Customizer)
# More details: https://docs.nvidia.com/nemo/microservices/latest/fine-tune/tutorials/format-training-dataset.html
print("Saving to JSONL...")
os.makedirs("data", exist_ok=True)
for name, dataset in [("training", train_data), ("validation", validation_data)]:
    with open(f"data/{name}.jsonl", "w") as f:
        for row in dataset:
            f.write(json.dumps({"query": row['set'][0], "pos_doc": row['set'][1], "neg_doc": [row['set'][2]]}) + "\n")

print(f"Prepared {len(train_data):,} training, {len(validation_data):,} validation samples")
print(f"\nExample triplet:")
print(f"  Query:    {train_data[0]['set'][0][:100]}...")
print(f"  Positive: {train_data[0]['set'][1][:100]}...")
print(f"  Negative: {train_data[0]['set'][2][:100]}...")

Step 2: Upload to NeMo Data Store#

NeMo Data Store holds datasets for training and evaluation. It exposes a HuggingFace-compatible API, so you can use familiar huggingface_hub methods - just pointed at a different endpoint.

print("Creating namespace...")
nemo.namespaces.create(id=NAMESPACE)  # For job management
requests.post(f"{NDS_URL}/v1/datastore/namespaces", data={"namespace": NAMESPACE})  # NeMo Data Store namespace (separate service)

print("Creating repository...")
hf = HfApi(endpoint=f"{NDS_URL}/v1/hf", token=None)
hf.create_repo(f"{NAMESPACE}/data", repo_type='dataset')

print("Uploading files...")
hf.upload_file(path_or_fileobj="data/training.jsonl", path_in_repo="training/training.jsonl", repo_id=f"{NAMESPACE}/data", repo_type='dataset')
hf.upload_file(path_or_fileobj="data/validation.jsonl", path_in_repo="validation/validation.jsonl", repo_id=f"{NAMESPACE}/data", repo_type='dataset')

print("Registering dataset...")
nemo.datasets.create(name="data", namespace=NAMESPACE, files_url=f"hf://datasets/{NAMESPACE}/data")
print("Upload complete")

Step 3: Train Model#

Fine-tune using supervised contrastive learning (model learns to pull query-positive pairs closer while pushing query-negative pairs apart).
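Conceptually, the objective behaves like the sketch below (illustrative only, not the Customizer's actual implementation): for each triplet, the loss shrinks as the query-positive similarity grows relative to the query-negative similarity.

# Illustrative only: a contrastive (InfoNCE-style) loss over one (query, positive, negative) triplet
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    # Treat the positive as the "correct class" among {positive, negative}:
    # the loss is minimized by raising sim(query, positive) and lowering sim(query, negative)
    sims = np.array([cos(q_emb, pos_emb), cos(q_emb, neg_emb)]) / temperature
    return -sims[0] + np.log(np.sum(np.exp(sims)))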

Config vs Job: A config defines the training template (base model, GPU settings). A job runs training with that config + dataset + hyperparameters.

BASE_MODEL = "nvidia/llama-3.2-nv-embedqa-1b@v2"
NUM_GPUS = 1
MICRO_BATCH_SIZE = 8
MAX_SEQ_LENGTH = 2048

print("Creating config...")
nemo.customization.configs.create(
    name="embedding-config@v1", 
    namespace=NAMESPACE, 
    target=TARGET_MODEL,
    training_options=[{
        "training_type": "sft",
        "finetuning_type": "all_weights",
        "num_gpus": NUM_GPUS,
        "micro_batch_size": MICRO_BATCH_SIZE
    }], 
    max_seq_length=MAX_SEQ_LENGTH)
# Start training job (~25 min)
EPOCHS = 1
BATCH_SIZE = 256
LEARNING_RATE = 5e-6

print("Starting training...")
training_job = nemo.customization.jobs.create(
    name="embedding-training", 
    config=f"{NAMESPACE}/embedding-config@v1", 
    dataset={"namespace": NAMESPACE, "name": "data"},
    hyperparameters={
        "finetuning_type": "all_weights",
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "learning_rate": LEARNING_RATE
    }, 
    output_model=f"{NAMESPACE}/embedding-model")

print(f"Job ID: {training_job.id}")
print(f"Studio link: {NEMO_URL}/studio/projects/_/_/customizations/{training_job.id}")
POLL_INTERVAL = 10
start = time()
last_step = -1

while True:
    status = nemo.customization.jobs.retrieve(training_job.id)
    if status.status not in ["pending", "created", "running"]:
        break
    
    d = status.status_details
    elapsed = int(time() - start)
    elapsed_str = f"{elapsed//60}m {elapsed%60}s"
    
    if d.epochs_completed >= 1:
        print(f"\rSaving model... | {elapsed_str}", end="")
    elif d.metrics and d.metrics.metrics.train_loss:
        step = d.metrics.metrics.train_loss[-1].step
        if step != last_step:
            if last_step == -1: print()
            loss = d.metrics.metrics.train_loss[-1].value
            pct = int(step / d.steps_per_epoch * 100) if d.steps_per_epoch else 0
            print(f"{pct:3d}% | Step {step} | Loss: {loss:.4f} | {elapsed_str}")
            last_step = step
    else:
        print(f"\r{status.status.capitalize()}... | {elapsed_str}", end="")
    sleep(POLL_INTERVAL)

print(f"\n\nTraining complete | {int(time() - start)//60}m")

Step 4: Deploy Model#

NeMo Deployment serves your fine-tuned model as a NIM (NVIDIA Inference Microservice). Once deployed, you can query it via the standard OpenAI-compatible embeddings API.

# Deploy as NIM (~5 min)
DEPLOYMENT_NAME = f"{NAMESPACE}-embedding"
NIM_IMAGE = "nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2"  # NIM container from NGC (https://www.nvidia.com/en-us/gpu-cloud/)
NIM_IMAGE_TAG = "1.6.0"  # Update if using newer NIM release
DEPLOYMENT_GPUS = 1

print("Deploying model...")
try:
    existing = nemo.deployment.model_deployments.retrieve(deployment_name=DEPLOYMENT_NAME, namespace=NAMESPACE)
    print(f"Deployment exists (status: {existing.status_details.status})")
except Exception:  # Not deployed yet - create it
    nemo.deployment.model_deployments.create(
        name=DEPLOYMENT_NAME,
        namespace=NAMESPACE,
        config={
            "model": f"{NAMESPACE}/embedding-model@{training_job.id}",
            "nim_deployment": {
                "image_name": NIM_IMAGE,
                "image_tag": NIM_IMAGE_TAG,
                "gpu": DEPLOYMENT_GPUS,
                "disable_lora_support": True  # Optimizes NIM when not using LoRA adapters
            }
        }
    )
    print("Created, waiting...")
POLL_INTERVAL = 10

start = time()
while True:
    deployment = nemo.deployment.model_deployments.retrieve(deployment_name=DEPLOYMENT_NAME, namespace=NAMESPACE)
    if deployment.status_details.status == 'ready':
        break
    elapsed = int(time() - start)
    print(f"\rStatus: {deployment.status_details.status} | {elapsed//60}m {elapsed%60}s", end="")
    sleep(POLL_INTERVAL)

print(f"\nDeployed | {int(time() - start)//60}m")

Health Check#

Verify the deployed model responds to requests.

# Wait 30 seconds to ensure model fully initialized
INIT_WAIT = 30
sleep(INIT_WAIT)

client = OpenAI(base_url=f"{NIM_URL}/v1", api_key="None")
response = client.embeddings.create(
    input=["Deep learning for computer vision"], 
    model=f"{NAMESPACE}/embedding-model", 
    extra_body={"input_type": "query"})

print(f"Inference OK | Embedding dim: {len(response.data[0].embedding)}")

Step 5: See the Improvement#

Now let’s run the same query against your fine-tuned model and compare to the baseline we saw earlier.

# Compare: same query, base model vs fine-tuned
query_emb = client.embeddings.create(input=[DEMO_QUERY], model=f"{NAMESPACE}/embedding-model", extra_body={"input_type": "query"}).data[0].embedding
doc_embs = [client.embeddings.create(input=[d], model=f"{NAMESPACE}/embedding-model", extra_body={"input_type": "passage"}).data[0].embedding for d in DEMO_DOCS]
scores = [(i, cosine_similarity(query_emb, doc_embs[i])) for i in range(len(DEMO_DOCS))]
FINETUNED_RANKING = sorted(scores, key=lambda x: -x[1])

print(f"Query: \"{DEMO_QUERY}\"\n")
print(f"{'Rank':<6} {'Base Model':<30} {'Fine-tuned Model':<30}")
print("-" * 66)

for rank in range(len(DEMO_DOCS)):
    b_idx, b_score = BASELINE_RANKING[rank]
    f_idx, f_score = FINETUNED_RANKING[rank]
    
    b_label = f"{DEMO_LABELS[b_idx]} [{b_score:.3f}]" + (" *" if b_idx in DEMO_RELEVANT else "")
    f_label = f"{DEMO_LABELS[f_idx]} [{f_score:.3f}]" + (" *" if f_idx in DEMO_RELEVANT else "")
    
    print(f"#{rank+1:<5} {b_label:<30} {f_label:<30}")

print("\n* = relevant paper")
print("\nThe fine-tuned model pushes 'Random Forest' down and ranks CRF papers higher.")

Step 6: Evaluate Performance#

NeMo Evaluator runs standardized benchmarks against your deployed model. Here we use SciDocs, a retrieval benchmark for scientific papers.

# Run evaluation on SciDocs (~10 min)
EVALUATE_BASELINE = False  # Set to True to evaluate base model instead of fine-tuned
TOP_K = 10  # Retrieve top 10, Recall@5 checks first 5

model_to_eval = BASE_MODEL if EVALUATE_BASELINE else f"{NAMESPACE}/embedding-model"
print(f"Evaluating: {model_to_eval}")

# Config: what benchmark to run and what metrics to compute
eval_config = {
    "type": "retriever",  # Evaluating retrieval (vs generation, classification, etc.)
    "namespace": NAMESPACE,
    "tasks": {
        "scidocs": {
            "type": "beir",  # BEIR: standard benchmark format for retrieval
            "dataset": {"files_url": "file://scidocs/"},  # Pre-loaded on cluster
            "metrics": {"recall_5": {"type": "recall_5"}}}}}

# Target: which model to evaluate and how to call it
eval_target = {
    "type": "retriever",
    "retriever": {
        "pipeline": {
            # Same model encodes both queries and documents
            "query_embedding_model": {
                "api_endpoint": {"url": f"{EVAL_NIM_URL}/v1/embeddings", "model_id": model_to_eval}},
            "index_embedding_model": {
                "api_endpoint": {"url": f"{EVAL_NIM_URL}/v1/embeddings", "model_id": model_to_eval}},
            "top_k": TOP_K}}}

eval_job = nemo.evaluation.jobs.create(config=eval_config, target=eval_target)
print(f"Job ID: {eval_job.id}")
print(f"Studio link: {NEMO_URL}/studio/projects/_/_/evaluations/{eval_job.id}")
POLL_INTERVAL = 10

start = time()
while True:
    status = nemo.evaluation.jobs.retrieve(eval_job.id)
    if status.status not in ["pending", "created", "running"]:
        break
    elapsed = int(time() - start)
    print(f"\rStatus: {status.status} | {elapsed//60}m {elapsed%60}s", end="")
    sleep(POLL_INTERVAL)

print(f"\nComplete | {int(time() - start)//60}m")
results = nemo.evaluation.jobs.results(eval_job.id)

Step 7: Results#

Compare your fine-tuned model against the pretrained baseline.

# Display results
BASELINE_RECALL = 0.159  # Pretrained model on SciDocs
finetuned_recall = results.tasks['scidocs'].metrics['retriever.recall_5'].scores['recall_5'].value
improvement = ((finetuned_recall / BASELINE_RECALL) - 1) * 100

print("=" * 60)
print("RESULTS: SciDocs Retrieval Benchmark")
print("=" * 60)
print(f"Metric: Recall@5 (relevant docs found in top 5 results)")
print()
print(f"Baseline (pretrained):  {BASELINE_RECALL:.3f}")
print(f"Fine-tuned model:       {finetuned_recall:.3f}")
print(f"Improvement:           +{improvement:.1f}%")
print("=" * 60)
print(f"\nEndpoint: {NIM_URL}/v1/embeddings")
print(f"Model: {NAMESPACE}/embedding-model")

Summary#

You fine-tuned NVIDIA’s llama-3.2-nv-embedqa-1b-v2 embedding model on 65K scientific paper triplets from SPECTER - a dataset where papers that cite each other are marked as “related.”

The base model matched documents by keyword overlap. After fine-tuning, it learned scientific paper neighborhoods: which papers actually cite each other, regardless of surface-level word matches. The demo showed this - “Random Forests” dropped in ranking because it’s unrelated to “Conditional Random Fields,” despite sharing the word “random.”

SciDocs tests retrieval across thousands of scientific queries. Recall@5 asks: “Of all relevant papers, how many appear in the top 5 results?” Your model improved from 0.159 to ~0.17, meaning 6-10% more relevant papers now surface in the top 5.

In a RAG pipeline, better retrieval means better context for the LLM and more accurate answers. Your model is deployed and ready to use.
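As a starting point, here is a minimal retrieval helper built on the deployed endpoint. It is a sketch that reuses the client, NAMESPACE, and cosine_similarity defined earlier in this notebook; the passages list and top_k are placeholders for your own corpus:

# Sketch: retrieve the top-k passages for a question using the fine-tuned embedding NIM,
# then pass the returned passages to your LLM as context in a RAG prompt.
def retrieve(question, passages, top_k=3):
    q_emb = client.embeddings.create(input=[question], model=f"{NAMESPACE}/embedding-model",
                                     extra_body={"input_type": "query"}).data[0].embedding
    p_embs = [client.embeddings.create(input=[p], model=f"{NAMESPACE}/embedding-model",
                                       extra_body={"input_type": "passage"}).data[0].embedding
              for p in passages]
    ranked = sorted(range(len(passages)), key=lambda i: -cosine_similarity(q_emb, p_embs[i]))
    return [passages[i] for i in ranked[:top_k]]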

Next Steps#

Scale Up:

  • Train on the full SPECTER dataset for additional improvement

  • Increase to 3 epochs for better convergence

Apply to Your Domain:

  • Prepare your own (query, positive, negative) triplets in the JSONL format from Step 1, then rerun the same train/deploy/evaluate workflow

Learn More:

  • Training data format for Customizer: https://docs.nvidia.com/nemo/microservices/latest/fine-tune/tutorials/format-training-dataset.html

Cleanup#

Uncomment cleanup cells as needed to delete resources.

# # Delete training jobs (PERMANENT)
# print("Deleting training jobs...")
# for j in nemo.customization.jobs.list(filter={"namespace": NAMESPACE}).data:
#     nemo.customization.jobs.delete(job_id=j.id)
# print("Training jobs deleted")
# # Delete deployment (frees GPU, keeps model for later redeployment)
# print("Deleting deployment...")
# nemo.deployment.model_deployments.delete(deployment_name=DEPLOYMENT_NAME, namespace=NAMESPACE)
# print("Deployment deleted - GPU freed")
# # Delete model (PERMANENT - must retrain to recover)
# print("Deleting model...")
# for m in nemo.models.list(filter={"namespace": NAMESPACE}).data:
#     nemo.models.delete(namespace=NAMESPACE, model_name=m.name.split('/')[-1])
# print("Model deleted")
# # Delete dataset (PERMANENT)
# print("Deleting dataset...")
# nemo.datasets.delete(namespace=NAMESPACE, dataset_name="data")
# hf.delete_repo(f"{NAMESPACE}/data", repo_type='dataset')
# print("Dataset deleted")
# # Delete configs (PERMANENT)
# print("Deleting configs...")
# nemo.customization.configs.delete(config_name="embedding-config@v1", namespace=NAMESPACE)
# for cfg in nemo.evaluation.configs.list(filter={"namespace": NAMESPACE}).data:
#     nemo.evaluation.configs.delete(config_id=cfg.id)
# print("Configs deleted")
# # Delete eval jobs (PERMANENT)
# print("Deleting eval jobs...")
# for j in nemo.evaluation.jobs.list(filter={"namespace": NAMESPACE}).data:
#     nemo.evaluation.jobs.delete(job_id=j.id)
# print("Eval jobs deleted")
# # Delete namespace (PERMANENT - deletes everything in namespace)
# print("Deleting namespace...")
# nemo.namespaces.delete(NAMESPACE)
# requests.delete(f"{NDS_URL}/v1/datastore/namespaces/{NAMESPACE}")
# print(f"Namespace '{NAMESPACE}' deleted")