Seeding with External Datasets#

This tutorial demonstrates how to use external datasets as seed data for synthetic data generation in Data Designer.

For more detail, see the open-source library’s version of this tutorial.

Seed Sources#

The Data Designer service supports two types of seed sources:

| Seed Source | Description | Use Case |
| --- | --- | --- |
| HuggingFace | Public or private datasets from HuggingFace | Publicly available datasets or your private HF datasets |
| NMP Filesets | Files uploaded to the NMP Files service | Your own data files (CSV, Parquet, etc.) |

Note: The standalone library’s LocalFileSeedSource and DataFrameSeedSource are not supported by the NMP service. Upload your data to NMP Filesets instead.

HuggingFace Datasets#

Use HuggingFaceSeedSource to load data from HuggingFace:

import data_designer.config as dd

# Public dataset
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/username/dataset/data/*.parquet"
)

# Private dataset (requires token)
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/username/dataset/data/*.parquet",
    token="default/hf-token"  # Reference to NMP secret
)

config_builder.with_seed_dataset(seed_source)

For private datasets, first create a secret that stores your HuggingFace token. The token argument above then references that secret (here, default/hf-token):

sdk.secrets.create(
    name="hf-token",
    data="<your-huggingface-token>",
    description="HuggingFace access token"
)

NMP Filesets#

Use FilesetFileSeedSource to load data from the NMP Files service:

from nemo_microservices.data_designer.plugins.fileset_file_seed_source import FilesetFileSeedSource

seed_source = FilesetFileSeedSource(
    path="default/my-fileset#data.parquet"  # Format: workspace/fileset#file-path
)

config_builder.with_seed_dataset(seed_source)

Path format:

  • Fully qualified: workspace/fileset-name#file-path (recommended)

  • Implicit workspace: fileset-name#file-path (uses client’s workspace)
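
The two path forms above can be checked locally before you submit a job. The following is a minimal sketch, not part of the NMP SDK: `FilesetPath` and `parse_fileset_path` are hypothetical names introduced for illustration, and the sketch assumes that the first `/` before the `#` separates the workspace from the fileset name.

```python
from typing import NamedTuple, Optional


class FilesetPath(NamedTuple):
    """Parts of a seed path (hypothetical container, not an SDK type)."""
    workspace: str
    fileset: str
    file_path: str


def parse_fileset_path(path: str, default_workspace: Optional[str] = None) -> FilesetPath:
    """Split 'workspace/fileset#file-path' or 'fileset#file-path' into parts."""
    # Everything before the first '#' names the fileset; the rest is the file path.
    location, sep, file_path = path.partition("#")
    if not sep or not location or not file_path:
        raise ValueError(f"expected '[workspace/]fileset#file-path', got {path!r}")

    workspace, slash, fileset = location.partition("/")
    if not slash:
        # Implicit workspace form: fall back to the caller-supplied default.
        if default_workspace is None:
            raise ValueError("path has no workspace and no default_workspace was given")
        workspace, fileset = default_workspace, location
    if not workspace or not fileset:
        raise ValueError(f"empty workspace or fileset name in {path!r}")
    return FilesetPath(workspace, fileset, file_path)


parts = parse_fileset_path("default/my-fileset#data.parquet")
print(parts.workspace, parts.fileset, parts.file_path)
```

Passing `default_workspace` mimics the implicit form, where the client’s workspace fills in the missing prefix.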

Prerequisites#

Ensure you have set up inference by creating a model provider for build.nvidia.com.

Example: Medical Notes from Symptom Data#

This example generates realistic patient medical notes by seeding with publicly available symptom-to-diagnosis data.

Step 1: Upload Seed Data#

Upload the symptom-to-diagnosis dataset to an NMP Fileset:

import os
import tempfile
import urllib.request
from nemo_microservices import NeMoMicroservices

WORKSPACE = "default"
FILESET_NAME = "seed-data"
FILE_PATH = "symptom_to_diagnosis.csv"

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"], workspace=WORKSPACE)

# Create fileset
sdk.filesets.create(name=FILESET_NAME)

# Download and upload seed data
with tempfile.NamedTemporaryFile(suffix=".csv") as tmpfile:
    url = "https://raw.githubusercontent.com/NVIDIA/GenerativeAIExamples/refs/heads/main/nemo/NeMo-Data-Designer/data/gretelai_symptom_to_diagnosis.csv"
    urllib.request.urlretrieve(url, tmpfile.name)

    sdk.filesets.files.upload(
        fileset=FILESET_NAME,
        local_path=tmpfile.name,
        remote_path=FILE_PATH,
    )
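
Before uploading, it can help to confirm that the CSV actually contains the columns the later prompts rely on. The sketch below uses only the standard library. `missing_seed_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the required set is an assumption based on the two seed columns this tutorial references (diagnosis and patient_summary).

```python
import csv

# Assumption: these are the seed columns the tutorial's prompt references.
REQUIRED_COLUMNS = {"diagnosis", "patient_summary"}


def missing_seed_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    return set(required) - {name.strip() for name in header}
```

An empty result means every required column is present. In the upload step you could call `missing_seed_columns(tmpfile.name)` after the download and stop early if anything is missing.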

Step 2: Build Configuration#

Define your models and create a config builder:

import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/build-nvidia",
        model="nvidia/nemotron-3-nano-30b-a3b",
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)

Step 3: Configure Seed Dataset#

Add the seed data to your configuration:

from nemo_microservices.data_designer.plugins.fileset_file_seed_source import FilesetFileSeedSource

config_builder.with_seed_dataset(
    FilesetFileSeedSource(path=f"{WORKSPACE}/{FILESET_NAME}#{FILE_PATH}")
)

What this does: The seed dataset’s columns (diagnosis, patient_summary, etc.) are automatically added to your dataset and available for use in other columns.

Step 4: Add Synthetic Columns#

Add columns that reference and extend the seed data:

# Patient details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Doctor details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="doctor_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Patient ID
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_id",
        sampler_type=dd.SamplerType.UUID,
        params=dd.UUIDSamplerParams(
            prefix="PT-",
            short_form=True,
            uppercase=True,
        ),
    )
)

# Extract patient name and date of birth from the sampler
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="first_name",
        expr="{{ patient_sampler.first_name }}"
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="last_name",
        expr="{{ patient_sampler.last_name }}"
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="dob",
        expr="{{ patient_sampler.birth_date }}"
    )
)

# Symptom onset date
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="symptom_onset_date",
        sampler_type=dd.SamplerType.DATETIME,
        params=dd.DatetimeSamplerParams(start="2024-01-01", end="2024-12-31"),
    )
)

# Visit date (1-30 days after symptom onset)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="date_of_visit",
        sampler_type=dd.SamplerType.TIMEDELTA,
        params=dd.TimeDeltaSamplerParams(
            dt_min=1,
            dt_max=30,
            reference_column_name="symptom_onset_date"
        ),
    )
)

# Physician name
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="physician",
        expr="Dr. {{ doctor_sampler.last_name }}"
    )
)

# LLM-generated physician notes
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.

{{ patient_summary }}

Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.

Format the notes as a busy doctor might.
Respond with only the notes, no other text.
""",
        model_alias=MODEL_ALIAS,
    )
)

Note: The {{ diagnosis }} and {{ patient_summary }} variables come from the seed dataset columns.
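
Because a prompt can mix seed columns and synthetic columns, a quick local check for misspelled references may save a failed run. The sketch below is an illustration only: `referenced_columns` and `unresolved_references` are hypothetical helpers, and the regular expression handles simple `{{ name }}` and `{{ name.attr }}` references, not full Jinja expressions or filters.

```python
import re

# Captures the root name of a reference such as {{ diagnosis }}
# or {{ doctor_sampler.last_name }} (root: doctor_sampler).
_VAR_PATTERN = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def referenced_columns(template):
    """Return the set of root column names a template refers to."""
    return set(_VAR_PATTERN.findall(template))


def unresolved_references(template, available_columns):
    """Return referenced names that are neither seed nor synthetic columns."""
    return referenced_columns(template) - set(available_columns)


prompt = "Notes for {{ first_name }} about {{ diagnosis }} by Dr. {{ doctor_sampler.last_name }}."
known = {"first_name", "doctor_sampler", "diagnosis", "patient_summary"}
print(unresolved_references(prompt, known))  # empty set when every name is known
```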

Step 5: Execute on NMP#

Create a client:

from nemo_microservices.data_designer.client import NeMoDataDesignerClient

# Using the sdk instance from Step 1
client = NeMoDataDesignerClient(sdk=sdk)

Previewing the Dataset#

Use the preview method to generate a small sample of records and iterate quickly on your configuration:

preview = client.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()

Generating the Full Dataset#

When you’re satisfied with the preview, submit a larger generation job:

job = client.create(config_builder, num_records=500)

# Block until the job completes
job.wait_until_done()

# Load the generated dataset as a pandas DataFrame
dataset = job.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = job.load_analysis()
analysis.to_report()

How Seeding Works#

When you configure a seed dataset:

  1. Automatic Column Addition: All columns from the seed data are automatically added to your dataset schema

  2. Dependency Resolution: Data Designer resolves dependencies between seed columns and synthetic columns

  3. Execution Order: Seed data is loaded first, then synthetic columns are generated row-by-row

  4. Row Alignment: Each generated row corresponds to one row from the seed dataset
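
The dependency-resolution step can be pictured as a topological sort over column references. The sketch below is a conceptual model, not Data Designer’s implementation: `generation_order` is a hypothetical function, references are extracted with a simple regular expression, and `graphlib` requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter

# Root name of each {{ ... }} reference, e.g. patient_sampler in
# {{ patient_sampler.first_name }}.
_VAR_PATTERN = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def generation_order(seed_columns, templated_columns):
    """Order columns so that each one comes after every column it references.

    seed_columns: names loaded from the seed dataset (no dependencies).
    templated_columns: mapping of column name -> template string.
    """
    # Map each column to the set of columns it depends on.
    graph = {name: set() for name in seed_columns}
    for name, template in templated_columns.items():
        graph[name] = set(_VAR_PATTERN.findall(template))
    # static_order() yields dependencies first and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(
    ["diagnosis", "patient_summary"],
    {
        "first_name": "{{ patient_sampler.first_name }}",
        "physician_notes": "{{ first_name }} has {{ diagnosis }}. {{ patient_summary }}",
    },
)
print(order)
```

In this model, seed columns have no dependencies, so they always sort ahead of any column that references them.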

Example: If your seed data has 100 rows with columns diagnosis and patient_summary, and you request 100 records, each generated record will include the seed columns plus any synthetic columns you defined.
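
As a rough mental model of row alignment, each seed row can be thought of as the starting point for one output record. The sketch below is a simplification under stated assumptions, not the service’s implementation: `generate_records` is a hypothetical function, synthetic columns are plain Python callables rather than samplers or LLM calls, and the sketch does not cover requesting more records than the seed data has rows.

```python
def generate_records(seed_rows, synthetic_columns):
    """Build one record per seed row: seed columns first, then synthetic ones.

    seed_rows: list of dicts, one per row of seed data.
    synthetic_columns: mapping of column name -> function(record) -> value,
        listed in an order where each column's dependencies come first.
    """
    records = []
    for seed_row in seed_rows:
        record = dict(seed_row)  # copy so the seed data is left untouched
        for name, make_value in synthetic_columns.items():
            # Each synthetic column can read seed columns and earlier columns.
            record[name] = make_value(record)
        records.append(record)
    return records


seed_rows = [
    {"diagnosis": "influenza", "patient_summary": "Fever and body aches."},
    {"diagnosis": "migraine", "patient_summary": "Throbbing headache."},
]
synthetic_columns = {
    "physician": lambda row: "Dr. Example",
    "note_header": lambda row: f"{row['physician']}: {row['diagnosis']}",
}
records = generate_records(seed_rows, synthetic_columns)
print(records[0]["note_header"])
```

Every record keeps its seed columns unchanged and gains the synthetic columns alongside them.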

Next Steps#