Seeding with External Datasets#
This tutorial demonstrates how to use external datasets as seed data for synthetic data generation in Data Designer.
For more detail, see the open-source library’s version of this tutorial.
Seed Sources#
The Data Designer service supports two types of seed sources:
| Seed Source | Description | Use Case |
|---|---|---|
| HuggingFace | Public or private datasets from HuggingFace | Publicly available datasets or your private HF datasets |
| NMP Filesets | Files uploaded to the NMP Files service | Your own data files (CSV, Parquet, etc.) |
Note
The standalone library’s LocalFileSeedSource and DataFrameSeedSource are not supported by the NMP service. Upload your data to NMP Filesets instead.
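If your data lives in a local file or DataFrame, write it to disk and upload it to a fileset first. Below is a minimal sketch, assuming an sdk client like the one created in Step 1 of the example later on this page; the DataFrame contents, fileset name, and file names are placeholders:
import pandas as pd

# Placeholder data standing in for your local DataFrame
df = pd.DataFrame({"diagnosis": ["allergy"], "patient_summary": ["Sneezing and itchy eyes for two days."]})
df.to_parquet("seed.parquet", index=False)

# Upload the file to an NMP fileset so it can be used as a seed source
sdk.filesets.create(name="my-seed-data")
sdk.filesets.files.upload(
    fileset="my-seed-data",
    local_path="seed.parquet",
    remote_path="seed.parquet",
)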
HuggingFace Datasets#
Use HuggingFaceSeedSource to load data from HuggingFace:
import data_designer.config as dd
# Public dataset
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/username/dataset/data/*.parquet"
)

# Private dataset (requires token)
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/username/dataset/data/*.parquet",
    token="default/hf-token"  # Reference to NMP secret
)
config_builder.with_seed_dataset(seed_source)
For private datasets, first create a secret that stores your HuggingFace token:
sdk.secrets.create(
    name="hf-token",
    data="<your-huggingface-token>",
    description="HuggingFace access token"
)
NMP Filesets#
Use FilesetFileSeedSource to load data from the NMP Files service:
from nemo_microservices.data_designer.plugins.fileset_file_seed_source import FilesetFileSeedSource
seed_source = FilesetFileSeedSource(
    path="default/my-fileset#data.parquet"  # Format: workspace/fileset#file-path
)
config_builder.with_seed_dataset(seed_source)
Path format:
- Fully qualified: workspace/fileset-name#file-path (recommended)
- Implicit workspace: fileset-name#file-path (uses the client’s workspace)
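For example, with the workspace, fileset, and file used in the tutorial below, the two forms look like this:
# Fully qualified: workspace/fileset-name#file-path
seed_source = FilesetFileSeedSource(path="default/seed-data#symptom_to_diagnosis.csv")

# Implicit workspace: fileset-name#file-path (resolved against the client's workspace)
seed_source = FilesetFileSeedSource(path="seed-data#symptom_to_diagnosis.csv")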
Prerequisites#
Ensure you have set up inference by creating a model provider for build.nvidia.com.
Example: Medical Notes from Symptom Data#
This example generates realistic patient medical notes by seeding with publicly available symptom-to-diagnosis data.
Step 1: Upload Seed Data#
Upload the symptom-to-diagnosis dataset to an NMP Fileset:
import os
import tempfile
import urllib.request
from nemo_microservices import NeMoMicroservices
WORKSPACE = "default"
FILESET_NAME = "seed-data"
FILE_PATH = "symptom_to_diagnosis.csv"
sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"], workspace=WORKSPACE)
# Create fileset
sdk.filesets.create(name=FILESET_NAME)
# Download and upload seed data
with tempfile.NamedTemporaryFile(suffix=".csv") as tmpfile:
    url = "https://raw.githubusercontent.com/NVIDIA/GenerativeAIExamples/refs/heads/main/nemo/NeMo-Data-Designer/data/gretelai_symptom_to_diagnosis.csv"
    urllib.request.urlretrieve(url, tmpfile.name)
    sdk.filesets.files.upload(
        fileset=FILESET_NAME,
        local_path=tmpfile.name,
        remote_path=FILE_PATH,
    )
Step 2: Build Configuration#
Define your models and create a config builder:
import data_designer.config as dd
MODEL_ALIAS = "text"
model_configs = [
    dd.ModelConfig(
        provider="default/build-nvidia",
        model="nvidia/nemotron-3-nano-30b-a3b",
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]
config_builder = dd.DataDesignerConfigBuilder(model_configs)
Step 3: Configure Seed Dataset#
Add the seed data to your configuration:
from nemo_microservices.data_designer.plugins.fileset_file_seed_source import FilesetFileSeedSource
config_builder.with_seed_dataset(
    FilesetFileSeedSource(path=f"{WORKSPACE}/{FILESET_NAME}#{FILE_PATH}")
)
What this does: The seed dataset’s columns (diagnosis, patient_summary, etc.) are automatically added to your dataset and available for use in other columns.
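For example, a seed column can be referenced by name in any downstream template, exactly like a synthetic column. The column below is a hypothetical illustration and is not part of this tutorial’s schema:
# Hypothetical column that formats the seed column `diagnosis`
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="diagnosis_label",
        expr="Diagnosis: {{ diagnosis }}",
    )
)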
Step 4: Add Synthetic Columns#
Add columns that reference and extend the seed data:
# Patient details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Doctor details
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="doctor_sampler",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

# Patient ID
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient_id",
        sampler_type=dd.SamplerType.UUID,
        params=dd.UUIDSamplerParams(
            prefix="PT-",
            short_form=True,
            uppercase=True,
        ),
    )
)
# Extract patient name from sampler
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="first_name",
        expr="{{ patient_sampler.first_name }}"
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="last_name",
        expr="{{ patient_sampler.last_name }}"
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="dob",
        expr="{{ patient_sampler.birth_date }}"
    )
)
# Symptom onset date
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="symptom_onset_date",
        sampler_type=dd.SamplerType.DATETIME,
        params=dd.DatetimeSamplerParams(start="2024-01-01", end="2024-12-31"),
    )
)

# Visit date (1-30 days after symptom onset)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="date_of_visit",
        sampler_type=dd.SamplerType.TIMEDELTA,
        params=dd.TimeDeltaSamplerParams(
            dt_min=1,
            dt_max=30,
            reference_column_name="symptom_onset_date"
        ),
    )
)
# Physician name
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="physician",
        expr="Dr. {{ doctor_sampler.last_name }}"
    )
)

# LLM-generated physician notes
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.
{{ patient_summary }}
Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.
Format the notes as a busy doctor might.
Respond with only the notes, no other text.
""",
        model_alias=MODEL_ALIAS,
    )
)
Note: The {{ diagnosis }} and {{ patient_summary }} variables come from the seed dataset columns.
Step 5: Execute on NMP#
Create a client:
from nemo_microservices.data_designer.client import NeMoDataDesignerClient
# Using the sdk instance from Step 1
client = NeMoDataDesignerClient(sdk=sdk)
Previewing the Dataset#
Use the preview method for rapid iteration:
preview = client.preview(config_builder)
# Display a random sample record
preview.display_sample_record()
# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())
# View statistical analysis
preview.analysis.to_report()
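As a quick sanity check that the seed columns flowed through alongside the synthetic ones, you can inspect the preview DataFrame directly (plain pandas; the column names come from this tutorial’s seed data and configuration):
# Seed columns (diagnosis, patient_summary) appear next to synthetic columns
print(df.columns.tolist())
print(df[["diagnosis", "patient_summary", "physician_notes"]].head())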
Generating the Full Dataset#
When you’re satisfied with the preview, submit a larger generation job:
job = client.create(config_builder, num_records=500)
# Block until the job completes
job.wait_until_done()
# Load the generated dataset as a pandas DataFrame
dataset = job.load_dataset()
print(dataset.head())
# Load the full analysis report
analysis = job.load_analysis()
analysis.to_report()
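If you want to keep a local copy of the results, the returned pandas DataFrame can be written out as usual; the file name below is just an example:
# Save the generated dataset locally (Parquet preserves column types)
dataset.to_parquet("synthetic_medical_notes.parquet", index=False)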
How Seeding Works#
When you configure a seed dataset:
- Automatic Column Addition: All columns from the seed data are automatically added to your dataset schema
- Dependency Resolution: Data Designer resolves dependencies between seed columns and synthetic columns
- Execution Order: Seed data is loaded first, then synthetic columns are generated row by row
- Row Alignment: Each generated row corresponds to one row from the seed dataset
Example: If your seed data has 100 rows with columns diagnosis and patient_summary, and you request 100 records, each generated record will include the seed columns plus any synthetic columns you defined.
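A minimal way to see this pairing in the generated output (plain pandas, assuming the job from Step 5 completed):
# Each generated record keeps the seed values it was built from
row = dataset.iloc[0]
print(row["diagnosis"])
print(row["physician_notes"][:200])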
Next Steps#
Column types: Explore all available column types in the library documentation
Processors: Transform your data with processors in the library documentation