Tutorials

These tutorials demonstrate how to use the Data Designer service on NMP.

Library vs. Service

Data Designer separates configuration (building dataset schemas) from execution (generating the data).

Part 1: Build Configs (Library)

Use data_designer.config to define your dataset. See the library documentation for comprehensive guides on column types, constraints, and processors.

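import data_designer.config as dd

# model_configs is assumed to be defined earlier (your model configurations).
config_builder = dd.DataDesignerConfigBuilder(model_configs)
# Column arguments are elided here; see the library documentation for the options.
config_builder.add_column(dd.SamplerColumnConfig(...))
config_builder.add_column(dd.LLMTextColumnConfig(...))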

Part 2: Execute (Microservice)

Submit your configuration to the Data Designer service for execution:

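from nemo_microservices.data_designer.client import NeMoDataDesignerClient

# Replace "..." with the base URL of your deployment.
client = NeMoDataDesignerClient(base_url="...", workspace="default")
# Run a preview to check the configuration before submitting a full job.
preview = client.preview(config_builder)
# Submit a job that generates 1,000 records.
job = client.create(config_builder, num_records=1000)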

Tip

If you already use the standalone library, migration is simple: your configuration code stays identical, and only the execution client changes. See the migration guide for details.
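
The split between configuration and execution can be illustrated with a toy sketch that uses only the Python standard library. Every name in it (ToyConfig, ToyLocalExecutor, ToyRemoteExecutor) is hypothetical and invented for illustration; none of it is the Data Designer API. It only shows the general idea that one configuration object can be handed unchanged to different executors.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a dataset configuration (not the Data Designer API)."""

    columns: dict = field(default_factory=dict)

    def add_column(self, name, choices):
        # Each toy column samples uniformly from a fixed list of values.
        self.columns[name] = list(choices)
        return self


class ToyLocalExecutor:
    """Hypothetical executor that generates rows in-process."""

    def create(self, config, num_records, seed=0):
        rng = random.Random(seed)
        return [
            {name: rng.choice(values) for name, values in config.columns.items()}
            for _ in range(num_records)
        ]


class ToyRemoteExecutor:
    """Hypothetical executor that would submit the config to a service.

    It only records the request, so the example stays runnable offline.
    """

    def __init__(self):
        self.submitted = []

    def create(self, config, num_records):
        request = {"columns": dict(config.columns), "num_records": num_records}
        self.submitted.append(request)
        return request


# The configuration is built once...
config = (
    ToyConfig()
    .add_column("category", ["books", "games"])
    .add_column("rating", [1, 2, 3, 4, 5])
)

# ...and handed unchanged to either executor.
rows = ToyLocalExecutor().create(config, num_records=3)
remote = ToyRemoteExecutor()
remote.create(config, num_records=3)
```

In this sketch, switching from local to remote execution changes only which executor object is constructed; the code that builds the configuration is untouched.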

Service-Specific Considerations

When you run Data Designer as an NMP service, it differs from the standalone library in the following ways:

Feature      Difference                         Details
-----------  ---------------------------------  -----------------------------------------------------------
Inference    Routes through Inference Gateway   Configure model providers once, reference by name
Seed data    Remote sources only                Use HuggingFace or NMP Filesets (no local files/DataFrames)
Validators   Code & HTTP only                   Custom Python function validators not supported
Artifacts    NMP artifact storage               Results stored in NMP, not local filesystem
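
As a rough illustration of the seed data and validator constraints listed above, the following sketch checks a plain dictionary for features the service does not accept. The dictionary layout, the string labels, and the function name service_issues are all assumptions made for this example; they are not the Data Designer configuration schema or API.

```python
# Hypothetical preflight check. The dictionary layout and labels below are
# assumptions for illustration only, not the Data Designer config schema.
REMOTE_SEED_SOURCES = {"huggingface", "nmp_fileset"}
SUPPORTED_VALIDATORS = {"code", "http"}


def service_issues(plan):
    """Return a list of human-readable problems found in a toy plan dict."""
    issues = []
    seed = plan.get("seed_source")
    # Seed data must come from a remote source when running as a service.
    if seed is not None and seed not in REMOTE_SEED_SOURCES:
        issues.append(f"seed source '{seed}' is not a remote source")
    # Only code and HTTP validators are available in the service.
    for validator in plan.get("validators", []):
        if validator not in SUPPORTED_VALIDATORS:
            issues.append(f"validator '{validator}' is not supported")
    return issues


local_plan = {"seed_source": "local_file", "validators": ["code", "python_function"]}
service_plan = {"seed_source": "huggingface", "validators": ["code", "http"]}

print(service_issues(local_plan))
print(service_issues(service_plan))
```

A check like this could run before submission so that unsupported features are caught early, though in practice the service's own validation is the authority.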

Prerequisites

Before starting these tutorials, complete the Quick Start guide to:

  • Install the NMP SDK

  • Set up inference with a model provider

  • Understand the basic workflow

Tutorials

The Basics

Generate a product review dataset using samplers and LLM-generated text. Learn the fundamentals of building configurations and executing jobs.

The Basics

Seeding

Use external datasets to ground synthetic data generation. Generate realistic patient medical notes from symptom-to-diagnosis data.

Seeding with External Datasets