# The Basics
This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.
For more detail, see the open-source library’s version of this tutorial.
## Prerequisites
Ensure you have set up inference by creating a model provider for build.nvidia.com.
## Part 1: Build the Configuration

Use the `data_designer.config` package to define your dataset schema. This code is identical whether you're using the standalone library or the NMP service.
> **Tip:** Already using the standalone library? This configuration code is identical; you can copy your existing `config_builder` code directly. Only the execution step (Part 2) differs.
### Define Models
Start by defining the models you want to use:
```python
import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/build-nvidia",  # Reference the NMP model provider
        model="nvidia/nemotron-3-nano-30b-a3b",
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)
```
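If different columns should use different models, you can register multiple aliases in the same builder. This is a sketch using only the classes shown above; the second model name and its inference parameters are hypothetical placeholders, not a recommendation:

```python
import data_designer.config as dd

model_configs = [
    dd.ModelConfig(
        provider="default/build-nvidia",
        model="nvidia/nemotron-3-nano-30b-a3b",
        alias="text",
        inference_parameters=dd.ChatCompletionInferenceParams(temperature=1.0, top_p=1.0),
    ),
    dd.ModelConfig(
        provider="default/build-nvidia",
        model="another/model-name",  # hypothetical second model
        alias="reasoning",
        inference_parameters=dd.ChatCompletionInferenceParams(temperature=0.6, top_p=0.95),
    ),
]
config_builder = dd.DataDesignerConfigBuilder(model_configs)
```

Each LLM-generated column then selects its model through the `model_alias` argument shown later in this tutorial.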
### Add Columns
Define the columns for your dataset. The library documentation explains these column types in detail.
```python
# Product category sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

# Product subcategory sampler (conditional on category)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_subcategory",
        sampler_type=dd.SamplerType.SUBCATEGORY,
        params=dd.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

# Target age range
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="target_age_range",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["18-25", "25-35", "35-50", "50-65", "65+"]),
    )
)

# Customer details using Faker
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
    )
)

# Star rating
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="number_of_stars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=5),
        convert_to="int",  # Convert the sampled float to an integer
    )
)

# Review style
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="review_style",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

# LLM-generated product name
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_name",
        prompt=(
            "You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

# LLM-generated customer review
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
            "Respond with only the review, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)
```
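The prompts above use Jinja-style `{{ column }}` placeholders that Data Designer fills with the values sampled for each row; nested fields such as `{{ customer.first_name }}` drill into structured columns. As an illustration of that substitution idea (a minimal stdlib sketch, not Data Designer's actual template engine), here is how a row's values might flow into a prompt:

```python
import re

def render(template: str, row: dict) -> str:
    """Replace {{ dotted.path }} placeholders with values from a row dict."""
    def lookup(match: re.Match) -> str:
        value = row
        for key in match.group(1).strip().split("."):
            value = value[key]  # follow nested keys, e.g. customer -> first_name
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", lookup, template)

# A hypothetical sampled row, shaped like the columns defined above
row = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "TX", "age": 34},
    "product_name": "AeroDesk",
    "number_of_stars": 4,
}
prompt = render(
    "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
    "Write a review of {{ product_name }}, rated {{ number_of_stars }} stars.",
    row,
)
print(prompt)
```

This is also why the `customer_review` prompt can reference `product_name`: Data Designer generates the product name column first and substitutes its value per row.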
## Part 2: Execute on NMP
Now submit your configuration to the Data Designer service for execution.
### Creating a Client

The `NeMoDataDesignerClient` is your interface to the Data Designer service. You can initialize it from an existing SDK instance or with direct parameters:
```python
from nemo_microservices import NeMoMicroservices
from nemo_microservices.data_designer.client import NeMoDataDesignerClient

sdk = NeMoMicroservices(
    base_url="http://localhost:8080",
    workspace="default",
)
client = NeMoDataDesignerClient(sdk=sdk)
```

Or pass connection parameters directly:

```python
from nemo_microservices.data_designer.client import NeMoDataDesignerClient

client = NeMoDataDesignerClient(
    base_url="http://localhost:8080",
    workspace="default",
)
```
### Previewing the Dataset
Use the preview method for rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:
```python
preview = client.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()
```
**Iterate:** Adjust column configurations, prompts, or parameters in your `config_builder`, then run `preview` again until you're satisfied with the results.
### Scaling Up with Jobs
When you’re happy with the preview, submit a larger generation job:
```python
job = client.create(config_builder, num_records=500)

# Block until the job completes
job.wait_until_done()

# Load the generated dataset as a pandas DataFrame
dataset = job.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = job.load_analysis()
analysis.to_report()
```
### What Happens Under the Hood
When you submit a job to the Data Designer service:
1. **Configuration Validation**: The service validates your configuration and resolves column dependencies.
2. **Job Creation**: A job is created and queued for execution.
3. **Distributed Execution**: The service orchestrates generation across multiple workers.
4. **Inference Routing**: All LLM calls are routed through the Inference Gateway to your configured model providers.
5. **Artifact Storage**: Generated datasets and analysis reports are stored in NMP artifact storage.
6. **Job Completion**: You can monitor job status and load results when complete.
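Dependency resolution is why the order in which you add columns to the builder is flexible: a column is generated only after every column its prompt or sampler references. The idea can be sketched as a topological sort over this tutorial's columns (illustrative only; the service's actual scheduler is more involved):

```python
from graphlib import TopologicalSorter

# column -> the columns it depends on, taken from the configuration in Part 1
deps = {
    "product_category": set(),
    "product_subcategory": {"product_category"},
    "target_age_range": set(),
    "customer": set(),
    "number_of_stars": set(),
    "review_style": set(),
    "product_name": {"product_category", "product_subcategory", "target_age_range"},
    "customer_review": {"customer", "product_name", "number_of_stars", "review_style"},
}

# static_order() yields every column after all of its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
```

For example, `customer_review` always lands after `product_name`, which in turn lands after the category and subcategory samplers it references.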
## Next Steps

- **Seed data**: Learn how to use external datasets in the seeding tutorial.
- **Column types**: Explore all available column types in the library documentation.
- **Advanced features**: Learn about processors and validation.