Safe Synthesizer Jobs#
NeMo Safe Synthesizer jobs orchestrate the complete pipeline from data preparation through synthesis to evaluation. Understanding the job lifecycle and configuration options is essential for effective use of the platform.
Job Lifecycle#
A NeMo Safe Synthesizer job progresses through several states:
Job States#
created: Job has been created but not yet started
pending: Job is queued and waiting for resources (GPU)
active: Job is processing your data
completed: Job finished successfully - results are ready
error: Job encountered an error - check logs for details
cancelled: Job was manually cancelled
cancelling: Job is in the process of being cancelled
paused: Job execution has been paused
pausing: Job is in the process of being paused
resuming: Job is resuming from a paused state
Job Phases#
A complete job typically includes these phases:
Data Preparation
Data validation and preprocessing
Column type inference
Grouping and ordering (if configured)
Train/test split for holdout evaluation
PII Replacement (optional)
PII detection using configured methods
Entity classification
Value transformation
Output validation
Synthesis
Training: Fine-tune LLM on prepared data
Generation: Generate synthetic records
Apply differential privacy (if enabled)
Evaluation (optional, enabled by default)
Calculate SQS metrics
Calculate DPS metrics
Generate evaluation report
Job Configuration#
Jobs are configured through a hierarchical configuration structure:
Top-Level Configuration#
{
  "name": "my-job",
  "project": "my-project",
  "spec": {
    "data_source": "fileset://default/safe-synthesizer-inputs/data.csv",
    "config": {
      # Configuration sections below
    }
  }
}
Configuration Sections#
data_prep: Grouping, ordering, holdout configuration
replace_pii: PII detection and replacement rules
training: Model selection and training parameters
generation: Synthetic data generation settings
privacy: Differential privacy parameters
evaluation: Quality and privacy assessment options
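Put together, a minimal config touching each of these sections might look like the sketch below. The section names come from the list above, but the individual parameters inside each section are illustrative assumptions, not a complete reference:

```python
# Hypothetical illustration of the configuration sections; the parameter
# names inside each section are placeholders, not an exhaustive reference.
job_config = {
    "data_prep": {"holdout": 0.05},              # train/test split for evaluation
    "replace_pii": {"enabled": True},            # PII detection and replacement
    "training": {"batch_size": 8},               # model and training parameters
    "generation": {"temperature": 0.8},          # synthetic data generation
    "privacy": {"differential_privacy": False},  # differential privacy parameters
    "evaluation": {"enabled": True},             # quality and privacy assessment
}

expected_sections = {
    "data_prep", "replace_pii", "training",
    "generation", "privacy", "evaluation",
}
assert set(job_config) == expected_sections
```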
Job Management#
Creating Jobs#
Jobs can be created using:
Python SDK: Recommended approach with SafeSynthesizerJobBuilder
REST API: Direct HTTP requests for integration
CLI: Command-line interface for scripting
Monitoring Jobs#
Track job progress through:
Status checks: Poll job state
Logs: View real-time execution logs
Events: Subscribe to job state changes (if supported)
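The status-check approach can be sketched as a small polling loop. This is a generic sketch: `fetch_status` stands in for whatever call your SDK uses to read the current job state, and the terminal-state names follow the lifecycle list above.

```python
import time

# Terminal states from the job lifecycle described above.
TERMINAL_STATES = {"completed", "error", "cancelled"}

def wait_for_job(fetch_status, poll_interval=10.0, timeout=3600.0):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is any zero-argument callable returning the current job
    state string; substitute your SDK's actual status accessor.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_status()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal state within the timeout")
```

The timeout guards against jobs stuck in `pending` (see Troubleshooting); a sensible value depends on your dataset and model size.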
Retrieving Results#
After job completion, access:
Synthetic data: Generated CSV files
Evaluation report: HTML report with scores and visualizations
Metadata: Job summary and configuration
Logs: Complete execution history
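As a sketch, retrieving and saving these artifacts locally might look like the helper below. The `fetch_data` and `fetch_report` method names are placeholders for the accessors your SDK version actually exposes:

```python
from pathlib import Path

def collect_results(job, out_dir="results"):
    """Save a finished job's artifacts to a local directory.

    NOTE: fetch_data() and fetch_report() are assumed method names;
    substitute the result accessors provided by your SDK version.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Synthetic data as CSV (assumes fetch_data returns a DataFrame-like object)
    job.fetch_data().to_csv(out / "synthetic.csv", index=False)
    # Evaluation report as HTML bytes
    (out / "report.html").write_bytes(job.fetch_report())
    return out
```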
Job Builder API#
The SafeSynthesizerJobBuilder provides a high-level interface for common workflows:
import os

import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder

# Placeholders
df: pd.DataFrame = pd.DataFrame()

client = NeMoMicroservices(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

job = builder.create_job(name="my-job", project="my-project")
The builder:
Handles data upload to filesets automatically
Provides smart defaults
Validates configuration
Returns a SafeSynthesizerJob instance for interacting with the job
Best Practices#
Resource Planning#
Larger datasets and models require more GPU memory
Training time scales with data size and model complexity
Plan for 15-60 minutes for typical jobs
Configuration#
Start with default settings
Enable PII replacement for sensitive data
Use differential privacy for maximum privacy guarantees
Adjust generation parameters based on evaluation results
Monitoring#
Check status periodically during execution
Review logs if jobs fail or take longer than expected
Use evaluation reports to iterate on configuration
Error Handling#
Common errors: insufficient GPU memory, invalid data format, configuration errors
Check logs for detailed error messages
Reduce model size or data size if resource errors occur
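A minimal error-handling pattern ties these points together. The sketch below assumes a job object with a `status` attribute and the `print_logs` method shown in the Troubleshooting section; adapt the names to your SDK version:

```python
def handle_result(job):
    """Branch on the final job state and surface logs on failure.

    NOTE: job.status and job.print_logs() follow the naming used in the
    examples on this page but may differ by SDK version.
    """
    if job.status == "completed":
        return True
    if job.status == "error":
        # Detailed error messages (OOM, invalid data format, config errors)
        # live in the job logs.
        job.print_logs()
    return False
```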
Troubleshooting#
This section covers common issues and how to diagnose them.
Viewing Job Logs#
Logs are essential for diagnosing job failures. Access them through:
Python SDK:
# Print logs to stdout
job.print_logs()

# Or iterate over log entries programmatically
for log in job.fetch_logs():
    print(log.message.strip())
Docker Compose Deployments#
When running NeMo Safe Synthesizer via Docker Compose, you can view container logs directly:
# View safe-synthesizer service logs
docker logs -f synthesis-test-20260114-051514-safe-synthesizer
# View logs with timestamps
docker logs -f --timestamps synthesis-test-20260114-051514-safe-synthesizer
# View last 100 lines
docker logs --tail 100 synthesis-test-20260114-051514-safe-synthesizer
To check the health status of containers:
docker ps
Kubernetes Deployments#
For Kubernetes deployments, use kubectl to access logs:
# List pods in your namespace
kubectl get pods -n <namespace>
# View logs for the safe-synthesizer pod
kubectl logs -f <pod-name> -n <namespace>
# View logs for a specific container in the pod
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
# View previous container logs (if container restarted)
kubectl logs --previous <pod-name> -n <namespace>
To check pod status and events:
# Describe pod for detailed status and events
kubectl describe pod <pod-name> -n <namespace>
# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Common Issues and Solutions#
Job Stuck in “Pending” State#
Symptoms: Job remains in pending state for an extended period.
Possible Causes:
No GPU resources available
Resource quota exceeded
Scheduling constraints not met
Solutions:
Check available GPU resources in your cluster
Verify resource quotas and limits
Review pod events for scheduling failures
Out of Memory (OOM) Errors#
Symptoms: Job fails with memory-related errors during training.
Possible Causes:
Dataset too large for available GPU memory
Batch size too high
Model too large for available resources
Solutions:
Reduce batch_size in training parameters
Use a smaller subset of data for initial testing
Increase gradient_accumulation_steps to maintain the effective batch size with lower memory
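The batch-size trade-off can be checked with quick arithmetic: the effective batch size is `batch_size * gradient_accumulation_steps`, so a smaller per-step batch with proportionally more accumulation steps trains equivalently while using less peak GPU memory. A sketch with placeholder values:

```python
# Placeholder values illustrating the memory/accumulation trade-off;
# the parameter names mirror the training options mentioned above.
original = {"batch_size": 16, "gradient_accumulation_steps": 1}
reduced = {"batch_size": 4, "gradient_accumulation_steps": 4}

def effective_batch(params):
    """Effective batch size seen by the optimizer per update step."""
    return params["batch_size"] * params["gradient_accumulation_steps"]

# Same effective batch size, but the reduced config holds only a quarter
# of the records in GPU memory at once.
assert effective_batch(original) == effective_batch(reduced) == 16
```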
Invalid Data Format Errors#
Symptoms: Job fails during data preparation phase.
Possible Causes:
CSV file has encoding issues
Missing or malformed columns
Unsupported data types
Solutions:
Ensure CSV is UTF-8 encoded
Validate column names don’t contain special characters
Check for null values or inconsistent data types
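A lightweight pre-flight check can catch most of these issues before upload. The sketch below uses only the standard library and approximates, rather than reproduces, the service's actual validation:

```python
import csv
import re

def preflight_csv(path):
    """Basic pre-upload checks: UTF-8 decodable, simple column names,
    consistent row widths. A sketch, not the service's real validation."""
    problems = []
    try:
        with open(path, encoding="utf-8", newline="") as f:
            rows = list(csv.reader(f))
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    if not rows:
        return ["file is empty"]
    header = rows[0]
    for col in header:
        # Flag column names containing special characters
        if not re.fullmatch(r"[A-Za-z0-9_ ]+", col):
            problems.append(f"column name may cause issues: {col!r}")
    width = len(header)
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} fields, expected {width}")
    return problems
```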
Generation Quality Issues#
Symptoms: Generated synthetic data has poor quality or many invalid records.
Possible Causes:
Insufficient training (low num_input_records_to_sample)
Temperature too high or too low
Data has complex patterns that need more training
Solutions:
Increase num_input_records_to_sample for more training
Adjust temperature (try the 0.7-1.0 range)
Enable use_structured_generation for better format adherence
Review the evaluation report for specific quality issues