Safe Synthesizer Jobs#

NeMo Safe Synthesizer jobs orchestrate the complete pipeline from data preparation through synthesis to evaluation. Understanding the job lifecycle and configuration options is essential for effective use of the platform.

Job Lifecycle#

A NeMo Safe Synthesizer job progresses through several states:

Job States#

  • created: Job has been created but not yet started

  • pending: Job is queued and waiting for resources (GPU)

  • active: Job is processing your data

  • completed: Job finished successfully; results are ready

  • error: Job encountered an error; check logs for details

  • cancelled: Job was manually cancelled

  • cancelling: Job is in the process of being cancelled

  • paused: Job execution has been paused

  • pausing: Job is in the process of being paused

  • resuming: Job is resuming from a paused state
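
Only three of these states are final. As a minimal sketch, telling terminal from in-flight states when watching a job looks like this (the SDK may ship its own helper for this; the function below is illustrative):

```python
# Terminal states, taken from the list above. Every other state
# (created, pending, active, paused, and the transitional -ing states)
# means the job can still change state.
TERMINAL_STATES = {"completed", "error", "cancelled"}

def is_finished(state: str) -> bool:
    """True once a job has reached a state it will not leave."""
    return state in TERMINAL_STATES
```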

Job Phases#

A complete job typically includes these phases:

  1. Data Preparation

    • Data validation and preprocessing

    • Column type inference

    • Grouping and ordering (if configured)

    • Train/test split for holdout evaluation

  2. PII Replacement (optional)

    • PII detection using configured methods

    • Entity classification

    • Value transformation

    • Output validation

  3. Synthesis

    • Training: Fine-tune LLM on prepared data

    • Generation: Generate synthetic records

    • Apply differential privacy (if enabled)

  4. Evaluation (enabled by default; can be disabled)

    • Calculate SQS metrics

    • Calculate DPS metrics

    • Generate evaluation report

Job Configuration#

Jobs are configured through a hierarchical configuration structure:

Top-Level Configuration#

{
    "name": "my-job",
    "project": "my-project",
    "spec": {
        "data_source": "fileset://default/safe-synthesizer-inputs/data.csv",
        "config": {
            // Configuration sections described below
        }
    }
}

Configuration Sections#

  • data_prep: Grouping, ordering, holdout configuration

  • replace_pii: PII detection and replacement rules

  • training: Model selection and training parameters

  • generation: Synthetic data generation settings

  • privacy: Differential privacy parameters

  • evaluation: Quality and privacy assessment options
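
As a sketch, the six sections nest under `spec` → `config` like this. The section names come from this page; the sections are shown empty because their individual fields are not part of this overview and vary by workflow:

```python
# Skeleton of the "config" object. Each section is optional and holds the
# settings described above; the empty dicts are placeholders, not defaults.
config = {
    "data_prep": {},    # grouping, ordering, holdout configuration
    "replace_pii": {},  # PII detection and replacement rules
    "training": {},     # model selection and training parameters
    "generation": {},   # synthetic data generation settings
    "privacy": {},      # differential privacy parameters
    "evaluation": {},   # quality and privacy assessment options
}
```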

Job Management#

Creating Jobs#

Jobs can be created using:

  • Python SDK: Recommended approach with SafeSynthesizerJobBuilder

  • REST API: Direct HTTP requests for integration

  • CLI: Command-line interface for scripting

Monitoring Jobs#

Track job progress through:

  • Status checks: Poll job state

  • Logs: View real-time execution logs

  • Events: Subscribe to job state changes (if supported)
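
A simple status-polling loop can be sketched as follows. The terminal states come from the Job States list above; `fetch_status()` is a hypothetical accessor standing in for however the SDK exposes the current job state, so check the SDK reference for the real method:

```python
import time

TERMINAL = {"completed", "error", "cancelled"}

def wait_for_job(job, poll_seconds: float = 30.0) -> str:
    """Block until the job reaches a terminal state, then return it."""
    while True:
        status = job.fetch_status()  # hypothetical accessor (see SDK reference)
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
```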

Retrieving Results#

After job completion, access:

  • Synthetic data: Generated CSV files

  • Evaluation report: HTML report with scores and visualizations

  • Metadata: Job summary and configuration

  • Logs: Complete execution history
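
Gathering these artifacts might look like the sketch below. `fetch_logs()` appears in the Troubleshooting section of this page; `fetch_data()` and `download_report()` are hypothetical placeholder names for the synthetic-data and report accessors, so consult the SDK reference for the real ones:

```python
def collect_results(job) -> dict:
    """Collect a finished job's outputs in one place (illustrative)."""
    return {
        "data": job.fetch_data(),                           # hypothetical accessor
        "report_path": job.download_report("report.html"),  # hypothetical accessor
        "logs": [log.message.strip() for log in job.fetch_logs()],
    }
```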

Job Builder API#

The SafeSynthesizerJobBuilder provides a high-level interface for common workflows:

import os
import pandas as pd

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder

# Placeholders
df: pd.DataFrame = pd.DataFrame()
client = NeMoMicroservices(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)
job = builder.create_job(name="my-job", project="my-project")

The builder:

  • Handles data upload to filesets automatically

  • Provides smart defaults

  • Validates configuration

  • Returns a SafeSynthesizerJob instance for interacting with the job

Best Practices#

Resource Planning#

  • Larger datasets and models require more GPU memory

  • Training time scales with data size and model complexity

  • Plan for 15-60 minutes for typical jobs

Configuration#

  • Start with default settings

  • Enable PII replacement for sensitive data

  • Use differential privacy for maximum privacy guarantees

  • Adjust generation parameters based on evaluation results

Monitoring#

  • Check status periodically during execution

  • Review logs if jobs fail or take longer than expected

  • Use evaluation reports to iterate on configuration

Error Handling#

  • Common errors include insufficient GPU memory, invalid input data format, and misconfiguration

  • Check logs for detailed error messages

  • Reduce model size or data size if resource errors occur
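
A defensive wrap-up around a finished job can be sketched like this. `print_logs()` is shown in the Troubleshooting section below; reading the state as a plain `job.status` attribute is an assumption about the SDK surface:

```python
def check_outcome(job) -> bool:
    """Return True on success; print logs and return False on error."""
    if job.status == "error":       # assumed attribute; see SDK reference
        job.print_logs()            # detailed error messages live in the logs
        return False
    return job.status == "completed"
```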

Troubleshooting#

This section covers common issues and how to diagnose them.

Viewing Job Logs#

Logs are essential for diagnosing job failures. Access them through:

Python SDK:

# Print logs to stdout
job.print_logs()

# Or iterate over log entries programmatically
for log in job.fetch_logs():
    print(log.message.strip())

Docker Compose Deployments#

When running NeMo Safe Synthesizer via Docker Compose, you can view container logs directly:

# View safe-synthesizer service logs
docker logs -f synthesis-test-20260114-051514-safe-synthesizer

# View logs with timestamps
docker logs -f --timestamps synthesis-test-20260114-051514-safe-synthesizer

# View last 100 lines
docker logs --tail 100 synthesis-test-20260114-051514-safe-synthesizer

To check the health status of containers:

docker ps

Kubernetes Deployments#

For Kubernetes deployments, use kubectl to access logs:

# List pods in your namespace
kubectl get pods -n <namespace>

# View logs for the safe-synthesizer pod
kubectl logs -f <pod-name> -n <namespace>

# View logs for a specific container in the pod
kubectl logs -f <pod-name> -c <container-name> -n <namespace>

# View previous container logs (if container restarted)
kubectl logs --previous <pod-name> -n <namespace>

To check pod status and events:

# Describe pod for detailed status and events
kubectl describe pod <pod-name> -n <namespace>

# Check events in the namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Common Issues and Solutions#

Job Stuck in “Pending” State#

Symptoms: Job remains in pending state for an extended period.

Possible Causes:

  • No GPU resources available

  • Resource quota exceeded

  • Scheduling constraints not met

Solutions:

  • Check available GPU resources in your cluster

  • Verify resource quotas and limits

  • Review pod events for scheduling failures

Out of Memory (OOM) Errors#

Symptoms: Job fails with memory-related errors during training.

Possible Causes:

  • Dataset too large for available GPU memory

  • Batch size too high

  • Model too large for available resources

Solutions:

  • Reduce batch_size in training parameters

  • Use a smaller subset of data for initial testing

  • Increase gradient_accumulation_steps to maintain effective batch size with lower memory
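
The batch-size trade-off above can be illustrated numerically: halving the per-step batch halves peak memory per step, and doubling `gradient_accumulation_steps` restores the same effective batch. Parameter names follow this page; the values are invented for the example:

```python
# Illustrative training overrides for memory pressure. The effective batch
# size is batch_size * gradient_accumulation_steps, so these two changes
# cancel out and training dynamics stay comparable.
training_overrides = {
    "batch_size": 4,                   # was 8: half the per-step memory
    "gradient_accumulation_steps": 2,  # 4 * 2 = same effective batch of 8
}

effective_batch = (
    training_overrides["batch_size"]
    * training_overrides["gradient_accumulation_steps"]
)
```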

Invalid Data Format Errors#

Symptoms: Job fails during data preparation phase.

Possible Causes:

  • CSV file has encoding issues

  • Missing or malformed columns

  • Unsupported data types

Solutions:

  • Ensure CSV is UTF-8 encoded

  • Validate column names don’t contain special characters

  • Check for null values or inconsistent data types
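
A lightweight client-side pre-flight check can catch these issues before submitting a job. This is a local sketch mirroring the causes above, not the service's own validation:

```python
import csv

def preflight(path: str) -> list:
    """Decode as UTF-8, reject column names with special characters,
    and print per-column empty-value counts."""
    with open(path, encoding="utf-8", newline="") as f:  # raises on non-UTF-8 bytes
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    bad = [c for c in columns if not c.replace("_", "").isalnum()]
    if bad:
        raise ValueError(f"column names with special characters: {bad}")
    empties = {c: sum(1 for r in rows if not r[c]) for c in columns}
    print(empties)  # eyeball per-column empty/null counts
    return rows
```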

Generation Quality Issues#

Symptoms: Generated synthetic data has poor quality or many invalid records.

Possible Causes:

  • Insufficient training (low num_input_records_to_sample)

  • Temperature too high or too low

  • Data has complex patterns that need more training

Solutions:

  • Increase num_input_records_to_sample for more training

  • Adjust temperature (try 0.7-1.0 range)

  • Enable use_structured_generation for better format adherence

  • Review evaluation report for specific quality issues
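
Put together, the knobs above might be adjusted like this. The parameter names (`num_input_records_to_sample`, `temperature`, `use_structured_generation`) come from this page; the values are illustrative starting points, not official recommendations:

```python
# Illustrative generation/training overrides for quality iteration.
generation_overrides = {
    "num_input_records_to_sample": 25_000,  # more sampled records = more training signal
    "temperature": 0.8,                     # within the suggested 0.7-1.0 band
    "use_structured_generation": True,      # tighter format adherence
}
```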