Download this tutorial as a Jupyter notebook

Differential Privacy Deep Dive#

Learn how to apply differential privacy to achieve the strongest level of privacy protection, backed by mathematical guarantees. This tutorial explores the privacy-utility tradeoff and demonstrates how to configure differential privacy parameters for your data.

Prerequisites#

  • Completed the Safe Synthesizer 101 tutorial

  • Understanding of basic privacy concepts

  • Safe Synthesizer deployment with GPU resources


What You’ll Learn#

  • Understanding differential privacy concepts (epsilon, delta)

  • Configuring privacy hyperparameters

  • Analyzing privacy-utility tradeoffs

  • Interpreting privacy metrics in evaluation reports


Understanding Differential Privacy#

Differential privacy (DP) provides a mathematical guarantee that limits how much the synthetic data can reveal about any individual record in the training data.

Key Concepts:

  • Epsilon (ε): Privacy budget - lower values mean stronger privacy

    • ε = 1: Very strong privacy

    • ε = 6-10: Moderate privacy

    • ε > 10: Weak privacy

    • Recommended starting range: ε ∈ [8, 12] - adjust downward based on privacy needs

  • Delta (δ): Probability of privacy breach

    • Typically set to 1/n^1.2, where n is the dataset size (see the sketch after this list)

    • Use "auto" for automatic calculation (recommended)

    • Manual values typically between 1e-6 and 1e-4

  • Noise: Random noise added during training to prevent memorization

    • Calibrated based on epsilon, delta, and gradient clipping threshold

    • Higher privacy (lower epsilon) requires more noise
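As a quick sanity check on these defaults, the snippet below is a minimal sketch (plain Python, no Safe Synthesizer calls) that computes the recommended δ = 1/n^1.2 for a few dataset sizes; note how smaller datasets imply a larger δ, while a lower ε always implies more noise during training:

# Illustrative only: recommended delta (1 / n**1.2) for a few dataset sizes
for n in (1_000, 10_000, 100_000):
    delta = 1 / (n ** 1.2)
    print(f"n = {n:>7,} records -> recommended delta ≈ {delta:.2e}")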

Record-Level vs Group-Level Privacy#

By default, NeMo Safe Synthesizer uses record-level differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use group-level privacy:

# Group-level privacy for multi-record entities
builder_grouped = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        group_training_examples_by="patient_id"  # Group records by patient
    )
    .with_differential_privacy(epsilon=8.0)
    .synthesize()
)

When to use group-level privacy (a quick per-entity check is sketched after this list):

  • Multiple records per person/entity in your dataset

  • Privacy guarantees should apply to entire entities, not individual records

  • Examples: patient medical histories, customer transaction logs
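Before opting into group-level privacy, it helps to check how many records each entity contributes. A minimal sketch, assuming a hypothetical patient_id column in your DataFrame:

# Illustrative check, assuming a hypothetical "patient_id" column exists in df.
# If most entities contribute more than one record, group-level privacy is worth considering.
records_per_entity = df.groupby("patient_id").size()
print(f"Entities: {records_per_entity.shape[0]}")
print(f"Max records per entity: {records_per_entity.max()}")
print(f"Share of entities with >1 record: {(records_per_entity > 1).mean():.1%}")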


Setup#

Install the NeMo Microservices SDK with Safe Synthesizer support:

if command -v uv &> /dev/null; then
    uv pip install nemo-microservices[safe-synthesizer] kagglehub matplotlib
else
    pip install nemo-microservices[safe-synthesizer] kagglehub matplotlib
fi
import os
import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder

# Configure client
client = NeMoMicroservices(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)

Load and Prepare Data#

# Load sample dataset
import kagglehub  # type: ignore[import-not-found]
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)

print(f"Dataset size: {len(df)} records")
print(f"Recommended delta: {1 / (len(df) ** 2):.2e}")

Experiment 1: No Differential Privacy (Baseline)#

First, create a baseline without differential privacy:

import time

print("🔬 Experiment 1: No Differential Privacy (Baseline)")

builder_baseline = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)

# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist

job_baseline = builder_baseline.create_job(name=f"dp-baseline-{int(time.time())}", project=project_name)
print(f"✅ Baseline job created: {job_baseline.job_name}")

job_baseline.wait_for_completion()
summary_baseline = job_baseline.fetch_summary()

print(f"\n📊 Baseline Results:")
print(f"  SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_baseline.data_privacy_score}")

Experiment 2: Moderate Privacy (ε=6)#

Apply moderate differential privacy:

print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")

builder_moderate = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,
        delta=1e-5
    )
    .synthesize()
)

job_moderate = builder_moderate.create_job(name=f"dp-moderate-{int(time.time())}", project=project_name)
print(f"✅ Moderate privacy job created: {job_moderate.job_name}")

job_moderate.wait_for_completion()
summary_moderate = job_moderate.fetch_summary()

print(f"\n📊 Moderate Privacy Results:")
print(f"  SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_moderate.data_privacy_score}")

Experiment 3: Strong Privacy (ε=1)#

Apply strong differential privacy:

print("\n🔬 Experiment 3: Strong Privacy (ε=1)")

builder_strong = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=1.0,
        delta=1e-5
    )
    .synthesize()
)

job_strong = builder_strong.create_job(name=f"dp-strong-{int(time.time())}", project=project_name)
print(f"✅ Strong privacy job created: {job_strong.job_name}")

job_strong.wait_for_completion()
summary_strong = job_strong.fetch_summary()

print(f"\n📊 Strong Privacy Results:")
print(f"  SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f"  DPS (Privacy): {summary_strong.data_privacy_score}")

Compare Results#

Visualize the privacy-utility tradeoff:

import matplotlib.pyplot as plt

experiments = ['Baseline\n(No DP)', 'Moderate\n(ε=6)', 'Strong\n(ε=1)']
sqs_scores = [
    summary_baseline.synthetic_data_quality_score,
    summary_moderate.synthetic_data_quality_score,
    summary_strong.synthetic_data_quality_score
]
dps_scores = [
    summary_baseline.data_privacy_score,
    summary_moderate.data_privacy_score,
    summary_strong.data_privacy_score
]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# SQS comparison
ax1.bar(experiments, sqs_scores, color=['blue', 'green', 'red'], alpha=0.7)
ax1.set_ylabel('Score')
ax1.set_title('Synthetic Quality Score (SQS)')
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color='gray', linestyle='--', label='Good threshold')
ax1.legend()

# DPS comparison
ax2.bar(experiments, dps_scores, color=['blue', 'green', 'red'], alpha=0.7)
ax2.set_ylabel('Score')
ax2.set_title('Data Privacy Score (DPS)')
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color='gray', linestyle='--', label='Good threshold')
ax2.legend()

plt.tight_layout()
plt.show()

print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for i, exp in enumerate(experiments):
    label = exp.replace("\n", " ")  # Flatten the multi-line chart labels for the table
    print(f"{label:<20} {sqs_scores[i]:<15.1f} {dps_scores[i]:<15.1f}")

Advanced Configuration#

Custom Privacy Budget#

Configure differential privacy with custom parameters:

from nemo_microservices.beta.safe_synthesizer.config import DifferentialPrivacyHyperparams

# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
    dp_enabled=True,
    epsilon=3.0,
    delta=1e-5,
    per_sample_max_grad_norm=1.0,  # Gradient clipping threshold
)

# Use with SafeSynthesizerJobBuilder
builder_custom = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .with_differential_privacy(config=privacy_config)
    .synthesize()
)

Privacy Budget Composition#

When running multiple experiments, the privacy budget compounds:

# Total privacy budget across the DP experiments
# (the non-DP baseline carries no finite epsilon and is not covered by a DP guarantee)
total_epsilon = 0.0
total_epsilon += 6.0  # Moderate privacy (Experiment 2)
total_epsilon += 1.0  # Strong privacy (Experiment 3)

print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release compounds the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")

Interpreting Privacy Metrics#

Membership Inference Attack (MIA)#

Measures if an attacker can determine whether a record was in training data:

# Fetch the evaluation summaries for comparison
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()

print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")

Attribute Inference Attack (AIA)#

Measures if sensitive attributes can be inferred from other attributes:

print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")

print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")

Best Practices#

Data Size Requirements#

Differential privacy works best with larger datasets:

def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")

    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print("   Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️  Moderate - Dataset may work with DP")
        print("   Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print("   Consider: Collecting more data or using DP without")

    print(f"\n   Recommended delta: {1 / (dataset_size ** 1.2):.2e}")

check_data_requirements(len(df))

Guidelines:

  • 10,000+ records: Ideal for differential privacy

  • 5,000-10,000 records: May work, use higher epsilon

  • < 5,000 records: Consider quality trade-offs carefully

Choosing Epsilon#

def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {
        'high': (1.0, 3.0),
        'medium': (3.0, 6.0),
        'low': (6.0, 10.0)
    }

    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size ** 1.2)

    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f"  Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f"  Delta: {delta:.2e}")
    print(f"  Stronger privacy: Use lower epsilon within range")
    print(f"  Better utility: Use higher epsilon within range")
    print(f"\n  Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")

recommend_epsilon(len(df), 'medium')

Explicit Epsilon Guidance:

  • Start at ε ∈ [8, 12] for most use cases

  • Reduce epsilon gradually if stronger privacy is required

  • Monitor SQS scores to understand quality impact (an epsilon sweep is sketched after this list)

  • Delta calculation: use "auto" or 1/n^1.2 where n is dataset size
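This guidance can be turned into a small sweep. The sketch below reuses the builder pattern from the experiments earlier in this tutorial to train at a few candidate epsilon values and record SQS/DPS for each; treat it as a template rather than a prescribed workflow, and remember that every additional run consumes more privacy budget:

# Illustrative epsilon sweep using the same builder pattern as the experiments above.
# Each run consumes additional privacy budget, so sweep on non-sensitive or held-out data.
results = []
for eps in (12.0, 8.0, 4.0):
    builder = (
        SafeSynthesizerJobBuilder(client)
        .with_data_source(df)
        .with_replace_pii()
        .with_differential_privacy(epsilon=eps, delta="auto")
        .synthesize()
    )
    job = builder.create_job(name=f"dp-sweep-eps{int(eps)}-{int(time.time())}", project=project_name)
    job.wait_for_completion()
    summary = job.fetch_summary()
    results.append((eps, summary.synthetic_data_quality_score, summary.data_privacy_score))

for eps, sqs, dps in results:
    print(f"epsilon={eps:>5}: SQS={sqs}, DPS={dps}")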

Training Optimization#

Differential privacy training requires special considerations:

# Optimal DP training configuration
builder_optimized = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        batch_size=256,  # Larger batch sizes benefit DP
        num_epochs=10,   # May need more epochs for convergence
    )
    .with_differential_privacy(
        epsilon=8.0,
        delta="auto",
        per_sample_max_grad_norm=1.0
    )
    .synthesize()
)

Training Tips:

  1. Use larger batch sizes - DP benefits from larger batches (reduces noise variance)

    • Default batch size may be too small for optimal DP training

    • Try batch_size=256 or 512 if GPU memory allows

    • If memory errors occur, reduce batch size gradually

  2. Monitor convergence - DP training may converge differently

    • Watch training and validation loss

    • May require more epochs than non-DP training

    • Lower learning rate if training is unstable

  3. Adjust gradient clipping - Controls sensitivity bound

    • per_sample_max_grad_norm=1.0 is a good default

    • Lower values (0.5) = stronger clipping, more privacy, potentially lower quality

    • Higher values (1.5) = less clipping, less privacy, potentially better quality

Privacy Budget Management#

  1. Single Release: Only release one synthetic dataset per original dataset

  2. Composition: If multiple releases needed, divide privacy budget accordingly

  3. Documentation: Track all data releases and the cumulative privacy budget (see the sketch after this list)

  4. Renewal: The privacy budget does not reset over time - account for this in your data lifecycle planning

  5. Testing: Test with higher epsilon before final release with lower epsilon
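To make the documentation practice concrete, the sketch below shows one hypothetical way to log releases and their cumulative epsilon per source dataset; it is illustrative bookkeeping only, not part of the Safe Synthesizer API:

# Hypothetical release log: illustrative bookkeeping only, not a Safe Synthesizer feature.
release_log = []

def record_release(dataset_name, epsilon, delta, job_name):
    """Append a release and report the cumulative epsilon spent on that dataset."""
    release_log.append({"dataset": dataset_name, "epsilon": epsilon, "delta": delta, "job": job_name})
    spent = sum(r["epsilon"] for r in release_log if r["dataset"] == dataset_name)
    print(f"{dataset_name}: cumulative epsilon = {spent}")

record_release("clothing-reviews", 6.0, 1e-5, job_moderate.job_name)
record_release("clothing-reviews", 1.0, 1e-5, job_strong.job_name)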


Troubleshooting#

Low SQS with DP Enabled#

If synthetic quality drops significantly:

# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping

builder_improved = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_epochs=10  # More training
    )
    .with_replace_pii()
    .with_differential_privacy(
        epsilon=6.0,  # Slightly higher
        delta=1e-5,
        per_sample_max_grad_norm=1.5  # Less aggressive clipping
    )
    .synthesize()
)

Next Steps#


Resources#