Download this tutorial as a Jupyter notebook
Differential Privacy Deep Dive#
Learn how to apply differential privacy to generate synthetic data with mathematical privacy guarantees. This tutorial explores the privacy-utility tradeoff and demonstrates how to configure differential privacy parameters to balance privacy and data quality.
Prerequisites#
Completed the Safe Synthesizer 101 tutorial
Understanding of basic privacy concepts
Safe Synthesizer deployment with GPU resources
What You’ll Learn#
Understanding differential privacy concepts (epsilon, delta)
Configuring privacy hyperparameters
Analyzing privacy-utility tradeoffs
Interpreting privacy metrics in evaluation reports
Understanding Differential Privacy#
Differential privacy (DP) provides mathematical guarantees that synthetic data doesn’t reveal information about individual records in the training data.
Key Concepts:
Epsilon (ε): Privacy budget - lower values mean stronger privacy
ε = 1: Very strong privacy
ε = 6-10: Moderate privacy
ε > 10: Weak privacy
Recommended starting range: ε ∈ [8, 12] - adjust downward based on privacy needs
Delta (δ): Probability of privacy breach
Typically set to 1/n^1.2 where n is dataset size
Use "auto" for automatic calculation (recommended)
Manual values typically fall between 1e-6 and 1e-4 (see the worked example after this list)
Noise: Random noise added during training to prevent memorization
Calibrated based on epsilon, delta, and gradient clipping threshold
Higher privacy (lower epsilon) requires more noise
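To make the delta heuristic concrete, the snippet below evaluates the 1/n^1.2 rule for a few illustrative dataset sizes. This is a minimal sketch; in practice, passing delta="auto" lets the service compute delta for you.
# Worked example of the delta heuristic: delta = 1 / n**1.2
# (illustrative only; delta="auto" computes this automatically)
def recommended_delta(n_records: int) -> float:
    """Rule-of-thumb delta for a dataset with n_records rows."""
    return 1 / (n_records ** 1.2)

for n in (5_000, 10_000, 50_000):
    print(f"n = {n:>6,} -> recommended delta ≈ {recommended_delta(n):.2e}")
# All three values fall inside the typical 1e-6 to 1e-4 range mentioned above.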
Record-Level vs Group-Level Privacy#
By default, NeMo Safe Synthesizer uses record-level differential privacy, which protects individual records. For datasets where multiple records belong to the same entity (e.g., a patient with multiple visits), you can use group-level privacy:
# Group-level privacy for multi-record entities
builder_grouped = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_train(
group_training_examples_by="patient_id" # Group records by patient
)
.with_differential_privacy(epsilon=8.0)
.synthesize()
)
When to use group-level privacy (see the quick data check after this list):
Multiple records per person/entity in your dataset
Privacy guarantees should apply to entire entities, not individual records
Examples: patient medical histories, customer transaction logs
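Before enabling group-level privacy, it helps to verify that entities really do contribute multiple records. A minimal pandas sketch, assuming a hypothetical patient_id column (the demo dataset used later in this tutorial does not contain one):
# Count how many records each entity contributes (patient_id is a hypothetical column)
records_per_entity = df.groupby("patient_id").size()

print(f"Entities: {len(records_per_entity)}")
print(f"Max records per entity: {records_per_entity.max()}")
print(f"Share of entities with >1 record: {(records_per_entity > 1).mean():.1%}")
# If a meaningful share of entities has more than one record, group-level privacy is the safer choice.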
Setup#
Install the NeMo Microservices SDK with Safe Synthesizer support:
if command -v uv &> /dev/null; then
uv pip install nemo-microservices[safe-synthesizer] kagglehub matplotlib
else
pip install nemo-microservices[safe-synthesizer] kagglehub matplotlib
fi
import os
import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder
# Configure client
client = NeMoMicroservices(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)
Load and Prepare Data#
# Load sample dataset
import kagglehub # type: ignore[import-not-found]
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
print(f"Dataset size: {len(df)} records")
print(f"Recommended delta: {1 / (len(df) ** 2):.2e}")
Experiment 1: No Differential Privacy (Baseline)#
First, create a baseline without differential privacy:
import time
print("🔬 Experiment 1: No Differential Privacy (Baseline)")
builder_baseline = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_replace_pii()
.synthesize()
)
# Create a project for our jobs (creates if it doesn't exist)
project_name = "test-project"
try:
    client.projects.create(workspace="default", name=project_name)
except Exception:
    pass  # Project may already exist
job_baseline = builder_baseline.create_job(name=f"dp-baseline-{int(time.time())}", project="test-project")
print(f"✅ Baseline job created: {job_baseline.job_name}")
job_baseline.wait_for_completion()
summary_baseline = job_baseline.fetch_summary()
print(f"\n📊 Baseline Results:")
print(f" SQS (Quality): {summary_baseline.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_baseline.data_privacy_score}")
Experiment 2: Moderate Privacy (ε=6)#
Apply moderate differential privacy:
print("\n🔬 Experiment 2: Moderate Privacy (ε=6)")
builder_moderate = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_replace_pii()
.with_differential_privacy(
epsilon=6.0,
delta=1e-5
)
.synthesize()
)
job_moderate = builder_moderate.create_job(name=f"dp-moderate-{int(time.time())}", project="test-project")
print(f"✅ Moderate privacy job created: {job_moderate.job_name}")
job_moderate.wait_for_completion()
summary_moderate = job_moderate.fetch_summary()
print(f"\n📊 Moderate Privacy Results:")
print(f" SQS (Quality): {summary_moderate.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_moderate.data_privacy_score}")
Experiment 3: Strong Privacy (ε=1)#
Apply strong differential privacy:
print("\n🔬 Experiment 3: Strong Privacy (ε=1)")
builder_strong = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_replace_pii()
.with_differential_privacy(
epsilon=1.0,
delta=1e-5
)
.synthesize()
)
job_strong = builder_strong.create_job(name=f"dp-strong-{int(time.time())}", project="test-project")
print(f"✅ Strong privacy job created: {job_strong.job_name}")
job_strong.wait_for_completion()
summary_strong = job_strong.fetch_summary()
print(f"\n📊 Strong Privacy Results:")
print(f" SQS (Quality): {summary_strong.synthetic_data_quality_score}")
print(f" DPS (Privacy): {summary_strong.data_privacy_score}")
Compare Results#
Visualize the privacy-utility tradeoff:
import matplotlib.pyplot as plt
experiments = ['Baseline\n(No DP)', 'Moderate\n(ε=6)', 'Strong\n(ε=1)']
sqs_scores = [
summary_baseline.synthetic_data_quality_score,
summary_moderate.synthetic_data_quality_score,
summary_strong.synthetic_data_quality_score
]
dps_scores = [
summary_baseline.data_privacy_score,
summary_moderate.data_privacy_score,
summary_strong.data_privacy_score
]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# SQS comparison
ax1.bar(experiments, sqs_scores, color=['blue', 'green', 'red'], alpha=0.7)
ax1.set_ylabel('Score')
ax1.set_title('Synthetic Quality Score (SQS)')
ax1.set_ylim([0, 100])
ax1.axhline(y=70, color='gray', linestyle='--', label='Good threshold')
ax1.legend()
# DPS comparison
ax2.bar(experiments, dps_scores, color=['blue', 'green', 'red'], alpha=0.7)
ax2.set_ylabel('Score')
ax2.set_title('Data Privacy Score (DPS)')
ax2.set_ylim([0, 100])
ax2.axhline(y=70, color='gray', linestyle='--', label='Good threshold')
ax2.legend()
plt.tight_layout()
plt.show()
print("\n📈 Privacy-Utility Tradeoff Summary:")
print(f"{'Experiment':<20} {'SQS (Utility)':<15} {'DPS (Privacy)':<15}")
print("-" * 50)
for i, exp in enumerate(experiments):
    label = exp.replace("\n", " ")
    print(f"{label:<20} {sqs_scores[i]:<15.1f} {dps_scores[i]:<15.1f}")
Advanced Configuration#
Custom Privacy Budget#
Configure differential privacy with custom parameters:
from nemo_microservices.beta.safe_synthesizer.config import DifferentialPrivacyHyperparams
# Create custom privacy configuration
privacy_config = DifferentialPrivacyHyperparams(
dp_enabled=True,
epsilon=3.0,
delta=1e-5,
per_sample_max_grad_norm=1.0, # Gradient clipping threshold
)
# Use with SafeSynthesizerJobBuilder
builder_custom = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_replace_pii()
.with_differential_privacy(config=privacy_config)
.synthesize()
)
Privacy Budget Composition#
When running multiple experiments, the privacy budget compounds:
# Total privacy budget across experiments
total_epsilon = 0.0 # No DP baseline
total_epsilon += 6.0 # Moderate privacy
total_epsilon += 1.0 # Strong privacy
print(f"\n🔐 Total Privacy Budget Consumed: ε = {total_epsilon}")
print("Note: Each additional release compounds the privacy budget")
print("Best practice: Only release one synthetic dataset per original dataset")
Interpreting Privacy Metrics#
Membership Inference Attack (MIA)#
Measures if an attacker can determine whether a record was in training data:
# Fetch detailed evaluation reports
baseline_report = job_baseline.fetch_summary()
moderate_report = job_moderate.fetch_summary()
print("\n🛡️ Membership Inference Protection:")
print(f"Baseline: {baseline_report.membership_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.membership_inference_protection_score}")
print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Score > 0.5 means attacker cannot reliably identify training records")
Attribute Inference Attack (AIA)#
Measures if sensitive attributes can be inferred from other attributes:
print("\n🔍 Attribute Inference Protection:")
print(f"Baseline: {baseline_report.attribute_inference_protection_score}")
print(f"Moderate (ε=6): {moderate_report.attribute_inference_protection_score}")
print("\nInterpretation:")
print("- Higher score = Better protection")
print("- Measures difficulty of inferring sensitive values from known attributes")
Best Practices#
Data Size Requirements#
Differential privacy works best with larger datasets:
def check_data_requirements(dataset_size):
    """Check if dataset size is suitable for DP."""
    print(f"📏 Dataset Size Analysis: {dataset_size} records")
    if dataset_size >= 10000:
        print("✅ Excellent - Dataset size is ideal for DP")
        print("   Expected: Good quality with ε ∈ [8, 12]")
    elif dataset_size >= 5000:
        print("⚠️ Moderate - Dataset may work with DP")
        print("   Recommendation: Start with higher epsilon (ε=10-12)")
    else:
        print("❌ Small - DP may significantly reduce quality")
        print("   Consider: Collecting more data or proceeding without DP")
    print(f"\n   Recommended delta: {1 / (dataset_size ** 1.2):.2e}")
check_data_requirements(len(df))
Guidelines:
10,000+ records: Ideal for differential privacy
5,000-10,000 records: May work, use higher epsilon
< 5,000 records: Consider quality trade-offs carefully
Choosing Epsilon#
def recommend_epsilon(dataset_size, sensitivity):
    """
    Recommend epsilon based on dataset characteristics.

    Args:
        dataset_size: Number of records
        sensitivity: 'high' for medical/financial, 'medium' for general, 'low' for public
    """
    recommendations = {
        'high': (1.0, 3.0),
        'medium': (3.0, 6.0),
        'low': (6.0, 10.0)
    }
    epsilon_range = recommendations[sensitivity]
    delta = 1 / (dataset_size ** 1.2)
    print(f"📋 Recommendations for {dataset_size} records, {sensitivity} sensitivity:")
    print(f"   Epsilon range: {epsilon_range[0]} - {epsilon_range[1]}")
    print(f"   Delta: {delta:.2e}")
    print("   Stronger privacy: Use lower epsilon within range")
    print("   Better utility: Use higher epsilon within range")
    print(f"\n   Starting point: ε = {(epsilon_range[0] + epsilon_range[1]) / 2:.1f}")
recommend_epsilon(len(df), 'medium')
Explicit Epsilon Guidance:
Start at ε ∈ [8, 12] for most use cases
Reduce epsilon gradually if stronger privacy is required
Monitor SQS scores to understand quality impact
Delta calculation: use "auto" or 1/n^1.2, where n is the dataset size
Training Optimization#
Differential privacy training requires special considerations:
# Optimal DP training configuration
builder_optimized = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_train(
batch_size=256, # Larger batch sizes benefit DP
num_epochs=10, # May need more epochs for convergence
)
.with_differential_privacy(
epsilon=8.0,
delta="auto",
per_sample_max_grad_norm=1.0
)
.synthesize()
)
Training Tips:
Use larger batch sizes - DP benefits from larger batches (reduces noise variance)
Default batch size may be too small for optimal DP training
Try batch_size=256 or 512 if GPU memory allows
If memory errors occur, reduce batch size gradually
Monitor convergence - DP training may converge differently
Watch training and validation loss
May require more epochs than non-DP training
Lower learning rate if training is unstable
Adjust gradient clipping - Controls sensitivity bound
per_sample_max_grad_norm=1.0 is a good default
Lower values (e.g., 0.5) = stronger clipping, more privacy, potentially lower quality
Higher values (e.g., 1.5) = less clipping, less privacy, potentially better quality
Privacy Budget Management#
Single Release: Only release one synthetic dataset per original dataset
Composition: If multiple releases are needed, divide the privacy budget across them (see the tracking sketch after this list)
Documentation: Track all data releases and cumulative privacy budget
Renewal: Privacy budget doesn’t reset - consider this in data lifecycle
Testing: Test with higher epsilon before final release with lower epsilon
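The bookkeeping itself can be a simple running total. Below is a minimal sketch of a privacy-budget ledger; the EpsilonLedger class, the budget value, and the release names are illustrative and not part of the Safe Synthesizer SDK.
# Minimal privacy-budget ledger (illustrative; not part of the Safe Synthesizer SDK)
class EpsilonLedger:
    def __init__(self, total_budget: float):
        self.total_budget = total_budget
        self.releases = []  # (name, epsilon) pairs

    def record_release(self, name: str, epsilon: float) -> None:
        self.releases.append((name, epsilon))

    @property
    def spent(self) -> float:
        # Basic (sequential) composition: epsilons add up across releases
        return sum(eps for _, eps in self.releases)

    @property
    def remaining(self) -> float:
        return self.total_budget - self.spent


ledger = EpsilonLedger(total_budget=10.0)  # hypothetical organizational budget
ledger.record_release("moderate-privacy-release", epsilon=6.0)
ledger.record_release("strong-privacy-release", epsilon=1.0)
print(f"Spent: ε = {ledger.spent}, remaining: ε = {ledger.remaining}")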
Troubleshooting#
Low SQS with DP Enabled#
If synthetic quality drops significantly:
# Try these approaches:
# 1. Increase epsilon (reduce privacy slightly)
# 2. Increase training data size
# 3. Increase training epochs
# 4. Adjust per_sample_max_grad_norm for gradient clipping
builder_improved = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_train(
num_epochs=10 # More training
)
.with_replace_pii()
.with_differential_privacy(
epsilon=6.0, # Slightly higher
delta=1e-5,
per_sample_max_grad_norm=1.5 # Less aggressive clipping
)
.synthesize()
)
Next Steps#
Data Synthesis - Learn more about synthesis concepts
Evaluation - Privacy evaluation concepts
PII Replacement Tutorial - Combine with PII protection
Resources#
About Safe Synthesizer - Core concepts and components