Download this tutorial as a Jupyter notebook
PII Replacement Deep Dive#
Learn how to detect, redact, or replace PII without creating a fully synthetic dataset. This tutorial demonstrates using Safe Synthesizer exclusively for PII replacement, exploring the different detection methods and replacement strategies.
Prerequisites#
Completed the Safe Synthesizer 101 tutorial
Understanding of PII concepts (names, emails, addresses, etc.)
Safe Synthesizer deployment
What You’ll Learn#
Configuring PII detection methods (GLiNER, LLM, Regex)
Defining custom entity types
Implementing replacement strategies (fake, redact, hash)
Comparing detection method performance
Using PII-only workflows (without synthesis)
Understanding PII Replacement#
NeMo Safe Synthesizer supports three detection methods:
| Method | Best For | Speed | Accuracy |
|---|---|---|---|
| GLiNER | General PII detection | Fast | High |
| LLM | Context-aware, custom entities | Slow | Very High |
| Regex | Structured patterns (SSN, phone) | Very Fast | Perfect (for known formats) |
Setup#
Install the NeMo Microservices SDK
if command -v uv &> /dev/null; then
uv pip install nemo-microservices[safe-synthesizer] matplotlib
else
pip install nemo-microservices[safe-synthesizer] matplotlib
fi
Configure the client
import os
import time
import pandas as pd
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder
client = NeMoMicroservices(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)
project_name = "test-project"
try:
client.projects.create(workspace="default", name=project_name)
except Exception:
pass # Project may already exist
Configure Column Classification (Required)#
Safe Synthesizer uses LLM-based column classification to automatically detect column types and improve PII detection accuracy. You need to set up a model provider for this feature.
Set your NVIDIA API key:
import os
# Get your API key from https://build.nvidia.com/
# You can also set this as an environment variable: export NIM_API_KEY=nvapi-...
api_key = os.environ.get("NIM_API_KEY")
if not api_key:
raise ValueError(
"NIM_API_KEY is required for column classification. "
"Get your free API key from https://build.nvidia.com/"
)
# Create the API key as a secret
timestamp = int(time.time())
api_key_secret_name = f"nim-api-key-pii-tutorial-{timestamp}"
client.secrets.create(workspace="default", name=api_key_secret_name, data=api_key)
print(f"✅ Created API key secret: {api_key_secret_name}")
# Create the model provider
provider_name = f"classify-llm-pii-tutorial-{timestamp}"
client.inference.providers.create(
workspace="default",
name=provider_name,
host_url="https://integrate.api.nvidia.com/v1",
api_key_secret_name=api_key_secret_name,
description="Model provider for Safe Synthesizer column classification",
)
print(f"✓ Created model provider: {provider_name}")
print("\n✅ Column classification configured successfully")
Note
Without column classification, you’ll see errors like:
"Could not perform classify, falling back to default entities."
This causes Safe Synthesizer to use generic defaults instead of context-aware column detection, which may miss some PII or incorrectly identify columns.
Create Sample Data with PII#
Create a test dataset with various types of PII:
# Sample data with different PII types
data = {
'customer_id': [1, 2, 3, 4, 5],
'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams', 'Charlie Brown'],
'email': ['john.smith@email.com', 'jane.doe@email.com', 'bob.j@email.com',
'alice.w@email.com', 'charlie.b@email.com'],
'phone': ['555-123-4567', '555-234-5678', '555-345-6789', '555-456-7890', '555-567-8901'],
'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA',
'789 Pine Rd, Chicago, IL', '321 Elm St, Houston, TX',
'654 Maple Dr, Phoenix, AZ'],
'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012', '567-89-0123'],
'review': [
'Great product! I am 35 years old and weigh 180 lbs.',
'Love it! Perfect for my apartment in Brooklyn.',
'My wife Jane really likes this.',
'Recommended by my doctor Dr. Smith.',
'Bought this for my daughter Emily.'
]
}
df = pd.DataFrame(data)
print("📋 Sample Data with PII:")
print(df.head())
Method 1: GLiNER Detection (Recommended)#
GLiNER provides fast, accurate detection for standard PII types:
print("🔬 Method 1: GLiNER Detection")
# Configure PII replacement with GLiNER
builder_gliner = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name) # Enable column classification
.with_replace_pii(
detection_method="gliner",
entity_types=["person", "email", "phone_number", "address", "ssn"],
replacement_strategy="fake"
)
# Note: No .synthesize() - PII replacement only!
)
job_gliner = builder_gliner.create_job(name=f"pii-gliner-job-{int(time.time())}", project=project_name)
print(f"✅ GLiNER job created: {job_gliner.job_name}")
# Wait for completion
job_gliner.wait_for_completion()
print("✅ GLiNER processing complete")
# Retrieve results
redacted_df_gliner = job_gliner.fetch_data()
print("📊 GLiNER Results:")
print(redacted_df_gliner.head())
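To see where GLiNER made changes, diff the input and output frames column by column. A quick sketch, assuming the result preserves the original row order and columns:
# Count replaced values per column to see where GLiNER found PII
for col in df.columns:
    changed = (df[col].astype(str) != redacted_df_gliner[col].astype(str)).sum()
    print(f"  {col}: {changed} of {len(df)} values replaced")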
Method 2: LLM Classification#
Use an LLM for context-aware PII detection:
print("🔬 Method 2: LLM Classification")
# Configure PII replacement with LLM
builder_llm = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name) # Enable column classification
.with_replace_pii(
detection_method="llm",
entity_types=["person", "email", "location", "organization"],
replacement_strategy="fake",
llm_model="meta/llama-3.2-1b-instruct" # Specify LLM model
)
)
job_llm = builder_llm.create_job(name=f"pii-llm-job-{int(time.time())}", project=project_name)
print(f"✅ LLM job created: {job_llm.job_name}")
job_llm.wait_for_completion()
print("✅ LLM processing complete")
redacted_df_llm = job_llm.fetch_data()
print("\n📊 LLM Results:")
print(redacted_df_llm.head())
Method 3: Regex Detection#
Use regex for structured PII patterns:
print("\n🔬 Method 3: Regex Detection")
# Configure custom regex patterns
import re
regex_patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
'phone': r'\b\d{3}-\d{3}-\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b'
}
builder_regex = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name) # Enable column classification
.with_replace_pii(
detection_method="regex",
regex_patterns=regex_patterns,
replacement_strategy="redact" # Use redaction for regex
)
)
job_regex = builder_regex.create_job(name=f"pii-regex-job-{int(time.time())}", project=project_name)
print(f"✅ Regex job created: {job_regex.job_name}")
job_regex.wait_for_completion()
print("✅ Regex processing complete")
redacted_df_regex = job_regex.fetch_data()
print("\n📊 Regex Results:")
print(redacted_df_regex.head())
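Regex detection only finds what its patterns match, so it is worth sanity-checking the patterns locally before relying on the job output. A quick check using Python's re module against the original data:
# Verify each pattern actually matches something in the source data
for name, pattern in regex_patterns.items():
    hits = sum(
        bool(re.search(pattern, str(value)))
        for col in df.columns
        for value in df[col]
    )
    print(f"  {name}: matches in {hits} cells of the original data")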
Replacement Strategies#
Strategy 1: Fake Data#
Replace PII with realistic fake values:
# Fake data maintains data type and format
builder_fake = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name)
.with_replace_pii(
detection_method="gliner",
entity_types=["person", "email", "phone_number"],
replacement_strategy="fake",
fake_locale="en_US" # Specify locale for realistic fakes
)
)
job_fake = builder_fake.create_job(name=f"pii-fake-job-{int(time.time())}", project=project_name)
job_fake.wait_for_completion()
df_fake = job_fake.fetch_data()
print("📋 Fake Strategy - Original vs Replaced:")
print(f"Original email: {df['email'].iloc[0]}")
print(f"Fake email: {df_fake['email'].iloc[0]}")
Strategy 2: Redaction#
Replace PII with placeholder tokens:
# Redaction replaces with [REDACTED] or custom token
builder_redact = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name)
.with_replace_pii(
detection_method="gliner",
entity_types=["person", "email", "phone_number"],
replacement_strategy="redact",
redaction_token="[REMOVED]" # Custom redaction token
)
)
job_redact = builder_redact.create_job(name=f"pii-redact-job-{int(time.time())}", project=project_name)
job_redact.wait_for_completion()
df_redact = job_redact.fetch_data()
print("\n📋 Redaction Strategy - Original vs Replaced:")
print(f"Original name: {df['name'].iloc[0]}")
print(f"Redacted name: {df_redact['name'].iloc[0]}")
Strategy 3: Hashing#
Replace with consistent hash values:
# Hashing maintains consistency (same input = same output)
builder_hash = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name)
.with_replace_pii(
detection_method="gliner",
entity_types=["person", "email"],
replacement_strategy="hash",
hash_salt="my-secret-salt" # Consistent salt for reproducibility
)
)
job_hash = builder_hash.create_job(name=f"pii-hash-job-{int(time.time())}", project=project_name)
job_hash.wait_for_completion()
df_hash = job_hash.fetch_data()
print("\n📋 Hashing Strategy - Original vs Replaced:")
print(f"Original name: {df['name'].iloc[0]}")
print(f"Hashed name: {df_hash['name'].iloc[0]}")
print("\nNote: Same name will always hash to the same value")
Advanced Configuration#
Custom Entity Types#
Define custom PII entities specific to your domain:
# Example: Healthcare-specific entities
custom_entities = [
"patient_id",
"medical_record_number",
"prescription_number",
"insurance_id",
"diagnosis_code"
]
builder_custom = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df)
.with_classify_model_provider(provider_name)
.with_replace_pii(
detection_method="llm",
entity_types=custom_entities,
replacement_strategy="fake",
custom_entity_definitions={
"patient_id": "A unique identifier for a patient in the format P-XXXXX",
"medical_record_number": "Medical record number in format MRN-XXXXXX"
}
)
)
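Run the builder the same way as the earlier ones (the job and variable names here are just examples):
job_custom_entities = builder_custom.create_job(
    name=f"pii-custom-entities-{int(time.time())}", project=project_name
)
job_custom_entities.wait_for_completion()
df_custom_entities = job_custom_entities.fetch_data()
print(df_custom_entities.head())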
Column-Specific Configuration#
Apply different strategies to different columns:
# Use API directly for fine-grained control
job_spec = {
"data_source": "fileset://default/safe-synthesizer-inputs/data.csv",
"config": {
"enable_replace_pii": True,
"replace_pii": {
"globals": {"locales": ["en_US"]},
"steps": [
{
"rows": {
"update": [
{
"column": "email",
"entity": ["email"],
"value": "column.entity | fake"
},
{
"column": "name",
"entity": ["person"],
"value": "[REDACTED]"
},
{
"column": "ssn",
"entity": ["ssn"],
"value": "column.entity | hash"
}
]
}
}
]
}
}
}
job_custom = client.safe_synthesizer.jobs.create(
workspace="default",
name=f"column-specific-pii-{int(time.time())}",
    project=project_name,
spec=job_spec
)
Compare Detection Methods#
Analyze the differences between detection methods:
import matplotlib.pyplot as plt
# Compare detection rates
methods = ['GLiNER', 'LLM', 'Regex']
# Proxy for detections: count how many cells changed between input and output
def count_changes(original_df, processed_df):
    """Count how many values changed (assumes aligned row order and columns)"""
changes = 0
for col in original_df.columns:
if col in processed_df.columns:
changes += (original_df[col] != processed_df[col]).sum()
return changes
detections = [
count_changes(df, redacted_df_gliner),
count_changes(df, redacted_df_llm),
count_changes(df, redacted_df_regex)
]
plt.figure(figsize=(10, 5))
plt.bar(methods, detections, color=['blue', 'green', 'red'], alpha=0.7)
plt.xlabel('Detection Method')
plt.ylabel('Number of Detections')
plt.title('PII Detection Comparison')
plt.show()
print(f"\n📊 Detection Method Comparison:")
for method, count in zip(methods, detections):
print(f" {method}: {count} changes detected")
Validation and Quality Checks#
Verify PII has been properly removed:
import re
# Check for remaining PII patterns
def check_for_pii(df, patterns):
"""Check if any PII patterns remain in the data"""
found_pii = []
for col in df.columns:
for idx, value in enumerate(df[col]):
if pd.notna(value):
value_str = str(value)
for pattern_name, pattern in patterns.items():
if re.search(pattern, value_str):
found_pii.append({
'column': col,
'row': idx,
'pattern': pattern_name,
'value': value_str
})
return found_pii
# Define PII patterns to check
pii_patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
'phone': r'\b\d{3}-\d{3}-\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b'
}
# Check original data
original_pii = check_for_pii(df, pii_patterns)
print(f"\n🔍 PII in original data: {len(original_pii)} instances")
# Check the processed (fake-replaced) data
processed_pii = check_for_pii(df_fake, pii_patterns)
print(f"🔍 PII in processed data: {len(processed_pii)} instances")
if processed_pii:
    print("\n⚠️ Warning: Some PII may remain. Review configuration.")
    for item in processed_pii[:5]:  # Show first 5
        print(f"  - {item}")
else:
    print("\n✅ No PII patterns detected in processed data")
Best Practices#
1. Layer Detection Methods#
Combine multiple methods for comprehensive coverage:
# Multi-layered approach
builder_layered = (
SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_classify_model_provider(provider_name)
    .with_replace_pii(
detection_methods=[
{"method": "regex", "priority": 1}, # First pass: structured patterns
{"method": "gliner", "priority": 2}, # Second pass: general PII
{"method": "llm", "priority": 3} # Third pass: contextual PII
]
)
)
2. Test with Sample Data#
Always test on a small sample first:
# Test on small sample
df_sample = df.head(10)
builder_test = (
SafeSynthesizerJobBuilder(client)
.with_data_source(df_sample)
.with_classify_model_provider(provider_name)
.with_replace_pii()
)
job_test = builder_test.create_job(name=f"pii-sample-test-{int(time.time())}", project=project_name)
job_test.wait_for_completion()
# Review results before processing full dataset
df_test_result = job_test.fetch_data()
print("Review sample results before proceeding with full dataset")
3. Document Replacement Strategy#
Keep records of what was replaced:
# Create audit log
audit_log = {
'job_name': job_gliner.job_name,
'date': pd.Timestamp.now(),
'method': 'gliner',
'strategy': 'fake',
'entities': ['person', 'email', 'phone_number'],
'records_processed': len(df),
'records_modified': count_changes(df, redacted_df_gliner)
}
print("\n📝 Audit Log:")
print(audit_log)
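To make the record durable, write the entry to a JSON file alongside your other pipeline artifacts (default=str handles the Timestamp and any NumPy integers):
import json

with open(f"pii_audit_{audit_log['job_name']}.json", "w") as f:
    json.dump(audit_log, f, indent=2, default=str)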
Next Steps#
PII Replacement - PII replacement concepts
Differential Privacy Tutorial - Add DP guarantees
About Safe Synthesizer - Core concepts and components
Troubleshooting#
PII Not Detected#
Try different detection methods
Add custom entity definitions
Use LLM classification for context-aware detection
Review entity type names (case-sensitive)
Incorrect Replacements#
Adjust confidence thresholds
Use more specific entity types
Combine with regex for known patterns
Test on sample data first
Performance Issues#
Use GLiNER for standard PII (fastest)
Reserve LLM for complex cases only
Process large datasets in batches (see the sketch below)
Use regex for structured patterns
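For the batching point above, a minimal sketch: split the frame into chunks, run one PII job per chunk with the same configuration, and concatenate the results. Here df stands in for your large dataset, and the batch size is a tuning knob for your deployment's limits:
batch_size = 10_000  # tune to your deployment
results = []
for start in range(0, len(df), batch_size):
    chunk = df.iloc[start:start + batch_size]
    builder = (
        SafeSynthesizerJobBuilder(client)
        .with_data_source(chunk)
        .with_classify_model_provider(provider_name)
        .with_replace_pii(detection_method="gliner", replacement_strategy="fake")
    )
    job = builder.create_job(
        name=f"pii-batch-{start // batch_size}-{int(time.time())}", project=project_name
    )
    job.wait_for_completion()
    results.append(job.fetch_data())
df_processed = pd.concat(results, ignore_index=True)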
For more help, see PII Replacement.