PII Replacement#
PII (Personally Identifiable Information) replacement is a privacy protection step that detects and replaces sensitive information in your datasets before synthesis. Because the model is trained only on the replaced values, it cannot learn the original names, addresses, and other identifiers that were detected.
How It Works#
The PII replacement pipeline operates in multiple stages:
Detection: Identifies PII entities using configurable detection methods
Classification: Categorizes detected entities by type (name, email, address, etc.)
Transformation: Replaces or redacts PII using configurable rules
Validation: Verifies that sensitive information has been properly handled
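The four stages can be sketched end to end in a few lines of Python. This is an illustrative toy, not NeMo Safe Synthesizer's implementation: the `Entity` class, the single email regex, the function names, and the `<EMAIL>` placeholder are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for illustration only; it covers simple addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

@dataclass
class Entity:
    start: int
    end: int
    text: str
    label: str = ""

def detect(text):
    """Detection: find candidate PII spans."""
    return [Entity(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]

def classify(entities):
    """Classification: assign a type to each span (this toy has one type)."""
    for e in entities:
        e.label = "email"
    return entities

def transform(text, entities):
    """Transformation: replace spans right to left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + f"<{e.label.upper()}>" + text[e.end:]
    return text

def validate(text):
    """Validation: confirm that no detectable PII remains."""
    return not detect(text)

raw = "Contact ada@example.com or bob@example.org for access."
clean = transform(raw, classify(detect(raw)))
print(clean)            # Contact <EMAIL> or <EMAIL> for access.
print(validate(clean))  # True
```

Note that validation here can only confirm the absence of what the detector is able to find; it says nothing about PII the detector misses.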
Detection Methods#
NeMo Safe Synthesizer supports multiple PII detection approaches:
GLiNER Detection#
Uses the GLiNER model for entity recognition:
Zero-shot entity detection
Supports custom entity types
High accuracy for standard PII categories
Configurable confidence thresholds
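To show what a confidence threshold does in practice, the sketch below filters a list of scored detections. The detection dictionaries, their scores, and the `filter_by_threshold` helper are hypothetical and invented for illustration; they are not output from GLiNER or part of NeMo Safe Synthesizer.

```python
# Hypothetical detections; texts, labels, and scores are made up.
detections = [
    {"text": "Ada Lovelace", "label": "name", "score": 0.97},
    {"text": "London", "label": "address", "score": 0.41},
    {"text": "ada@example.com", "label": "email", "score": 0.88},
]

def filter_by_threshold(detections, default=0.5, per_label=None):
    """Keep detections whose score meets the threshold for their label."""
    per_label = per_label or {}
    return [
        d for d in detections
        if d["score"] >= per_label.get(d["label"], default)
    ]

# A stricter threshold for emails drops the 0.88 detection.
kept = filter_by_threshold(detections, default=0.5, per_label={"email": 0.9})
print([d["text"] for d in kept])  # ['Ada Lovelace']
```

Raising a threshold reduces false positives but lets more real PII through undetected, so for privacy-sensitive data a lower threshold is usually the safer starting point.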
LLM Classification#
Leverages language models for PII detection:
Contextual understanding of entities
Handles complex PII patterns
Flexible entity definitions
Configurable prompts and models
Regex Detection#
Pattern-based detection for structured PII:
Fast and deterministic
Ideal for known formats (SSN, phone numbers)
Customizable patterns
Low computational overhead
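As a minimal sketch of pattern-based detection, the example below uses Python's standard `re` module to find US-style SSNs and phone numbers. The patterns and the `find_pii` helper are assumptions for illustration and are not the patterns shipped with NeMo Safe Synthesizer; real-world formats vary more than these cover.

```python
import re

# Illustrative patterns only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}

def find_pii(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((m.start(), label, m.group()) for m in pattern.finditer(text))
    return [(label, value) for _, label, value in sorted(hits)]

sample = "SSN 123-45-6789, call (555) 010-4477 or 555-010-9921."
print(find_pii(sample))
```

The same input always yields the same matches, which is what makes this approach deterministic; the trade-off is that any value written in a format the pattern does not anticipate is missed.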
Replacement Strategies#
After detection, PII can be handled in multiple ways:
Fake Data: Generate realistic replacements using the Faker library
Redaction: Replace values with placeholder tokens
Hashing: Apply a one-way hash so the same input always maps to the same output
Custom Rules: Define your own transformation logic
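The strategies above can be sketched with the Python standard library. Everything named here (`redact`, `hash_value`, `mask_email`, `STRATEGIES`, `SECRET_KEY`) is hypothetical and not part of NeMo Safe Synthesizer. The hashing step uses a keyed HMAC, which is a choice made for this sketch rather than documented behavior: unkeyed hashes of low-entropy values such as SSNs can be reversed by brute force. Fake-data generation is omitted to keep the example dependency-free.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for the example

def redact(value, label):
    """Redaction: swap the value for a placeholder token."""
    return f"<{label.upper()}>"

def hash_value(value, key=SECRET_KEY, length=12):
    """Hashing: keyed one-way hash; equal inputs give equal outputs."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

def mask_email(value, label):
    """Custom rule: keep the domain, hide the local part."""
    _, _, domain = value.partition("@")
    return f"***@{domain}"

# Map entity types to strategies; unlisted types fall back to hashing.
STRATEGIES = {"ssn": redact, "email": mask_email}

def replace(value, label):
    """Apply the strategy registered for this entity type."""
    strategy = STRATEGIES.get(label)
    return strategy(value, label) if strategy else hash_value(value)

print(replace("123-45-6789", "ssn"))        # <SSN>
print(replace("ada@example.com", "email"))  # ***@example.com
print(replace("Ada Lovelace", "name") == replace("Ada Lovelace", "name"))  # True
```

Consistent hashing preserves joins and repeated references across rows, whereas redaction discards that relationship entirely; which one fits depends on whether downstream analysis needs to know that two records refer to the same entity.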
Supported Entity Types#
NeMo Safe Synthesizer recognizes many PII types out of the box, organized by category:
Personal Information#
first_name - Given names
last_name - Surnames and family names
name - Full names
Contact Information#
email - Email addresses
phone_number - Phone numbers in various formats
address - Physical addresses
Identifiers#
ssn - Social Security Numbers
passport_number - Passport identifiers
license_number - Driver’s license numbers
Financial Information#
credit_debit_card - Credit and debit card numbers
bank_account - Bank account numbers
Medical Information#
patient_id - Medical record identifiers
medical_record_number - Health record numbers
Custom Entity Types#
Beyond these built-in types, you can define custom entities using:
GLiNER - Zero-shot entity detection for domain-specific PII
LLM Classification - Flexible detection with custom prompts
Regex Patterns - Pattern-based detection for structured formats
Detection Method Notes:
GLiNER: Fast, accurate zero-shot NER for standard and custom entity types
Regex: Deterministic pattern matching, best for consistent formats (SSN, credit cards)
LLM: Contextual understanding, handles complex patterns and ambiguous cases
Example Custom Entity:
{
  "classify": {
    "enable": true,
    "entities": [
      "first_name", "last_name", "email",
      "employee_id", "project_code"
    ]
  }
}
Configuration#
PII replacement is configured through the replace_pii section:
{
  "replace_pii": {
    "globals": {"locales": ["en_US"]},
    "steps": [
      {
        "rows": {
          "update": [
            {
              "entity": ["email", "phone_number"],
              "value": "column.entity | fake"
            }
          ]
        }
      }
    ]
  }
}
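Before submitting a configuration, it can help to list which entity types each step targets. The sketch below parses the example above with Python's `json` module and walks its `steps`, `rows`, and `update` keys. The `list_rules` helper is hypothetical, and the code assumes only the structure shown in this example; the full schema may allow other keys.

```python
import json

config_text = """
{
  "replace_pii": {
    "globals": {"locales": ["en_US"]},
    "steps": [
      {"rows": {"update": [
        {"entity": ["email", "phone_number"], "value": "column.entity | fake"}
      ]}}
    ]
  }
}
"""

def list_rules(config):
    """Yield (step_index, entity, value_expression) for each update rule."""
    for i, step in enumerate(config["replace_pii"].get("steps", [])):
        for rule in step.get("rows", {}).get("update", []):
            for entity in rule.get("entity", []):
                yield i, entity, rule["value"]

config = json.loads(config_text)
for step, entity, expr in list_rules(config):
    print(step, entity, expr)
```

Parsing with `json.loads` also catches syntax errors such as trailing commas before the configuration reaches the service.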
When to Use PII Replacement#
Consider using PII replacement when:
Your data contains names, addresses, or other direct identifiers
Compliance requires PII removal before processing
You want to ensure the model cannot memorize sensitive values
You need to share synthetic data with external parties
PII replacement can be used standalone (without synthesis) or as a preprocessing step before synthesis.