PII Replacement

PII (Personally Identifiable Information) replacement is a privacy protection step that detects sensitive information in your dataset and replaces it before synthesis. Values that are detected and replaced, such as names, addresses, and other identifiers, never reach the model, so it cannot learn them.

How It Works

The PII replacement pipeline runs in four stages:

  1. Detection: Identifies PII entities using configurable detection methods

  2. Classification: Categorizes detected entities by type (name, email, address, etc.)

  3. Transformation: Replaces or redacts PII using configurable rules

  4. Validation: Verifies that sensitive information has been properly handled
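The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not the NeMo Safe Synthesizer implementation: the `PATTERNS` table and the function names `detect_and_classify`, `transform`, `validate`, and `replace_pii` are hypothetical, and plain regular expressions stand in for the configurable detectors.

```python
import re

# Hypothetical pattern table: entity type -> regex. A real deployment would
# use the configured detectors (GLiNER, LLM, or regex) instead.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_and_classify(text):
    """Stages 1 and 2: find PII spans and label each with its entity type."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), entity_type))
    return sorted(found)


def transform(text, entities):
    """Stage 3: replace each span with a placeholder, working right to left
    so that earlier offsets stay valid."""
    for start, end, entity_type in reversed(entities):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


def validate(text):
    """Stage 4: re-run detection; True means these patterns find nothing."""
    return not detect_and_classify(text)


def replace_pii(text):
    cleaned = transform(text, detect_and_classify(text))
    if not validate(cleaned):
        raise ValueError("PII still detectable after transformation")
    return cleaned
```

Calling `replace_pii("Mail jane.doe@example.com, SSN 123-45-6789.")` returns `"Mail <email>, SSN <ssn>."`. The validation stage here only confirms that the same detectors no longer fire; it cannot catch PII that the detectors never recognized.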

Detection Methods

NeMo Safe Synthesizer supports multiple PII detection approaches:

GLiNER Detection

Uses the GLiNER model for entity recognition:

  • Zero-shot entity detection

  • Supports custom entity types

  • High accuracy for standard PII categories

  • Configurable confidence thresholds
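To show what a confidence threshold does, the sketch below filters a list of predictions by score. It assumes each prediction is a dict with `text`, `label`, and `score` keys, which is the shape the open-source `gliner` Python package returns from its prediction call as far as we know. The sample values, the `THRESHOLDS` table, and `filter_by_confidence` are made up for illustration and are not Safe Synthesizer settings.

```python
# Sample predictions; the values are invented for this example.
predictions = [
    {"text": "Jane Doe", "label": "name", "score": 0.97},
    {"text": "Acme", "label": "name", "score": 0.41},
    {"text": "jane@example.com", "label": "email", "score": 0.88},
]

# Hypothetical per-type thresholds; types not listed use the default.
DEFAULT_THRESHOLD = 0.5
THRESHOLDS = {"email": 0.8}


def filter_by_confidence(entities, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep only entities whose score meets the threshold for their label."""
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


kept = filter_by_confidence(predictions)
```

Here `kept` holds the "Jane Doe" and "jane@example.com" entries; "Acme" is dropped because 0.41 is below the default of 0.5. Lowering a threshold catches more PII at the cost of more false positives, so the right value depends on how sensitive the entity type is.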

LLM Classification

Leverages language models for PII detection:

  • Contextual understanding of entities

  • Handles complex PII patterns

  • Flexible entity definitions

  • Configurable prompts and models
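A rough sketch of prompt-based detection follows. The prompt wording, the JSON reply format, and the helpers `build_prompt` and `parse_response` are all assumptions made for this example; they do not reflect the prompts that NeMo Safe Synthesizer actually sends. The call to the model itself is omitted, so any LLM client could be placed between the two functions.

```python
import json


def build_prompt(text, entity_types):
    """Build a classification prompt listing the entity types to look for."""
    types = ", ".join(entity_types)
    return (
        "Identify personally identifiable information in the text below.\n"
        f"Entity types: {types}\n"
        'Respond with a JSON list of objects with keys "text" and "type".\n\n'
        f"Text: {text}"
    )


def parse_response(raw, entity_types):
    """Parse the model's JSON reply, dropping malformed items and unknown types."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    allowed = set(entity_types)
    return [
        (item["text"], item["type"])
        for item in items
        if isinstance(item, dict)
        and isinstance(item.get("text"), str)
        and item.get("type") in allowed
    ]
```

Because the entity types are passed in as plain strings, adding a domain-specific type only requires extending the list. Parsing defensively matters here: model output is not guaranteed to be valid JSON or to stay within the requested types.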

Regex Detection

Pattern-based detection for structured PII:

  • Fast and deterministic

  • Ideal for known formats (SSN, phone numbers)

  • Customizable patterns

  • Low computational overhead
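As an illustration of pattern-based detection, the sketch below matches US-style Social Security and phone numbers with Python's `re` module. The patterns and the `find_pii` helper are examples written for this page, not the patterns shipped with the product, and they should be tuned to the formats in your data.

```python
import re

# Illustrative patterns for US-style formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]?\d{3}[ .-]\d{4}(?!\d)"),
}


def find_pii(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), entity_type, match.group()))
    return [(entity_type, value) for _, entity_type, value in sorted(hits)]
```

For the input `"SSN 123-45-6789, call (555) 123-4567 or 555.987.6543"`, `find_pii` returns one `ssn` hit followed by two `phone_number` hits. The same input always yields the same result, which is what makes regex detection easy to test and audit.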

Replacement Strategies

After detection, PII can be handled in multiple ways:

  • Fake Data: Generate realistic replacements using the Faker library

  • Redaction: Replace with placeholder tokens

  • Hashing: Replace each value with a one-way hash, so identical inputs always map to the same output

  • Custom Rules: Define your own transformation logic
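The strategies above differ mainly in what they preserve. The sketch below shows redaction, hashing, and one custom rule using only the standard library. The function names, the salt, and the 16-character truncation are illustrative choices, not product behavior. Fake data generation is left out to keep the example dependency-free; with the Faker library it would be a call such as `Faker().email()`.

```python
import hashlib


def redact(entity_type):
    """Redaction: swap the value for a placeholder token naming its type."""
    return f"<{entity_type.upper()}>"


def hash_value(value, salt="change-me"):
    """Hashing: salted SHA-256, truncated for readability. The same input
    always yields the same token, so repeated values stay linked."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_card(value):
    """Custom rule: keep only the last four digits of a card number."""
    digits = [c for c in value if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])
```

Redaction removes the value entirely, hashing keeps the ability to tell that two records share a value, and a custom rule such as `mask_card("4111 1111 1111 1111")`, which returns `"************1111"`, keeps a controlled fragment. Note that unsalted hashes of low-entropy values such as phone numbers can be reversed by brute force, which is why the sketch includes a salt.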

Supported Entity Types

NeMo Safe Synthesizer recognizes many PII types out of the box, organized by category:

Personal Information

  • first_name - Given names

  • last_name - Surnames and family names

  • name - Full names

Contact Information

  • email - Email addresses

  • phone_number - Phone numbers in various formats

  • address - Physical addresses

Identifiers

  • ssn - Social Security Numbers

  • passport_number - Passport identifiers

  • license_number - Driver’s license numbers

Financial Information

  • credit_debit_card - Credit and debit card numbers

  • bank_account - Bank account numbers

Medical Information

  • patient_id - Medical record identifiers

  • medical_record_number - Health record numbers

Custom Entity Types

Beyond these built-in types, you can define custom entities using:

  • GLiNER - Zero-shot entity detection for domain-specific PII

  • LLM Classification - Flexible detection with custom prompts

  • Regex Patterns - Pattern-based detection for structured formats

Detection Method Notes:

  • GLiNER: Fast, accurate zero-shot NER for standard and custom entity types

  • Regex: Deterministic pattern matching, best for consistent formats (SSN, credit cards)

  • LLM: Contextual understanding, handles complex patterns and ambiguous cases

Example Custom Entity (the custom types employee_id and project_code are listed alongside built-in types):

{
    "classify": {
        "enable": true,
        "entities": [
            "first_name", "last_name", "email",
            "employee_id", "project_code"
        ]
    }
}

Configuration

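PII replacement is configured through the replace_pii section. In the example below, a single step updates detected email and phone_number values with fake replacements, using the en_US locale: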

{
    "replace_pii": {
        "globals": {"locales": ["en_US"]},
        "steps": [
            {
                "rows": {
                    "update": [
                        {
                            "entity": ["email", "phone_number"],
                            "value": "column.entity | fake"
                        }
                    ]
                }
            }
        ]
    }
}
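If you generate configurations programmatically, a small helper can produce the same structure. The function `build_replace_pii_config` is a hypothetical convenience written for this page. It mirrors only the keys shown in the example above and makes no claim about other options the replace_pii section may accept.

```python
import json


def build_replace_pii_config(entities, locales=("en_US",), value="column.entity | fake"):
    """Build a replace_pii section with a single row-update rule."""
    if not entities:
        raise ValueError("at least one entity type is required")
    return {
        "replace_pii": {
            "globals": {"locales": list(locales)},
            "steps": [
                {"rows": {"update": [{"entity": list(entities), "value": value}]}}
            ],
        }
    }


config = build_replace_pii_config(["email", "phone_number"])
print(json.dumps(config, indent=4))
```

With the default arguments, the printed JSON has the same content as the configuration shown above. Building the dictionary in code makes it easier to keep the entity list in sync with the entities enabled for detection.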

When to Use PII Replacement

Consider using PII replacement when:

  • Your data contains names, addresses, or other direct identifiers

  • Compliance requires PII removal before processing

  • You want to ensure the model cannot memorize sensitive values

  • You need to share synthetic data with external parties

PII replacement can be used standalone (without synthesis) or as a preprocessing step before synthesis.