{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "(tutorial-pii-replacement)=\n", "\n", "# PII Replacement Deep Dive\n", "\n", "Learn how to detect, redact, or replace PII without creating a fully synthetic version. This tutorial demonstrates using Safe Synthesizer exclusively for PII replacement, exploring different detection methods and replacement strategies.\n", "\n", "## Prerequisites\n", "\n", "- Completed the [Safe Synthesizer 101](tutorial-safe-synthesizer-101) tutorial\n", "- Understanding of PII concepts (names, emails, addresses, etc.)\n", "- Safe Synthesizer deployment\n", "\n", "---\n", "\n", "## What You'll Learn\n", "\n", "- Configuring PII detection methods (GLiNER, LLM, Regex)\n", "- Defining custom entity types\n", "- Implementing replacement strategies (fake, redact, hash)\n", "- Comparing detection method performance\n", "- Using PII-only workflows (without synthesis)\n", "\n", "---\n", "\n", "## Understanding PII Replacement\n", "\n", "{{nss_short_name}} supports three detection methods:\n", "\n", "| Method | Best For | Speed | Accuracy |\n", "|--------|----------|-------|----------|\n", "| **GLiNER** | General PII detection | Fast | High |\n", "| **LLM** | Context-aware, custom entities | Slow | Very High |\n", "| **Regex** | Structured patterns (SSN, phone) | Very Fast | Perfect (for known formats) |\n", "\n", "---\n", "\n", "## Setup\n", "\n", "Install the NeMo Microservices SDK" ] }, { "cell_type": "code", "metadata": { "language": "shell", "vscode": { "languageId": "shellscript" } }, "source": [ "if command -v uv &> /dev/null; then\n", " uv pip install nemo-microservices[safe-synthesizer] matplotlib\n", "else\n", " pip install nemo-microservices[safe-synthesizer] matplotlib\n", "fi" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure the client" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "import time\n", "import pandas as pd\n", "from nemo_microservices import NeMoMicroservices\n", "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", "\n", "client = NeMoMicroservices(\n", " base_url=os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n", ")\n", "project_name = \"test-project\"\n", "try:\n", " client.projects.create(workspace=\"default\", name=project_name)\n", "except Exception:\n", " pass # Project may already exist" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Column Classification (Required)\n", "\n", "Safe Synthesizer uses LLM-based column classification to automatically detect column types and improve PII detection accuracy. You need to set up a model provider for this feature.\n", "\n", "**Set your NVIDIA API key:**" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "# Get your API key from https://build.nvidia.com/\n", "# You can also set this as an environment variable: export NIM_API_KEY=nvapi-...\n", "api_key = os.environ.get(\"NIM_API_KEY\")\n", "\n", "if not api_key:\n", " raise ValueError(\n", " \"NIM_API_KEY is required for column classification. 
\"\n", " \"Get your free API key from https://build.nvidia.com/\"\n", " )\n", "\n", "# Create the API key as a secret\n", "timestamp = int(time.time())\n", "api_key_secret_name = f\"nim-api-key-pii-tutorial-{timestamp}\"\n", "client.secrets.create(workspace=\"default\", name=api_key_secret_name, data=api_key)\n", "print(f\"\u2705 Created API key secret: {api_key_secret_name}\")\n", "\n", "# Create the model provider\n", "provider_name = f\"classify-llm-pii-tutorial-{timestamp}\"\n", "client.inference.providers.create(\n", " workspace=\"default\",\n", " name=provider_name,\n", " host_url=\"https://integrate.api.nvidia.com/v1\",\n", " api_key_secret_name=api_key_secret_name,\n", " description=\"Model provider for Safe Synthesizer column classification\",\n", ")\n", "print(f\"\u2713 Created model provider: {provider_name}\")\n", "print(\"\\n\u2705 Column classification configured successfully\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "Without column classification, you'll see errors like:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "\"Could not perform classify, falling back to default entities.\"" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "This causes Safe Synthesizer to use generic defaults instead of context-aware column detection, which may miss some PII or incorrectly identify columns.\n", ":::\n", "\n", "---\n", "\n", "## Create Sample Data with PII\n", "\n", "Create a test dataset with various types of PII:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Sample data with different PII types\n", "data = {\n", " 'customer_id': [1, 2, 3, 4, 5],\n", " 'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams', 'Charlie Brown'],\n", " 'email': ['john.smith@email.com', 'jane.doe@email.com', 'bob.j@email.com', \n", " 'alice.w@email.com', 'charlie.b@email.com'],\n", " 'phone': ['555-123-4567', '555-234-5678', '555-345-6789', '555-456-7890', '555-567-8901'],\n", " 'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA',\n", " '789 Pine Rd, Chicago, IL', '321 Elm St, Houston, TX',\n", " '654 Maple Dr, Phoenix, AZ'],\n", " 'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012', '567-89-0123'],\n", " 'review': [\n", " 'Great product! I am 35 years old and weigh 180 lbs.',\n", " 'Love it! Perfect for my apartment in Brooklyn.',\n", " 'My wife Jane really likes this.',\n", " 'Recommended by my doctor Dr. 
"        'Recommended by my doctor Dr. Smith.',\n", "        'Bought this for my daughter Emily.'\n", "    ]\n", "}\n", "\n", "df = pd.DataFrame(data)\n", "print(\"\ud83d\udccb Sample Data with PII:\")\n", "print(df.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 1: GLiNER Detection (Recommended)\n", "\n", "GLiNER provides fast, accurate detection for standard PII types:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\ud83d\udd2c Method 1: GLiNER Detection\")\n", "\n", "# Configure PII replacement with GLiNER\n", "builder_gliner = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"gliner\",\n", "        entity_types=[\"person\", \"email\", \"phone_number\", \"address\", \"ssn\"],\n", "        replacement_strategy=\"fake\"\n", "    )\n", "    # Note: No .synthesize() - PII replacement only!\n", ")\n", "\n", "job_gliner = builder_gliner.create_job(name=f\"pii-gliner-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 GLiNER job created: {job_gliner.job_name}\")\n", "\n", "# Wait for completion\n", "job_gliner.wait_for_completion()\n", "print(\"\u2705 GLiNER processing complete\")\n", "\n", "# Retrieve results\n", "redacted_df_gliner = job_gliner.fetch_data()\n", "print(\"\ud83d\udcca GLiNER Results:\")\n", "print(redacted_df_gliner.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 2: LLM Classification\n", "\n", "Use an LLM for context-aware PII detection:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\ud83d\udd2c Method 2: LLM Classification\")\n", "\n", "# Configure PII replacement with LLM\n", "builder_llm = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"llm\",\n", "        entity_types=[\"person\", \"email\", \"location\", \"organization\"],\n", "        replacement_strategy=\"fake\",\n", "        llm_model=\"meta/llama-3.2-1b-instruct\"  # Specify LLM model\n", "    )\n", ")\n", "\n", "job_llm = builder_llm.create_job(name=f\"pii-llm-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 LLM job created: {job_llm.job_name}\")\n", "\n", "job_llm.wait_for_completion()\n", "print(\"\u2705 LLM processing complete\")\n", "\n", "redacted_df_llm = job_llm.fetch_data()\n", "print(\"\\n\ud83d\udcca LLM Results:\")\n", "print(redacted_df_llm.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 3: Regex Detection\n", "\n", "Use regex for structured PII patterns:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\\n\ud83d\udd2c Method 3: Regex Detection\")\n", "\n", "# Configure custom regex patterns\n", "regex_patterns = {\n", "    'email': r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b',\n", "    'phone': r'\\b\\d{3}-\\d{3}-\\d{4}\\b',\n", "    'ssn': r'\\b\\d{3}-\\d{2}-\\d{4}\\b'\n", "}\n", "\n", "builder_regex = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"regex\",\n", "        regex_patterns=regex_patterns,\n", "        replacement_strategy=\"redact\"  # Use redaction for regex\n", "    )\n", ")\n", "\n", "job_regex = builder_regex.create_job(name=f\"pii-regex-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 Regex job created: {job_regex.job_name}\")\n", "\n", "job_regex.wait_for_completion()\n", "print(\"\u2705 Regex processing complete\")\n", "\n", "redacted_df_regex = job_regex.fetch_data()\n", "print(\"\\n\ud83d\udcca Regex Results:\")\n", "print(redacted_df_regex.head())" ], "outputs": [], "execution_count": null },
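{ "cell_type": "markdown", "metadata": {}, "source": [ "Before submitting a regex job, you can sanity-check the patterns locally with pandas string methods, which interpret patterns as regular expressions. This quick check runs entirely client-side and is independent of the Safe Synthesizer API:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Local sanity check (illustrative): count regex matches in the sample data.\n", "# Series.str.count() treats the pattern as a regular expression.\n", "for pattern_name, pattern in regex_patterns.items():\n", "    total = sum(df[col].astype(str).str.count(pattern).sum() for col in df.columns)\n", "    print(f\"{pattern_name}: {int(total)} matches in sample data\")" ], "outputs": [], "execution_count": null },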
)\n", ")\n", "\n", "job_regex = builder_regex.create_job(name=f\"pii-regex-job-{int(time.time())}\", project=\"test-project\")\n", "print(f\"\u2705 Regex job created: {job_regex.job_name}\")\n", "\n", "job_regex.wait_for_completion()\n", "print(\"\u2705 Regex processing complete\")\n", "\n", "redacted_df_regex = job_regex.fetch_data()\n", "print(\"\\n\ud83d\udcca Regex Results:\")\n", "print(redacted_df_regex.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Replacement Strategies\n", "\n", "### Strategy 1: Fake Data\n", "\n", "Replace PII with realistic fake values:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Fake data maintains data type and format\n", "builder_fake = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\", \"phone_number\"],\n", " replacement_strategy=\"fake\",\n", " fake_locale=\"en_US\" # Specify locale for realistic fakes\n", " )\n", ")\n", "\n", "job_fake = builder_fake.create_job(name=f\"pii-fake-job-{int(time.time())}\", project=\"test-project\")\n", "job_fake.wait_for_completion()\n", "df_fake = job_fake.fetch_data()\n", "\n", "print(\"\ud83d\udccb Fake Strategy - Original vs Replaced:\")\n", "print(f\"Original email: {df['email'].iloc[0]}\")\n", "print(f\"Fake email: {df_fake['email'].iloc[0]}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strategy 2: Redaction\n", "\n", "Replace PII with placeholder tokens:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Redaction replaces with [REDACTED] or custom token\n", "builder_redact = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\", \"phone_number\"],\n", " replacement_strategy=\"redact\",\n", " redaction_token=\"[REMOVED]\" # Custom redaction token\n", " )\n", ")\n", "\n", "job_redact = builder_redact.create_job(name=f\"pii-redact-job-{int(time.time())}\", project=\"test-project\")\n", "job_redact.wait_for_completion()\n", "df_redact = job_redact.fetch_data()\n", "\n", "print(\"\\n\ud83d\udccb Redaction Strategy - Original vs Replaced:\")\n", "print(f\"Original name: {df['name'].iloc[0]}\")\n", "print(f\"Redacted name: {df_redact['name'].iloc[0]}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strategy 3: Hashing\n", "\n", "Replace with consistent hash values:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Hashing maintains consistency (same input = same output)\n", "builder_hash = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\"],\n", " replacement_strategy=\"hash\",\n", " hash_salt=\"my-secret-salt\" # Consistent salt for reproducibility\n", " )\n", ")\n", "\n", "job_hash = builder_hash.create_job(name=f\"pii-hash-job-{int(time.time())}\", project=\"test-project\")\n", "job_hash.wait_for_completion()\n", "df_hash = job_hash.fetch_data()\n", "\n", "print(\"\\n\ud83d\udccb Hashing Strategy - Original vs Replaced:\")\n", "print(f\"Original name: {df['name'].iloc[0]}\")\n", 
"print(f\"Hashed name: {df_hash['name'].iloc[0]}\")\n", "print(\"\\nNote: Same name will always hash to the same value\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Advanced Configuration\n", "\n", "### Custom Entity Types\n", "\n", "Define custom PII entities specific to your domain:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Example: Healthcare-specific entities\n", "custom_entities = [\n", " \"patient_id\",\n", " \"medical_record_number\",\n", " \"prescription_number\",\n", " \"insurance_id\",\n", " \"diagnosis_code\"\n", "]\n", "\n", "builder_custom = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"llm\",\n", " entity_types=custom_entities,\n", " replacement_strategy=\"fake\",\n", " custom_entity_definitions={\n", " \"patient_id\": \"A unique identifier for a patient in the format P-XXXXX\",\n", " \"medical_record_number\": \"Medical record number in format MRN-XXXXXX\"\n", " }\n", " )\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column-Specific Configuration\n", "\n", "Apply different strategies to different columns:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Use API directly for fine-grained control\n", "job_spec = {\n", " \"data_source\": \"fileset://default/safe-synthesizer-inputs/data.csv\",\n", " \"config\": {\n", " \"enable_replace_pii\": True,\n", " \"replace_pii\": {\n", " \"globals\": {\"locales\": [\"en_US\"]},\n", " \"steps\": [\n", " {\n", " \"rows\": {\n", " \"update\": [\n", " {\n", " \"column\": \"email\",\n", " \"entity\": [\"email\"],\n", " \"value\": \"column.entity | fake\"\n", " },\n", " {\n", " \"column\": \"name\",\n", " \"entity\": [\"person\"],\n", " \"value\": \"[REDACTED]\"\n", " },\n", " {\n", " \"column\": \"ssn\",\n", " \"entity\": [\"ssn\"],\n", " \"value\": \"column.entity | hash\"\n", " }\n", " ]\n", " }\n", " }\n", " ]\n", " }\n", " }\n", "}\n", "\n", "job_custom = client.safe_synthesizer.jobs.create(\n", " workspace=\"default\",\n", " name=f\"column-specific-pii-{int(time.time())}\",\n", " project=\"test-project\",\n", " spec=job_spec\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Compare Detection Methods\n", "\n", "Analyze the differences between detection methods:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Compare detection rates\n", "methods = ['GLiNER', 'LLM', 'Regex']\n", "\n", "# Count detected entities (example - adjust based on actual results)\n", "def count_changes(original_df, processed_df):\n", " \"\"\"Count how many values changed\"\"\"\n", " changes = 0\n", " for col in original_df.columns:\n", " if col in processed_df.columns:\n", " changes += (original_df[col] != processed_df[col]).sum()\n", " return changes\n", "\n", "detections = [\n", " count_changes(df, redacted_df_gliner),\n", " count_changes(df, redacted_df_llm),\n", " count_changes(df, redacted_df_regex)\n", "]\n", "\n", "plt.figure(figsize=(10, 5))\n", "plt.bar(methods, detections, color=['blue', 'green', 'red'], alpha=0.7)\n", "plt.xlabel('Detection Method')\n", "plt.ylabel('Number of Detections')\n", "plt.title('PII Detection Comparison')\n", "plt.show()\n", "\n", "print(f\"\\n\ud83d\udcca Detection Method Comparison:\")\n", "for method, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Validation and Quality Checks\n", "\n", "Verify PII has been properly removed:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "import re\n", "\n", "# Check for remaining PII patterns\n", "def check_for_pii(frame, patterns):\n", "    \"\"\"Check if any PII patterns remain in the data\"\"\"\n", "    found_pii = []\n", "\n", "    for col in frame.columns:\n", "        for idx, value in enumerate(frame[col]):\n", "            if pd.notna(value):\n", "                value_str = str(value)\n", "                for pattern_name, pattern in patterns.items():\n", "                    if re.search(pattern, value_str):\n", "                        found_pii.append({\n", "                            'column': col,\n", "                            'row': idx,\n", "                            'pattern': pattern_name,\n", "                            'value': value_str\n", "                        })\n", "\n", "    return found_pii\n", "\n", "# Define PII patterns to check\n", "pii_patterns = {\n", "    'email': r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b',\n", "    'phone': r'\\b\\d{3}-\\d{3}-\\d{4}\\b',\n", "    'ssn': r'\\b\\d{3}-\\d{2}-\\d{4}\\b'\n", "}\n", "\n", "# Check original data\n", "original_pii = check_for_pii(df, pii_patterns)\n", "print(f\"\\n\ud83d\udd0d PII in original data: {len(original_pii)} instances\")\n", "\n", "# Check redacted data (fake-replaced data would still match these patterns,\n", "# because fake values are realistic by design)\n", "redacted_pii = check_for_pii(df_redact, pii_patterns)\n", "print(f\"\ud83d\udd0d PII in redacted data: {len(redacted_pii)} instances\")\n", "\n", "if redacted_pii:\n", "    print(\"\\n\u26a0\ufe0f Warning: Some PII may remain. Review configuration.\")\n", "    for item in redacted_pii[:5]:  # Show first 5\n", "        print(f\"  - {item}\")\n", "else:\n", "    print(\"\\n\u2705 No PII patterns detected in redacted data\")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Best Practices\n", "\n", "### 1. Layer Detection Methods\n", "\n", "Combine multiple methods for comprehensive coverage:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Multi-layered approach\n", "builder_layered = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)\n", "    .with_replace_pii(\n", "        detection_methods=[\n", "            {\"method\": \"regex\", \"priority\": 1},   # First pass: structured patterns\n", "            {\"method\": \"gliner\", \"priority\": 2},  # Second pass: general PII\n", "            {\"method\": \"llm\", \"priority\": 3}      # Third pass: contextual PII\n", "        ]\n", "    )\n", ")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Test with Sample Data\n", "\n", "Always test on a small sample first:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Test on a small sample (the demo df is already tiny; with a real dataset,\n", "# slice a few hundred rows first)\n", "df_sample = df.head(10)\n", "\n", "builder_test = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df_sample)\n", "    .with_classify_model_provider(provider_name)\n", "    .with_replace_pii()\n", ")\n", "\n", "job_test = builder_test.create_job(name=f\"pii-sample-test-{int(time.time())}\", project=project_name)\n", "job_test.wait_for_completion()\n", "\n", "# Review results before processing the full dataset\n", "df_test_result = job_test.fetch_data()\n", "print(\"Review sample results before proceeding with the full dataset\")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Document the Replacement Strategy\n", "\n", "Keep records of what was replaced:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Create an audit log\n", "audit_log = {\n", "    'job_name': job_gliner.job_name,\n", "    'date': pd.Timestamp.now(),\n", "    'method': 'gliner',\n", "    'strategy': 'fake',\n", "    'entities': ['person', 'email', 'phone_number', 'address', 'ssn'],\n", "    'records_processed': len(df),\n", "    'values_modified': count_changes(df, redacted_df_gliner)\n", "}\n", "\n", "print(\"\\n\ud83d\udcdd Audit Log:\")\n", "print(audit_log)" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Next Steps\n", "\n", "- {doc}`../about/pii-replacement` - PII replacement concepts\n", "- [Differential Privacy Tutorial](tutorial-differential-privacy) - Add DP guarantees\n", "- {doc}`../about/index` - Core concepts and components\n", "\n", "---\n", "\n", "## Troubleshooting\n", "\n", "### PII Not Detected\n", "\n", "- Try different detection methods\n", "- Add custom entity definitions\n", "- Use LLM classification for context-aware detection\n", "- Review entity type names (case-sensitive)\n", "\n", "### Incorrect Replacements\n", "\n", "- Adjust confidence thresholds\n", "- Use more specific entity types\n", "- Combine with regex for known patterns\n", "- Test on sample data first\n", "\n", "### Performance Issues\n", "\n", "- Use GLiNER for standard PII (fastest)\n", "- Reserve LLM for complex cases only\n", "- Process large datasets in batches (see the sketch below)\n", "- Use regex for structured patterns" ] },
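{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to batch is to split the DataFrame client-side and submit one PII replacement job per chunk. The sketch below only reuses calls shown earlier in this tutorial; note that `fake` replacements may not stay consistent for entities that repeat across chunks, whereas `hash` with a fixed salt will:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Illustrative batching helper (a sketch, not a built-in API): split a large\n", "# DataFrame into chunks and submit one PII replacement job per chunk.\n", "def replace_pii_in_batches(frame, chunk_size=10_000):\n", "    results = []\n", "    for start in range(0, len(frame), chunk_size):\n", "        chunk = frame.iloc[start:start + chunk_size]\n", "        builder = (\n", "            SafeSynthesizerJobBuilder(client)\n", "            .with_data_source(chunk)\n", "            .with_classify_model_provider(provider_name)\n", "            .with_replace_pii(\n", "                detection_method=\"gliner\",\n", "                entity_types=[\"person\", \"email\"],\n", "                replacement_strategy=\"hash\",  # hash + fixed salt stays consistent across chunks\n", "                hash_salt=\"my-secret-salt\"\n", "            )\n", "        )\n", "        job = builder.create_job(name=f\"pii-batch-{start}-{int(time.time())}\", project=project_name)\n", "        job.wait_for_completion()\n", "        results.append(job.fetch_data())\n", "    return pd.concat(results, ignore_index=True)\n", "\n", "# Example usage:\n", "# df_batched = replace_pii_in_batches(df)  # single chunk for the tiny demo df" ], "outputs": [], "execution_count": null },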
{ "cell_type": "markdown", "metadata": {}, "source": [ "For more help, see {doc}`../about/pii-replacement`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }