{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "(tutorial-pii-replacement)=\n", "\n", "# PII Replacement Deep Dive\n", "\n", "Learn how to detect, redact, or replace PII without creating a fully synthetic version. This tutorial demonstrates using Safe Synthesizer exclusively for PII replacement, exploring different detection methods and replacement strategies.\n", "\n", "## Prerequisites\n", "\n", "- Completed the [Safe Synthesizer 101](tutorial-safe-synthesizer-101) tutorial\n", "- Understanding of PII concepts (names, emails, addresses, etc.)\n", "- Safe Synthesizer deployment\n", "\n", "---\n", "\n", "## What You'll Learn\n", "\n", "- Configuring PII detection methods (GLiNER, LLM, Regex)\n", "- Defining custom entity types\n", "- Implementing replacement strategies (fake, redact, hash)\n", "- Comparing detection method performance\n", "- Using PII-only workflows (without synthesis)\n", "\n", "---\n", "\n", "## Understanding PII Replacement\n", "\n", "{{nss_short_name}} supports three detection methods:\n", "\n", "| Method | Best For | Speed | Accuracy |\n", "|--------|----------|-------|----------|\n", "| **GLiNER** | General PII detection | Fast | High |\n", "| **LLM** | Context-aware, custom entities | Slow | Very High |\n", "| **Regex** | Structured patterns (SSN, phone) | Very Fast | Perfect (for known formats) |\n", "\n", "---\n", "\n", "## Setup\n", "\n", "Install the NeMo Microservices SDK" ] }, { "cell_type": "code", "metadata": { "language": "shell", "vscode": { "languageId": "shellscript" } }, "source": [ "if command -v uv &> /dev/null; then\n", " uv pip install nemo-microservices[safe-synthesizer] matplotlib\n", "else\n", " pip install nemo-microservices[safe-synthesizer] matplotlib\n", "fi" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure the client" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "import time\n", "import pandas as pd\n", "from nemo_microservices import NeMoMicroservices\n", "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", "\n", "client = NeMoMicroservices(\n", " base_url=os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n", ")\n", "project_name = \"test-project\"\n", "try:\n", " client.projects.create(workspace=\"default\", name=project_name)\n", "except Exception:\n", " pass # Project may already exist" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Column Classification (Required)\n", "\n", "Safe Synthesizer uses LLM-based column classification to automatically detect column types and improve PII detection accuracy. You need to set up a model provider for this feature.\n", "\n", "**Set your NVIDIA API key:**" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "# Get your API key from https://build.nvidia.com/\n", "# You can also set this as an environment variable: export NIM_API_KEY=nvapi-...\n", "api_key = os.environ.get(\"NIM_API_KEY\")\n", "\n", "if not api_key:\n", " raise ValueError(\n", " \"NIM_API_KEY is required for column classification. 
\"\n", " \"Get your free API key from https://build.nvidia.com/\"\n", " )\n", "\n", "# Create the API key as a secret\n", "timestamp = int(time.time())\n", "api_key_secret_name = f\"nim-api-key-pii-tutorial-{timestamp}\"\n", "client.secrets.create(workspace=\"default\", name=api_key_secret_name, data=api_key)\n", "print(f\"\u2705 Created API key secret: {api_key_secret_name}\")\n", "\n", "# Create the model provider\n", "provider_name = f\"classify-llm-pii-tutorial-{timestamp}\"\n", "client.inference.providers.create(\n", " workspace=\"default\",\n", " name=provider_name,\n", " host_url=\"https://integrate.api.nvidia.com/v1\",\n", " api_key_secret_name=api_key_secret_name,\n", " description=\"Model provider for Safe Synthesizer column classification\",\n", ")\n", "print(f\"\u2713 Created model provider: {provider_name}\")\n", "print(\"\\n\u2705 Column classification configured successfully\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "Without column classification, you'll see errors like:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "\"Could not perform classify, falling back to default entities.\"" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "This causes Safe Synthesizer to use generic defaults instead of context-aware column detection, which may miss some PII or incorrectly identify columns.\n", ":::\n", "\n", "---\n", "\n", "## Create Sample Data with PII\n", "\n", "Create a test dataset with various types of PII:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Sample data with different PII types\n", "data = {\n", " 'customer_id': [1, 2, 3, 4, 5],\n", " 'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams', 'Charlie Brown'],\n", " 'email': ['john.smith@email.com', 'jane.doe@email.com', 'bob.j@email.com', \n", " 'alice.w@email.com', 'charlie.b@email.com'],\n", " 'phone': ['555-123-4567', '555-234-5678', '555-345-6789', '555-456-7890', '555-567-8901'],\n", " 'address': ['123 Main St, New York, NY', '456 Oak Ave, Los Angeles, CA',\n", " '789 Pine Rd, Chicago, IL', '321 Elm St, Houston, TX',\n", " '654 Maple Dr, Phoenix, AZ'],\n", " 'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012', '567-89-0123'],\n", " 'review': [\n", " 'Great product! I am 35 years old and weigh 180 lbs.',\n", " 'Love it! Perfect for my apartment in Brooklyn.',\n", " 'My wife Jane really likes this.',\n", " 'Recommended by my doctor Dr. 
"        'Recommended by my doctor Dr. Smith.',\n", "        'Bought this for my daughter Emily.'\n", "    ]\n", "}\n", "\n", "df = pd.DataFrame(data)\n", "print(\"\ud83d\udccb Sample Data with PII:\")\n", "print(df.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 1: GLiNER Detection (Recommended)\n", "\n", "GLiNER provides fast, accurate detection for standard PII types:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\ud83d\udd2c Method 1: GLiNER Detection\")\n", "\n", "# Configure PII replacement with GLiNER\n", "builder_gliner = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"gliner\",\n", "        entity_types=[\"person\", \"email\", \"phone_number\", \"address\", \"ssn\"],\n", "        replacement_strategy=\"fake\"\n", "    )\n", "    # Note: No .synthesize() - PII replacement only!\n", ")\n", "\n", "job_gliner = builder_gliner.create_job(name=f\"pii-gliner-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 GLiNER job created: {job_gliner.job_name}\")\n", "\n", "# Wait for completion\n", "job_gliner.wait_for_completion()\n", "print(\"\u2705 GLiNER processing complete\")\n", "\n", "# Retrieve results\n", "redacted_df_gliner = job_gliner.fetch_data()\n", "print(\"\ud83d\udcca GLiNER Results:\")\n", "print(redacted_df_gliner.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 2: LLM Classification\n", "\n", "Use an LLM for context-aware PII detection:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\ud83d\udd2c Method 2: LLM Classification\")\n", "\n", "# Configure PII replacement with LLM\n", "builder_llm = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"llm\",\n", "        entity_types=[\"person\", \"email\", \"location\", \"organization\"],\n", "        replacement_strategy=\"fake\",\n", "        llm_model=\"meta/llama-3.2-1b-instruct\"  # Specify LLM model\n", "    )\n", ")\n", "\n", "job_llm = builder_llm.create_job(name=f\"pii-llm-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 LLM job created: {job_llm.job_name}\")\n", "\n", "job_llm.wait_for_completion()\n", "print(\"\u2705 LLM processing complete\")\n", "\n", "redacted_df_llm = job_llm.fetch_data()\n", "print(\"\\n\ud83d\udcca LLM Results:\")\n", "print(redacted_df_llm.head())" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Method 3: Regex Detection\n", "\n", "Use regex for structured PII patterns:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(\"\\n\ud83d\udd2c Method 3: Regex Detection\")\n", "\n", "# Configure custom regex patterns\n", "regex_patterns = {\n", "    'email': r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b',\n", "    'phone': r'\\b\\d{3}-\\d{3}-\\d{4}\\b',\n", "    'ssn': r'\\b\\d{3}-\\d{2}-\\d{4}\\b'\n", "}\n", "\n", "builder_regex = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii(\n", "        detection_method=\"regex\",\n", "        regex_patterns=regex_patterns,\n", "        replacement_strategy=\"redact\"  # Use redaction for regex\n", "    )\n", ")\n", "\n", "job_regex = builder_regex.create_job(name=f\"pii-regex-job-{int(time.time())}\", project=project_name)\n", "print(f\"\u2705 Regex job created: {job_regex.job_name}\")\n", "\n", "job_regex.wait_for_completion()\n", "print(\"\u2705 Regex processing complete\")\n", "\n", "redacted_df_regex = job_regex.fetch_data()\n", "print(\"\\n\ud83d\udcca Regex Results:\")\n", "print(redacted_df_regex.head())" ], "outputs": [], "execution_count": null },
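{ "cell_type": "markdown", "metadata": {}, "source": [ "Before submitting a regex job, you can sanity-check the patterns locally with pandas string methods, which interpret patterns as regular expressions. This quick check runs entirely client-side and is independent of the Safe Synthesizer API:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Local sanity check (illustrative): count regex matches in the sample data.\n", "# Series.str.count() treats the pattern as a regular expression.\n", "for pattern_name, pattern in regex_patterns.items():\n", "    total = sum(df[col].astype(str).str.count(pattern).sum() for col in df.columns)\n", "    print(f\"{pattern_name}: {int(total)} matches in sample data\")" ], "outputs": [], "execution_count": null },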
)\n", ")\n", "\n", "job_regex = builder_regex.create_job(name=f\"pii-regex-job-{int(time.time())}\", project=\"test-project\")\n", "print(f\"\u2705 Regex job created: {job_regex.job_name}\")\n", "\n", "job_regex.wait_for_completion()\n", "print(\"\u2705 Regex processing complete\")\n", "\n", "redacted_df_regex = job_regex.fetch_data()\n", "print(\"\\n\ud83d\udcca Regex Results:\")\n", "print(redacted_df_regex.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Replacement Strategies\n", "\n", "### Strategy 1: Fake Data\n", "\n", "Replace PII with realistic fake values:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Fake data maintains data type and format\n", "builder_fake = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\", \"phone_number\"],\n", " replacement_strategy=\"fake\",\n", " fake_locale=\"en_US\" # Specify locale for realistic fakes\n", " )\n", ")\n", "\n", "job_fake = builder_fake.create_job(name=f\"pii-fake-job-{int(time.time())}\", project=\"test-project\")\n", "job_fake.wait_for_completion()\n", "df_fake = job_fake.fetch_data()\n", "\n", "print(\"\ud83d\udccb Fake Strategy - Original vs Replaced:\")\n", "print(f\"Original email: {df['email'].iloc[0]}\")\n", "print(f\"Fake email: {df_fake['email'].iloc[0]}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strategy 2: Redaction\n", "\n", "Replace PII with placeholder tokens:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Redaction replaces with [REDACTED] or custom token\n", "builder_redact = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\", \"phone_number\"],\n", " replacement_strategy=\"redact\",\n", " redaction_token=\"[REMOVED]\" # Custom redaction token\n", " )\n", ")\n", "\n", "job_redact = builder_redact.create_job(name=f\"pii-redact-job-{int(time.time())}\", project=\"test-project\")\n", "job_redact.wait_for_completion()\n", "df_redact = job_redact.fetch_data()\n", "\n", "print(\"\\n\ud83d\udccb Redaction Strategy - Original vs Replaced:\")\n", "print(f\"Original name: {df['name'].iloc[0]}\")\n", "print(f\"Redacted name: {df_redact['name'].iloc[0]}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strategy 3: Hashing\n", "\n", "Replace with consistent hash values:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Hashing maintains consistency (same input = same output)\n", "builder_hash = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"gliner\",\n", " entity_types=[\"person\", \"email\"],\n", " replacement_strategy=\"hash\",\n", " hash_salt=\"my-secret-salt\" # Consistent salt for reproducibility\n", " )\n", ")\n", "\n", "job_hash = builder_hash.create_job(name=f\"pii-hash-job-{int(time.time())}\", project=\"test-project\")\n", "job_hash.wait_for_completion()\n", "df_hash = job_hash.fetch_data()\n", "\n", "print(\"\\n\ud83d\udccb Hashing Strategy - Original vs Replaced:\")\n", "print(f\"Original name: {df['name'].iloc[0]}\")\n", 
"print(f\"Hashed name: {df_hash['name'].iloc[0]}\")\n", "print(\"\\nNote: Same name will always hash to the same value\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Advanced Configuration\n", "\n", "### Custom Entity Types\n", "\n", "Define custom PII entities specific to your domain:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Example: Healthcare-specific entities\n", "custom_entities = [\n", " \"patient_id\",\n", " \"medical_record_number\",\n", " \"prescription_number\",\n", " \"insurance_id\",\n", " \"diagnosis_code\"\n", "]\n", "\n", "builder_custom = (\n", " SafeSynthesizerJobBuilder(client)\n", " .with_data_source(df)\n", " .with_classify_model_provider(provider_name)\n", " .with_replace_pii(\n", " detection_method=\"llm\",\n", " entity_types=custom_entities,\n", " replacement_strategy=\"fake\",\n", " custom_entity_definitions={\n", " \"patient_id\": \"A unique identifier for a patient in the format P-XXXXX\",\n", " \"medical_record_number\": \"Medical record number in format MRN-XXXXXX\"\n", " }\n", " )\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column-Specific Configuration\n", "\n", "Apply different strategies to different columns:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Use API directly for fine-grained control\n", "job_spec = {\n", " \"data_source\": \"fileset://default/safe-synthesizer-inputs/data.csv\",\n", " \"config\": {\n", " \"enable_replace_pii\": True,\n", " \"replace_pii\": {\n", " \"globals\": {\"locales\": [\"en_US\"]},\n", " \"steps\": [\n", " {\n", " \"rows\": {\n", " \"update\": [\n", " {\n", " \"column\": \"email\",\n", " \"entity\": [\"email\"],\n", " \"value\": \"column.entity | fake\"\n", " },\n", " {\n", " \"column\": \"name\",\n", " \"entity\": [\"person\"],\n", " \"value\": \"[REDACTED]\"\n", " },\n", " {\n", " \"column\": \"ssn\",\n", " \"entity\": [\"ssn\"],\n", " \"value\": \"column.entity | hash\"\n", " }\n", " ]\n", " }\n", " }\n", " ]\n", " }\n", " }\n", "}\n", "\n", "job_custom = client.safe_synthesizer.jobs.create(\n", " workspace=\"default\",\n", " name=f\"column-specific-pii-{int(time.time())}\",\n", " project=\"test-project\",\n", " spec=job_spec\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Compare Detection Methods\n", "\n", "Analyze the differences between detection methods:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Compare detection rates\n", "methods = ['GLiNER', 'LLM', 'Regex']\n", "\n", "# Count detected entities (example - adjust based on actual results)\n", "def count_changes(original_df, processed_df):\n", " \"\"\"Count how many values changed\"\"\"\n", " changes = 0\n", " for col in original_df.columns:\n", " if col in processed_df.columns:\n", " changes += (original_df[col] != processed_df[col]).sum()\n", " return changes\n", "\n", "detections = [\n", " count_changes(df, redacted_df_gliner),\n", " count_changes(df, redacted_df_llm),\n", " count_changes(df, redacted_df_regex)\n", "]\n", "\n", "plt.figure(figsize=(10, 5))\n", "plt.bar(methods, detections, color=['blue', 'green', 'red'], alpha=0.7)\n", "plt.xlabel('Detection Method')\n", "plt.ylabel('Number of Detections')\n", "plt.title('PII Detection Comparison')\n", "plt.show()\n", "\n", "print(f\"\\n\ud83d\udcca Detection Method Comparison:\")\n", "for method, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Validation and Quality Checks\n", "\n", "Verify PII has been properly removed:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "import re\n", "\n", "# Check for remaining PII patterns\n", "def check_for_pii(frame, patterns):\n", "    \"\"\"Check if any PII patterns remain in the data\"\"\"\n", "    found_pii = []\n", "\n", "    for col in frame.columns:\n", "        for idx, value in enumerate(frame[col]):\n", "            if pd.notna(value):\n", "                value_str = str(value)\n", "                for pattern_name, pattern in patterns.items():\n", "                    if re.search(pattern, value_str):\n", "                        found_pii.append({\n", "                            'column': col,\n", "                            'row': idx,\n", "                            'pattern': pattern_name,\n", "                            'value': value_str\n", "                        })\n", "\n", "    return found_pii\n", "\n", "# Define PII patterns to check\n", "pii_patterns = {\n", "    'email': r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b',\n", "    'phone': r'\\b\\d{3}-\\d{3}-\\d{4}\\b',\n", "    'ssn': r'\\b\\d{3}-\\d{2}-\\d{4}\\b'\n", "}\n", "\n", "# Check original data\n", "original_pii = check_for_pii(df, pii_patterns)\n", "print(f\"\\n\ud83d\udd0d PII in original data: {len(original_pii)} instances\")\n", "\n", "# Check redacted data (fake-replaced data would still match these patterns,\n", "# because fake values are realistic by design)\n", "redacted_pii = check_for_pii(df_redact, pii_patterns)\n", "print(f\"\ud83d\udd0d PII in redacted data: {len(redacted_pii)} instances\")\n", "\n", "if redacted_pii:\n", "    print(\"\\n\u26a0\ufe0f Warning: Some PII may remain. Review configuration.\")\n", "    for item in redacted_pii[:5]:  # Show first 5\n", "        print(f\"  - {item}\")\n", "else:\n", "    print(\"\\n\u2705 No PII patterns detected in redacted data\")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Best Practices\n", "\n", "### 1. Layer Detection Methods\n", "\n", "Combine multiple methods for comprehensive coverage:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Multi-layered approach\n", "builder_layered = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)\n", "    .with_replace_pii(\n", "        detection_methods=[\n", "            {\"method\": \"regex\", \"priority\": 1},   # First pass: structured patterns\n", "            {\"method\": \"gliner\", \"priority\": 2},  # Second pass: general PII\n", "            {\"method\": \"llm\", \"priority\": 3}      # Third pass: contextual PII\n", "        ]\n", "    )\n", ")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Test with Sample Data\n", "\n", "Always test on a small sample first:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Test on a small sample (the demo df is already tiny; with a real dataset,\n", "# slice a few hundred rows first)\n", "df_sample = df.head(10)\n", "\n", "builder_test = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df_sample)\n", "    .with_classify_model_provider(provider_name)\n", "    .with_replace_pii()\n", ")\n", "\n", "job_test = builder_test.create_job(name=f\"pii-sample-test-{int(time.time())}\", project=project_name)\n", "job_test.wait_for_completion()\n", "\n", "# Review results before processing the full dataset\n", "df_test_result = job_test.fetch_data()\n", "print(\"Review sample results before proceeding with the full dataset\")" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Document the Replacement Strategy\n", "\n", "Keep records of what was replaced:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Create an audit log\n", "audit_log = {\n", "    'job_name': job_gliner.job_name,\n", "    'date': pd.Timestamp.now(),\n", "    'method': 'gliner',\n", "    'strategy': 'fake',\n", "    'entities': ['person', 'email', 'phone_number', 'address', 'ssn'],\n", "    'records_processed': len(df),\n", "    'values_modified': count_changes(df, redacted_df_gliner)\n", "}\n", "\n", "print(\"\\n\ud83d\udcdd Audit Log:\")\n", "print(audit_log)" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Next Steps\n", "\n", "- {doc}`../about/pii-replacement` - PII replacement concepts\n", "- [Differential Privacy Tutorial](tutorial-differential-privacy) - Add DP guarantees\n", "- {doc}`../about/index` - Core concepts and components\n", "\n", "---\n", "\n", "## Troubleshooting\n", "\n", "### PII Not Detected\n", "\n", "- Try different detection methods\n", "- Add custom entity definitions\n", "- Use LLM classification for context-aware detection\n", "- Review entity type names (case-sensitive)\n", "\n", "### Incorrect Replacements\n", "\n", "- Adjust confidence thresholds\n", "- Use more specific entity types\n", "- Combine with regex for known patterns\n", "- Test on sample data first\n", "\n", "### Performance Issues\n", "\n", "- Use GLiNER for standard PII (fastest)\n", "- Reserve LLM for complex cases only\n", "- Process large datasets in batches (see the sketch below)\n", "- Use regex for structured patterns" ] },
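{ "cell_type": "markdown", "metadata": {}, "source": [ "One way to batch is to split the DataFrame client-side and submit one PII replacement job per chunk. The sketch below only reuses calls shown earlier in this tutorial; note that `fake` replacements may not stay consistent for entities that repeat across chunks, whereas `hash` with a fixed salt will:" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Illustrative batching helper (a sketch, not a built-in API): split a large\n", "# DataFrame into chunks and submit one PII replacement job per chunk.\n", "def replace_pii_in_batches(frame, chunk_size=10_000):\n", "    results = []\n", "    for start in range(0, len(frame), chunk_size):\n", "        chunk = frame.iloc[start:start + chunk_size]\n", "        builder = (\n", "            SafeSynthesizerJobBuilder(client)\n", "            .with_data_source(chunk)\n", "            .with_classify_model_provider(provider_name)\n", "            .with_replace_pii(\n", "                detection_method=\"gliner\",\n", "                entity_types=[\"person\", \"email\"],\n", "                replacement_strategy=\"hash\",  # hash + fixed salt stays consistent across chunks\n", "                hash_salt=\"my-secret-salt\"\n", "            )\n", "        )\n", "        job = builder.create_job(name=f\"pii-batch-{start}-{int(time.time())}\", project=project_name)\n", "        job.wait_for_completion()\n", "        results.append(job.fetch_data())\n", "    return pd.concat(results, ignore_index=True)\n", "\n", "# Example usage:\n", "# df_batched = replace_pii_in_batches(df)  # single chunk for the tiny demo df" ], "outputs": [], "execution_count": null },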
{ "cell_type": "markdown", "metadata": {}, "source": [ "For more help, see {doc}`../about/pii-replacement`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }