{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "(tutorial-safe-synthesizer-101)=\n", "\n", "# Safe Synthesizer 101\n", "\n", "Learn the fundamentals of {{nss_short_name}} by creating your first Safe Synthesizer job using provided defaults. In this tutorial, you'll upload sample customer data, replace personally identifiable information, fine-tune a model, generate synthetic records, and review the evaluation report.\n", "\n", "## Prerequisites\n", "\n", "Before you begin, make sure that you have:\n", "\n", "- Access to a deployment of {{nss_short_name}} (see {doc}`../getting-started`)\n", "- A Python environment with the `nemo-microservices` SDK installed\n", "- A basic understanding of Python and pandas\n", "\n", "---\n", "\n", "## What You'll Learn\n", "\n", "By the end of this tutorial, you'll understand how to:\n", "\n", "- Upload datasets for processing\n", "- Run Safe Synthesizer jobs using the Python SDK\n", "- Track job progress and retrieve results\n", "- Interpret evaluation reports\n", "\n", "---\n", "\n", "## Step 1: Install the SDK\n", "\n", "Install the NeMo Microservices SDK with Safe Synthesizer support:" ] }, { "cell_type": "code", "metadata": { "language": "shell", "vscode": { "languageId": "shellscript" } }, "source": [ "if command -v uv &> /dev/null; then\n", "    uv pip install 'nemo-microservices[safe-synthesizer]' kagglehub matplotlib\n", "else\n", "    pip install 'nemo-microservices[safe-synthesizer]' kagglehub matplotlib\n", "fi" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 2: Configure the Client\n", "\n", "Set up the client to connect to your Safe Synthesizer deployment:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "from nemo_microservices import NeMoMicroservices\n", "\n", "# Configure the client\n", "client = NeMoMicroservices(\n", "    base_url=os.environ.get(\"NMP_BASE_URL\", \"http://localhost:8080\")\n", ")\n", "# 
Left as None by default; set it in Step 6 if you need a HuggingFace token\n", "hf_secret_name = None\n", "\n", "print(\"\u2705 Client configured successfully\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 3: Verify Service Connection\n", "\n", "Test the connection to ensure Safe Synthesizer is accessible:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "try:\n", "    jobs = client.safe_synthesizer.jobs.list(workspace=\"default\")\n", "    print(\"\u2705 Successfully connected to Safe Synthesizer service\")\n", "    print(f\"Found {len(jobs.data)} existing jobs\")\n", "except Exception as e:\n", "    print(f\"\u274c Cannot connect to service: {e}\")\n", "    print(\"Please verify base_url and service status\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 4: Load Sample Dataset\n", "\n", "For this tutorial, we'll use a women's clothing reviews dataset from Kaggle that contains some PII:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import pandas as pd\n", "import kagglehub  # type: ignore[import-not-found]\n", "\n", "# Download the dataset\n", "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n", "df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n", "\n", "print(f\"\u2705 Loaded dataset with {len(df)} records\")\n", "print(\"\\nDataset preview:\")\n", "print(df.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Dataset details:**\n", "\n", "- Contains customer reviews of women's clothing\n", "- Includes age, product category, rating, and review text\n", "- Some reviews contain PII such as height, weight, age, and location\n", "\n", "---\n", "\n", "## Step 5: Configure Column Classification\n", "\n", "Before running jobs, set up column classification for accurate PII detection.\n", "\n", 
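"To build intuition for what this step does, here is a rough local sketch (a simple heuristic stand-in written for this tutorial, not the service's LLM-based implementation) that buckets columns as numeric, categorical, or free text:\n", "\n", "```python\n", "import pandas as pd\n", "\n", "def rough_column_classes(df, max_categories=20, text_len=30):\n", "    # Heuristic stand-in: numeric dtype -> numeric; long or high-cardinality\n", "    # string columns -> free text; everything else -> categorical\n", "    classes = {}\n", "    for col in df.columns:\n", "        s = df[col]\n", "        if pd.api.types.is_numeric_dtype(s):\n", "            classes[col] = 'numeric'\n", "        elif s.astype(str).str.len().mean() > text_len or s.nunique(dropna=True) > max_categories:\n", "            classes[col] = 'free_text'\n", "        else:\n", "            classes[col] = 'categorical'\n", "    return classes\n", "```\n", "\n", "The service's LLM-based classifier is more nuanced and feeds the PII detection step; this sketch only illustrates the kind of per-column labels such a step produces.\n", "\n", 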
":::{tip}\n", "Column classification uses an LLM to automatically detect column types and improve PII detection accuracy. Without this setup, you may see classification errors and reduced detection quality.\n", "\n", "Column classification sends example values from your data to the LLM.\n", "Use an internally deployed LLM if you do not want to send your data to build.nvidia.com.\n", ":::" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "import time\n", "\n", "# Get your API key from https://build.nvidia.com/\n", "# Set as environment variable: export NIM_API_KEY=nvapi-...\n", "api_key = os.environ.get(\"NIM_API_KEY\")\n", "\n", "if not api_key:\n", "    raise ValueError(\n", "        \"NIM_API_KEY is required. Get your free API key from https://build.nvidia.com/\"\n", "    )\n", "\n", "# Create the API key as a secret\n", "timestamp = int(time.time())\n", "api_key_secret_name = f\"nim-api-key-tutorial-{timestamp}\"\n", "client.secrets.create(workspace=\"default\", name=api_key_secret_name, data=api_key)\n", "print(f\"\u2705 Created API key secret: {api_key_secret_name}\")\n", "\n", "# Create the model provider for column classification\n", "provider_name = f\"classify-llm-tutorial-{timestamp}\"\n", "client.inference.providers.create(\n", "    workspace=\"default\",\n", "    name=provider_name,\n", "    host_url=\"https://integrate.api.nvidia.com/v1\",\n", "    api_key_secret_name=api_key_secret_name,\n", "    description=\"Model provider for Safe Synthesizer column classification\",\n", ")\n", "print(f\"\u2705 Created model provider: {provider_name}\")\n", "print(\"\u2705 Column classification configured\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{tip}\n", "**Secret naming best practice**: Use lowercase letters, numbers, and hyphens in secret names for Kubernetes compatibility (e.g., `hf-token` not `hf_token` or `HF_TOKEN`).\n", ":::\n", "\n", "For more details on managing secrets, see {doc}`../../set-up/manage-secrets`.\n", "\n", "---\n", "\n", "## Step 6: HuggingFace Token Usage (Optional)\n", "\n", "If you're using private HuggingFace models or want to avoid rate limits, create a secret for your [HuggingFace token](https://huggingface.co/settings/tokens):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import os\n", "import time\n", "\n", "hf_token = os.environ.get(\"HF_TOKEN\")\n", "\n", "if hf_token:\n", "    # Create a unique secret name (use hyphens, not underscores)\n", "    hf_secret_name = f\"hf-token-{int(time.time())}\"\n", "    # Store your HuggingFace token as a platform secret\n", "    client.secrets.create(\n", "        workspace=\"default\",\n", "        name=hf_secret_name,\n", "        data=hf_token\n", "    )\n", "    print(f\"\u2705 Created secret: {hf_secret_name}\")\n", "" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 7: Create and Run a Safe Synthesizer Job\n", "\n", "Use the `SafeSynthesizerJobBuilder` to configure and create a job:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import pandas as pd\n", "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", "\n", "# Create a project for our jobs (creates if it doesn't exist)\n", "project_name = \"test-project\"\n", "try:\n", "    client.projects.create(workspace=\"default\", name=project_name)\n", "except Exception:\n", "    pass  # Project may already exist\n", "\n", "# Build the job configuration\n", "job_name = f\"synthesis-test-{pd.Timestamp.now().strftime('%Y%m%d-%H%M%S')}\"\n", "builder = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_classify_model_provider(provider_name)  # Enable column classification\n", "    .with_replace_pii()  # Enable PII replacement\n", "    .synthesize()  # Enable synthesis\n", ")\n", "\n", "if hf_secret_name:\n", "    # Add the token secret if an HF token was specified\n", "    builder = builder.with_hf_token_secret(hf_secret_name)\n", 
"\n", "# Create and start the job\n", "job = builder.create_job(name=job_name, project=project_name)\n", "print(f\"\u2705 Job created: {job.job_name}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What happens next:**\n", "\n", "1. Dataset upload to fileset storage\n", "2. PII detection and replacement\n", "3. Model fine-tuning on your data\n", "4. Synthetic data generation\n", "5. Quality and privacy evaluation\n", "\n", "---\n", "\n", "## Step 8: Monitor Job Progress\n", "\n", "Check the job status:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "status = job.fetch_status()\n", "print(f\"Current status: {status}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Job States:**\n", "\n", "- `created`: Job has been created\n", "- `pending`: Waiting for GPU resources\n", "- `active`: Processing your data\n", "- `completed`: Finished successfully\n", "- `error`: Encountered an error\n", "\n", "View real-time logs:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "job.print_logs()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait for completion (this may take 15-30 minutes depending on data size):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "print(\"\u23f3 Waiting for job to complete...\")\n", "job.wait_for_completion()\n", "print(\"\u2705 Job completed!\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 9: Retrieve Synthetic Data\n", "\n", "Once the job is complete, retrieve the generated synthetic data:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "synthetic_df = job.fetch_data()\n", "\n", "print(f\"\u2705 Generated {len(synthetic_df)} synthetic records\")\n", "print(\"\\nSynthetic data preview:\")\n", "print(synthetic_df.head())" ], "outputs": [], "execution_count": 
null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare with original data structure:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "print(\"\\n\ud83d\udcca Data Comparison:\")\n", "print(f\"Original shape: {df.shape}\")\n", "print(f\"Synthetic shape: {synthetic_df.shape}\")\n", "print(f\"\\nOriginal columns: {list(df.columns)}\")\n", "print(f\"Synthetic columns: {list(synthetic_df.columns)}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 10: Review Evaluation Report\n", "\n", "Fetch the job summary with high-level metrics:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "summary = job.fetch_summary()\n", "\n", "print(\"\ud83d\udcc8 Evaluation Summary:\")\n", "print(f\" Synthetic Quality Score: {summary.synthetic_data_quality_score}\")\n", "print(f\" Data Privacy Score: {summary.data_privacy_score}\")\n", "print(f\" Valid Records: {summary.num_valid_records}/{summary.num_prompts}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the full HTML evaluation report:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "job.save_report(\"./evaluation_report.html\")\n", "print(\"\u2705 Evaluation report saved to evaluation_report.html\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "If using Jupyter, display the report inline:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "job.display_report_in_notebook()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The evaluation report includes:**\n", "\n", "- **Synthetic Quality Score (SQS)**: Measures data utility\n", " - Column correlation stability\n", " - Distribution similarity\n", " - Text semantic similarity\n", "- **Data Privacy Score (DPS)**: Measures privacy protection\n", " - Membership inference protection\n", " - 
Attribute inference protection\n", " - PII replay detection\n", "\n", "---\n", "\n", "## Understanding the Results\n", "\n", "### Interpreting Scores\n", "\n", "**Synthetic Quality Score (SQS):**\n", "\n", "- 90-100: Excellent - synthetic data closely matches original utility\n", "- 70-89: Good - suitable for most use cases\n", "- 50-69: Fair - noticeable differences\n", "- Below 50: Poor - consider adjusting configuration\n", "\n", "**Data Privacy Score (DPS):**\n", "\n", "- 90-100: Excellent - strong privacy protection\n", "- 70-89: Good - adequate for most use cases\n", "- 50-69: Fair - consider enabling differential privacy\n", "- Below 50: Poor - insufficient privacy protection\n", "\n", "### Example Analysis" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Compare distributions\n", "import matplotlib.pyplot as plt\n", "\n", "# Example: Compare age distribution\n", "if 'Age' in df.columns and 'Age' in synthetic_df.columns:\n", " fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n", " \n", " ax1.hist(df['Age'].dropna(), bins=20, alpha=0.7, edgecolor='black')\n", " ax1.set_title('Original Age Distribution')\n", " ax1.set_xlabel('Age')\n", " ax1.set_ylabel('Frequency')\n", " \n", " ax2.hist(synthetic_df['Age'].dropna(), bins=20, alpha=0.7, \n", " edgecolor='black', color='green')\n", " ax2.set_title('Synthetic Age Distribution')\n", " ax2.set_xlabel('Age')\n", " ax2.set_ylabel('Frequency')\n", " \n", " plt.tight_layout()\n", " plt.show()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Next Steps\n", "\n", "Now that you've completed your first Safe Synthesizer job, explore more advanced features:\n", "\n", "### Advanced Tutorials\n", "\n", "- [Differential Privacy Deep Dive](tutorial-differential-privacy) - Apply mathematical privacy guarantees\n", "- [PII Replacement Deep Dive](tutorial-pii-replacement) - Advanced PII detection and replacement\n", "\n", "### Documentation\n", 
"\n", "- {doc}`../about/index` - Understand core concepts\n", "\n", "### Try These Next\n", "\n", "1. **Customize PII replacement**: Configure specific entity types and replacement strategies\n", "2. **Enable differential privacy**: Add formal privacy guarantees with epsilon and delta parameters\n", "3. **Tune generation parameters**: Adjust temperature and sampling for better synthetic data\n", "4. **Use your own data**: Replace the sample dataset with your sensitive data\n", "\n", "---\n", "\n", "## Cleanup\n", "\n", "List and optionally delete completed jobs:" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# List all jobs\n", "all_jobs = client.safe_synthesizer.jobs.list(workspace=\"default\")\n", "print(f\"Total jobs: {len(all_jobs.data)}\")\n", "\n", "# Delete this job (optional)\n", "# client.safe_synthesizer.jobs.delete(job.job_name, workspace=\"default\")\n", "# print(f\"\u2705 Job {job.job_name} deleted\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Troubleshooting\n", "\n", "### Common Issues\n", "\n", "**Connection errors:**\n", "\n", "- Verify `NMP_BASE_URL` is correct\n", "- Check that the Safe Synthesizer service is running\n", "- Ensure network connectivity\n", "\n", "**Job failures:**\n", "\n", "- Check logs with `job.print_logs()`\n", "- Verify dataset format (CSV with proper columns)\n", "- Ensure sufficient GPU memory for model size\n", "\n", "**Slow performance:**\n", "\n", "- Reduce dataset size for testing\n", "- Use a smaller model (adjust `training.pretrained_model`)\n", "- Check GPU availability\n", "\n", "For more help, see {doc}`../about/jobs`.\n", "\n", "**Error: \"Dataset must have at least 200 records to use holdout.\"**\n", "\n", "This occurs when synthesis is enabled on datasets with fewer than 200 records. Holdout validation\n", "splits your data into training and test sets to measure quality, requiring a minimum dataset size.\n", "\n", "**Solution:**" ] }, { "cell_type": "code", "metadata": {}, "source": [ "builder = (\n", "    SafeSynthesizerJobBuilder(client)\n", "    .with_data_source(df)\n", "    .with_data(holdout=0)  # Disable holdout for small datasets\n", "    .with_replace_pii()\n", "    .synthesize()\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{warning}\n", "Disabling holdout means you won't get evaluation metrics such as the Synthetic Quality Score and\n", "Data Privacy Score. For production use, ensure your dataset has at least 200 records.\n", ":::" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }