Evaluation#

Evaluation is a critical component of NeMo Safe Synthesizer that helps you understand both the utility and privacy of your synthetic data. The evaluation step is enabled by default and provides comprehensive reports comparing your original and synthetic datasets across multiple dimensions.

How It Works#

The evaluation system compares your original and synthetic datasets using two main frameworks:

  1. Synthetic Quality Score (SQS): Measures how well the synthetic data preserves statistical properties and utility

  2. Data Privacy Score (DPS): Assesses privacy protection and resistance to various attack vectors

Each framework consists of multiple metrics that are combined into an overall score.

Synthetic Quality Score (SQS)#

The SQS measures data utility across several dimensions:

Column Correlation Stability#

Analyzes the correlation between every pair of columns:

  • Compares correlation matrices between original and synthetic data

  • Ensures relationships between variables are preserved

  • Critical for maintaining predictive power in ML models
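
The idea can be sketched as follows. This is an illustrative stand-in, not the actual SQS computation (which is internal to NeMo Safe Synthesizer): compute both correlation matrices and score how little they differ.

```python
import numpy as np
import pandas as pd

def correlation_stability(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Illustrative score: 1 minus the mean absolute difference between
    the two correlation matrices (numeric columns, upper triangle only)."""
    cols = original.select_dtypes("number").columns
    diff = (original[cols].corr() - synthetic[cols].corr()).abs()
    # Average only the upper triangle to avoid double-counting pairs.
    return 1.0 - diff.values[np.triu_indices(len(cols), k=1)].mean()
```

A score near 1 means pairwise relationships survived synthesis; identical data scores exactly 1.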

Deep Structure Stability#

Uses Principal Component Analysis (PCA) to reduce dimensionality before comparing datasets:

  • Captures overall data structure and patterns

  • Evaluates high-dimensional relationships

  • Assesses whether data maintains its fundamental characteristics
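
One simple way to approximate this check (a sketch only; the real metric may compare datasets differently) is to fit PCA to each dataset and compare how variance is distributed across components:

```python
import numpy as np
from sklearn.decomposition import PCA

def deep_structure_similarity(original, synthetic, n_components=2):
    """Illustrative: compare the variance structure captured by PCA.
    Returns 1 minus the total absolute difference in explained-variance
    ratios, clipped to [0, 1]."""
    evr_o = PCA(n_components=n_components).fit(original).explained_variance_ratio_
    evr_s = PCA(n_components=n_components).fit(synthetic).explained_variance_ratio_
    return float(np.clip(1.0 - np.abs(evr_o - evr_s).sum(), 0.0, 1.0))
```

If the synthetic data concentrates variance in different directions than the original, the score drops.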

Column Distribution Stability#

Compares the distribution for each column in the original data to the matching column in the synthetic data:

  • Statistical tests for numeric columns (KS test, Wasserstein distance)

  • Frequency comparison for categorical columns

  • Identifies distribution drift
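
The per-column comparisons named above can be sketched with SciPy and pandas (illustrative thresholds and scoring are up to the real implementation):

```python
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance

def distribution_drift(original: pd.Series, synthetic: pd.Series):
    """Numeric columns: KS statistic and Wasserstein distance
    (lower values mean closer distributions)."""
    ks = ks_2samp(original, synthetic).statistic
    wd = wasserstein_distance(original, synthetic)
    return {"ks_statistic": ks, "wasserstein": wd}

def frequency_drift(original: pd.Series, synthetic: pd.Series) -> float:
    """Categorical columns: total variation distance between
    category frequencies (0 = identical, 1 = disjoint)."""
    p = original.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    return 0.5 * p.subtract(q, fill_value=0).abs().sum()
```

Identical columns yield zero drift under both measures.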

Text Structure Similarity#

For text columns, calculates sentence, word, and character counts:

  • Compares structural properties of text

  • Ensures text length and complexity are preserved

  • Validates text generation quality
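
A minimal sketch of the counts involved (the sentence splitter here is a naive placeholder; the actual metric may tokenize differently):

```python
import re

def text_structure(texts):
    """Per-corpus averages of sentence, word, and character counts."""
    n = len(texts)
    # Naive sentence split on ., !, ? -- a stand-in for real tokenization.
    sentences = sum(
        len([s for s in re.split(r"[.!?]+", t) if s.strip()]) for t in texts
    ) / n
    words = sum(len(t.split()) for t in texts) / n
    chars = sum(len(t) for t in texts) / n
    return {"sentences": sentences, "words": words, "chars": chars}
```

Comparing these averages between the original and synthetic corpora reveals whether text length and complexity were preserved.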

Text Semantic Similarity#

Assesses whether the semantic meaning of the text is preserved after synthesis:

  • Uses embedding-based similarity measures

  • Captures contextual and semantic properties

  • Ensures text maintains intended meaning
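
In outline, an embedding-based check pairs each original text with its synthetic counterpart and averages their cosine similarity. The `embed` function below is a hypothetical placeholder for whatever sentence-embedding model the evaluator actually uses:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_similarity(original_texts, synthetic_texts, embed):
    """Average pairwise cosine similarity between embeddings.
    `embed` is any text -> vector function (hypothetical here)."""
    sims = [cosine_similarity(embed(o), embed(s))
            for o, s in zip(original_texts, synthetic_texts)]
    return float(np.mean(sims))
```

A score near 1 suggests the synthetic text conveys the same meaning even when the wording differs.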

Data Privacy Score (DPS)#

The DPS assesses privacy protection through attack simulations:

Membership Inference Protection#

Tests whether attackers can determine if specific records were in the training data:

  • Simulates membership inference attacks

  • Measures how distinguishable training records are

  • Higher scores indicate better privacy protection
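
To make the attack concrete, here is a toy distance-based membership test: the attacker flags a candidate as a suspected training record if a synthetic record lies unusually close to it. Real attacks, and the DPS simulation, are considerably more sophisticated; this is only a sketch of the threat model.

```python
import numpy as np

def nearest_distance(record: np.ndarray, data: np.ndarray) -> float:
    """Euclidean distance from a record to its nearest neighbor in data."""
    return float(np.linalg.norm(data - record, axis=1).min())

def membership_attack(candidates, synthetic, threshold):
    """Toy membership inference: claim membership if a very close
    synthetic record exists. A robust synthesizer keeps training
    records indistinguishable under such tests."""
    return [nearest_distance(c, synthetic) < threshold for c in candidates]
```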

Attribute Inference Protection#

Assesses whether sensitive attributes can be inferred when other attributes are known:

  • Tests ability to predict hidden values

  • Measures information leakage

  • Validates that synthesis doesn’t create inference vulnerabilities
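
A toy version of this attack uses the synthetic data itself to guess a hidden attribute from known ones, here via a single nearest neighbor (a sketch of the threat model, not the actual DPS simulation):

```python
import numpy as np

def attribute_inference(known, synthetic_known, synthetic_sensitive):
    """Toy 1-nearest-neighbor attribute inference: predict each record's
    hidden attribute from the most similar synthetic record. DPS measures
    how often such guesses succeed against the synthetic data."""
    idx = [int(np.linalg.norm(synthetic_known - k, axis=1).argmin())
           for k in known]
    return synthetic_sensitive[idx]
```

If these predictions are much more accurate than a baseline guess, the synthetic data is leaking attribute information.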

PII Replay#

Evaluates the frequency with which sensitive values from the original data appear in the synthetic version:

  • Checks for exact matches of PII values

  • Identifies potential memorization

  • Critical for compliance and privacy guarantees
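
The core of this check is an exact-match count, sketched below (illustrative; a production check might also normalize formatting before comparing):

```python
import pandas as pd

def pii_replay_rate(original: pd.Series, synthetic: pd.Series) -> float:
    """Fraction of synthetic values that exactly match an original
    PII value. Nonzero rates signal potential memorization."""
    original_values = set(original.dropna())
    hits = synthetic.dropna().isin(original_values).sum()
    return hits / max(len(synthetic), 1)
```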

Privacy Guarantees and Evaluation#

The Data Privacy Score (DPS) measures empirical privacy through attack simulations and real-world privacy tests. When you enable differential privacy during synthesis, you gain both:

  1. Mathematical privacy guarantees (epsilon/delta bounds from DP) - Formal proof that the algorithm limits information leakage

  2. Empirical privacy measurement (DPS from evaluation) - Practical testing of privacy protection against attacks

These two approaches are complementary:

  • Differential privacy provides worst-case theoretical guarantees regardless of data or model

  • DPS evaluation measures actual privacy in practice for your specific dataset and configuration

Impact of Differential Privacy on Scores:

  • Enabling DP typically improves DPS by reducing memorization and attack success rates

  • Lower epsilon (stronger privacy) generally yields higher DPS scores

  • DP may reduce SQS because of the privacy-utility tradeoff (added noise affects quality)

Interpreting Combined Metrics:

  • High DPS + High SQS = Excellent privacy and utility balance

  • High DPS + Lower SQS = Strong privacy with acceptable quality loss

  • Lower DPS + High SQS = Good utility but consider enabling DP for stronger privacy

For more on differential privacy configuration and privacy-utility tradeoffs, see Data Synthesis and Differential Privacy Deep Dive.

Evaluation Reports#

Every NeMo Safe Synthesizer job automatically generates an HTML evaluation report containing:

  • Overall SQS and DPS scores

  • Detailed subscores for each metric

  • Visualizations comparing original and synthetic data

  • Statistical test results

  • Recommendations for improvement

The report provides both high-level summaries for stakeholders and detailed technical metrics for data scientists.

Configuration#

Evaluation is enabled by default but can be customized:

```json
{
    "evaluation": {
        "mia_enabled": true,
        "aia_enabled": true
    }
}
```

The `mia_enabled` and `aia_enabled` flags toggle the membership inference and attribute inference attack simulations, respectively.

Interpreting Scores#

SQS Interpretation#

  • 90-100: Excellent - synthetic data closely matches original utility

  • 70-89: Good - suitable for most use cases with minor differences

  • 50-69: Fair - noticeable differences, may impact some analyses

  • Below 50: Poor - significant utility loss, review configuration

DPS Interpretation#

  • 90-100: Excellent - strong privacy protection

  • 70-89: Good - adequate privacy for most use cases

  • 50-69: Fair - some privacy risks, consider differential privacy

  • Below 50: Poor - insufficient privacy protection
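
The two band tables above share the same thresholds, so a small helper (illustrative only) can map either score to its label:

```python
def interpret_score(score: float) -> str:
    """Map an SQS or DPS value (0-100) to the bands described above."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"
```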