# Evaluation
Evaluation is a critical component of NeMo Safe Synthesizer that helps you understand both the utility and privacy of your synthetic data. The evaluation step is enabled by default and provides comprehensive reports comparing your original and synthetic datasets across multiple dimensions.
## How It Works
The evaluation system compares your original and synthetic datasets using two main frameworks:
- **Synthetic Quality Score (SQS)**: Measures how well the synthetic data preserves statistical properties and utility
- **Data Privacy Score (DPS)**: Assesses privacy protection and resistance to various attack vectors
Each framework consists of multiple metrics that are combined into an overall score.
## Synthetic Quality Score (SQS)
The SQS measures data utility across several dimensions:
### Column Correlation Stability
Analyzes the correlation across every pair of columns (a sketch of this check follows the list):

- Compares correlation matrices between original and synthetic data
- Ensures relationships between variables are preserved
- Critical for maintaining predictive power in ML models
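A minimal sketch of this kind of check using pandas; the scoring rule here is illustrative, not Safe Synthesizer's exact formula:

```python
import pandas as pd

def correlation_stability(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Compare pairwise correlation matrices and map the gap to 0-100."""
    numeric_cols = original.select_dtypes("number").columns
    corr_orig = original[numeric_cols].corr()
    corr_synth = synthetic[numeric_cols].corr()
    # Mean absolute difference across all column pairs; correlations
    # range over [-1, 1], so any pairwise difference is at most 2.
    mean_abs_diff = (corr_orig - corr_synth).abs().mean().mean()
    return 100.0 * (1.0 - mean_abs_diff / 2.0)
```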
### Deep Structure Stability
Uses Principal Component Analysis (PCA) to reduce dimensionality when comparing datasets (sketched after this list):

- Captures overall data structure and patterns
- Evaluates high-dimensional relationships
- Assesses whether data maintains its fundamental characteristics
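A rough sketch of the idea with scikit-learn; comparing projection centroids is a simplification of a full distributional comparison, not Safe Synthesizer's exact method:

```python
import numpy as np
from sklearn.decomposition import PCA

def deep_structure_distance(original: np.ndarray, synthetic: np.ndarray,
                            n_components: int = 2) -> float:
    """Fit PCA on the original data, project both datasets into the
    reduced space, and measure how far apart their centers sit."""
    pca = PCA(n_components=n_components).fit(original)
    proj_orig = pca.transform(original)
    proj_synth = pca.transform(synthetic)
    # Distance between centroids in PCA space: 0 means the synthetic
    # data occupies the same region of the reduced space.
    return float(np.linalg.norm(proj_orig.mean(axis=0) - proj_synth.mean(axis=0)))
```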
### Column Distribution Stability
Compares the distribution of each column in the original data to the matching column in the synthetic data (see the example after this list):

- Statistical tests for numeric columns (Kolmogorov-Smirnov test, Wasserstein distance)
- Frequency comparison for categorical columns
- Identifies distribution drift
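Per-column drift metrics of this kind can be computed with scipy and pandas; this is a sketch, not the product's internal implementation:

```python
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance

def column_distribution_report(original: pd.DataFrame,
                               synthetic: pd.DataFrame) -> dict:
    """Per-column drift metrics: KS statistic and Wasserstein distance
    for numeric columns, total variation distance for categoricals."""
    report = {}
    for col in original.columns:
        orig_vals = original[col].dropna()
        synth_vals = synthetic[col].dropna()
        if pd.api.types.is_numeric_dtype(original[col]):
            ks_stat, _ = ks_2samp(orig_vals, synth_vals)
            report[col] = {
                "ks": ks_stat,
                "wasserstein": wasserstein_distance(orig_vals, synth_vals),
            }
        else:
            freq_orig = orig_vals.value_counts(normalize=True)
            freq_synth = synth_vals.value_counts(normalize=True)
            # Total variation distance between the two frequency tables.
            report[col] = {
                "total_variation": freq_orig.sub(freq_synth, fill_value=0).abs().sum() / 2
            }
    return report
```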
### Text Structure Similarity
For text columns, calculates sentence, word, and character counts (illustrated below):

- Compares structural properties of text
- Ensures text length and complexity are preserved
- Validates text generation quality
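A simple version of these counts with pandas (the sentence counter is deliberately naive):

```python
import pandas as pd

def text_structure_stats(texts: pd.Series) -> pd.DataFrame:
    """Sentence, word, and character counts per text value."""
    return pd.DataFrame({
        # Naive sentence count: runs of terminal punctuation.
        "sentences": texts.str.count(r"[.!?]+"),
        "words": texts.str.split().str.len(),
        "chars": texts.str.len(),
    })

# Comparing summary statistics of the two columns flags structural drift:
# text_structure_stats(original["notes"]).describe()
# text_structure_stats(synthetic["notes"]).describe()
```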
### Text Semantic Similarity
Assesses whether the semantic meaning of the text is preserved after synthesis (see the sketch after this list):

- Uses embedding-based similarity measures
- Captures contextual and semantic properties
- Ensures text maintains intended meaning
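One way to approximate such a measure is with an off-the-shelf embedding model; the `sentence-transformers` library and model named here are illustrative assumptions, not what Safe Synthesizer necessarily uses internally:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# all-MiniLM-L6-v2 is a common general-purpose embedding model
# (an illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(original_texts: list[str], synthetic_texts: list[str]) -> float:
    """Cosine similarity between the centroid embeddings of the two
    text sets, as a rough proxy for preserved meaning."""
    emb_orig = model.encode(original_texts)
    emb_synth = model.encode(synthetic_texts)
    sim = cosine_similarity(
        emb_orig.mean(axis=0, keepdims=True),
        emb_synth.mean(axis=0, keepdims=True),
    )
    return float(sim[0, 0])
```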
## Data Privacy Score (DPS)
The DPS assesses privacy protection through attack simulations:
### Membership Inference Protection
Tests whether attackers can determine if specific records were in the training data (a sketch of one such attack follows the list):

- Simulates membership inference attacks
- Measures how distinguishable training records are
- Higher scores indicate better privacy protection
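One common empirical attack in this family is distance-to-closest-record: the attacker guesses that a record was a training member if its nearest synthetic neighbor is unusually close. A sketch, assuming numeric feature matrices and a held-out sample (not necessarily the attack Safe Synthesizer runs):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def membership_inference_auc(train_data: np.ndarray,
                             holdout_data: np.ndarray,
                             synthetic_data: np.ndarray) -> float:
    """Distance-to-closest-record attack. AUC near 0.5 means the attack
    cannot tell members from non-members (good privacy); AUC near 1.0
    means training records are clearly distinguishable."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic_data)
    dist_train, _ = nn.kneighbors(train_data)
    dist_holdout, _ = nn.kneighbors(holdout_data)
    distances = np.concatenate([dist_train.ravel(), dist_holdout.ravel()])
    labels = np.concatenate([np.ones(len(train_data)), np.zeros(len(holdout_data))])
    # Smaller distance suggests membership, so score by negated distance.
    return float(roc_auc_score(labels, -distances))
```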
### Attribute Inference Protection
Assesses whether sensitive attributes can be inferred when other attributes are known (sketched below):

- Tests ability to predict hidden values
- Measures information leakage
- Validates that synthesis doesn't create inference vulnerabilities
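The simulation can be approximated by training an attacker model on the synthetic data and testing it against the original records; column names and the model choice here are placeholders, and numeric features are assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def attribute_inference_accuracy(synthetic: pd.DataFrame, original: pd.DataFrame,
                                 sensitive_col: str, known_cols: list[str]) -> float:
    """Train an attacker on synthetic records to predict a sensitive
    attribute from known attributes, then test on the original data.
    Accuracy well above the majority-class baseline suggests leakage."""
    attacker = RandomForestClassifier(n_estimators=100, random_state=0)
    attacker.fit(synthetic[known_cols], synthetic[sensitive_col])
    predictions = attacker.predict(original[known_cols])
    return accuracy_score(original[sensitive_col], predictions)
```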
### PII Replay
Evaluates how often sensitive values from the original data appear verbatim in the synthetic version (see the sketch after this list):

- Checks for exact matches of PII values
- Identifies potential memorization
- Critical for compliance and privacy guarantees
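An exact-match check is straightforward with pandas; `pii_cols` and the example column names are placeholders:

```python
import pandas as pd

def pii_replay_rate(original: pd.DataFrame, synthetic: pd.DataFrame,
                    pii_cols: list[str]) -> dict:
    """Fraction of synthetic values in each PII column that exactly
    match a value present in the original data."""
    rates = {}
    for col in pii_cols:
        original_values = set(original[col].dropna())
        replayed = synthetic[col].dropna().isin(original_values)
        rates[col] = float(replayed.mean())
    return rates

# e.g. pii_replay_rate(orig_df, synth_df, ["email", "phone_number"])
```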
## Privacy Guarantees and Evaluation
The Data Privacy Score (DPS) measures empirical privacy through attack simulations and real-world privacy tests. When you enable differential privacy during synthesis, you gain both:
- **Mathematical privacy guarantees** (epsilon/delta bounds from DP): formal proof that the algorithm limits information leakage (see the definition below)
- **Empirical privacy measurement** (DPS from evaluation): practical testing of privacy protection against attacks
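For reference, the epsilon/delta bound is the standard definition of differential privacy: a mechanism $M$ is $(\varepsilon, \delta)$-differentially private if, for any two datasets $D$ and $D'$ differing in a single record and any set of outputs $S$,

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Smaller epsilon and delta mean any one record can shift the output distribution only slightly, which is what limits how much the synthesizer can reveal about an individual.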
These two approaches are complementary:
- **Differential privacy** provides worst-case theoretical guarantees regardless of data or model
- **DPS evaluation** measures actual privacy in practice for your specific dataset and configuration
**Impact of Differential Privacy on Scores:**
- Enabling DP typically improves DPS by reducing memorization and attack success rates
- Lower epsilon (stronger privacy) generally yields higher DPS scores
- DP may reduce SQS due to the privacy-utility tradeoff (added noise affects quality)
**Interpreting Combined Metrics** (a small helper encoding this matrix follows the list):

- High DPS + High SQS = Excellent privacy and utility balance
- High DPS + Lower SQS = Strong privacy with acceptable quality loss
- Lower DPS + High SQS = Good utility, but consider enabling DP for stronger privacy
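As a worked version of this matrix, a trivial helper might look like the following; the 70-point threshold mirrors the "Good" band in the score tables below and is illustrative:

```python
def interpret_scores(sqs: float, dps: float, threshold: float = 70.0) -> str:
    """Map an (SQS, DPS) pair onto the guidance above."""
    if dps >= threshold and sqs >= threshold:
        return "Excellent privacy and utility balance"
    if dps >= threshold:
        return "Strong privacy with acceptable quality loss"
    if sqs >= threshold:
        return "Good utility; consider enabling differential privacy"
    return "Review both privacy and synthesis configuration"
```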
For more on differential privacy configuration and privacy-utility tradeoffs, see Data Synthesis and Differential Privacy Deep Dive.
## Evaluation Reports
Every NeMo Safe Synthesizer job automatically generates an HTML evaluation report containing:
- Overall SQS and DPS scores
- Detailed subscores for each metric
- Visualizations comparing original and synthetic data
- Statistical test results
- Recommendations for improvement
The report provides both high-level summaries for stakeholders and detailed technical metrics for data scientists.
## Configuration
Evaluation is enabled by default but can be customized:
```json
{
  "evaluation": {
    "mia_enabled": true,
    "aia_enabled": true
  }
}
```

The `mia_enabled` and `aia_enabled` flags toggle the membership inference and attribute inference attack simulations, respectively.
## Interpreting Scores
### SQS Interpretation
- **90-100**: Excellent - synthetic data closely matches original utility
- **70-89**: Good - suitable for most use cases with minor differences
- **50-69**: Fair - noticeable differences, may impact some analyses
- **Below 50**: Poor - significant utility loss, review configuration
### DPS Interpretation
- **90-100**: Excellent - strong privacy protection
- **70-89**: Good - adequate privacy for most use cases
- **50-69**: Fair - some privacy risks, consider differential privacy
- **Below 50**: Poor - insufficient privacy protection