# Parameters Reference
This page provides a complete reference for all configuration parameters available when creating NeMo Safe Synthesizer jobs. These schemas are automatically extracted from the authoritative OpenAPI specification, ensuring they are always in sync with the API.
## Top-Level Configuration
The SafeSynthesizerParametersInput schema defines the main configuration structure for Safe Synthesizer jobs.
| Parameter | Type | Description |
|---|---|---|
| data | object | Data parameters. |
| evaluation | object | Evaluation parameters. |
| enable_synthesis | boolean | Default: `true`. Enable synthesizing new data by training a model. |
| enable_replace_pii | boolean | Default: `true`. Enable replacing PII in the data. |
| training | object | Training parameters. |
| generation | object | Generation parameters. |
| privacy | object | Privacy parameters. Optional. |
| time_series | object | Time series parameters. |
| replace_pii | object | PII replacement parameters. Optional. |
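The nested objects above map one-to-one to the sections that follow. As a rough orientation, the overall shape can be pictured as a plain Python dict. This is an illustrative sketch of the structure only, not a validated configuration; the SDK example at the end of this page shows how parameters are actually supplied.

```python
# Illustrative sketch of the top-level parameter structure.
# Each nested object is described in the sections below; empty dicts
# simply mark where those parameter groups go.
safe_synthesizer_params = {
    "enable_synthesis": True,      # train a model and synthesize new data
    "enable_replace_pii": True,    # run PII replacement on the input
    "data": {},                    # Data Parameters
    "training": {},                # Training Parameters
    "generation": {},              # Generation Parameters
    "privacy": {},                 # Differential Privacy Parameters (optional)
    "evaluation": {},              # Evaluation Parameters
    "time_series": {},             # Time series parameters
    "replace_pii": {},             # PII Replacement Configuration (optional)
}
```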
## Data Parameters
Configuration for how to shape or use the input data, including grouping, ordering, and holdout settings.
| Parameter | Type | Description |
|---|---|---|
| group_training_examples_by | string | Column to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records. |
| order_training_examples_by | string | Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide `group_training_examples_by`. |
| max_sequences_per_example | string \| integer | Default: `auto`. If specified, adds at most this number of sequences per example; otherwise, fills up the context. Supports 'auto', where a value of 1 is chosen if differential privacy is enabled, and None otherwise. Required for DP to limit the contribution of each example. |
| holdout | number | Default: `0.05`. Number of records to hold out. If this is a float between 0 and 1, that ratio of records is held out. If it is an integer greater than 1, that number of records is held out. If the value is equal to zero, no holdout is performed. |
| max_holdout | integer | Default: `2000`. Maximum number of records to hold out. Overrides any behavior set by the `holdout` parameter. |
| random_state | integer | Use this random state for the holdout split to ensure reproducibility. |
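For example, to have the model learn inter-record and sequential structure in event-style data while reserving a small holdout, the data parameters might look like the sketch below. The column names are placeholders for your own dataset, and the values are illustrative rather than recommended settings.

```python
# Illustrative data parameters for event-style data: group records per
# customer, order them by timestamp, and hold out 5% of records
# (capped at 2000) with a fixed seed. Column names are placeholders.
data_params = {
    "group_training_examples_by": "customer_id",
    "order_training_examples_by": "event_timestamp",  # requires group_training_examples_by
    "max_sequences_per_example": "auto",              # resolves to 1 when DP is enabled
    "holdout": 0.05,
    "max_holdout": 2000,
    "random_state": 42,
}
```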
## Training Parameters
Hyperparameters for model fine-tuning, including learning rate, batch size, and LoRA configuration.
| Parameter | Type | Description |
|---|---|---|
| num_input_records_to_sample | string \| integer | Default: `auto`. Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto', where a reasonable value is chosen based on other config params and the data. |
| batch_size | integer | Default: `1`. The batch size per device for training. |
| gradient_accumulation_steps | integer | Default: `8`. Number of update steps to accumulate the gradients for before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory. |
| weight_decay | number | Default: `0.01`. The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the AdamW optimizer. |
| warmup_ratio | number | Default: `0.05`. Ratio of total training steps used for a linear warmup from 0 to the learning rate. |
| lr_scheduler | string | Default: `cosine`. The scheduler type to use. See the HuggingFace documentation of `SchedulerType` for all possible values. |
| learning_rate | number | Default: `0.0005`. The initial learning rate for the `AdamW` optimizer. |
| lora_r | integer | Default: `32`. The rank of the LoRA update matrices. A lower rank results in smaller update matrices with fewer trainable parameters. |
| lora_alpha_over_r | number | Default: `1.0`. The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. |
| lora_target_modules | string[] | Default: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`. The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'. |
| use_unsloth | string \| boolean | Default: `auto`. Whether to use unsloth. |
| rope_scaling_factor | string \| integer | Default: `auto`. Scale the base LLM's context length by this factor using RoPE scaling. |
| validation_ratio | number | Default: `0.0`. The fraction of the training data used for validation. The range is 0 to 1. If set to 0, no validation is performed. If set larger than 0, validation loss is computed and reported throughout training. |
| validation_steps | integer | Default: `15`. The number of steps between validation checks, passed to the HF Trainer arguments. |
| pretrained_model | string | Default: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`. Pretrained model to use for fine-tuning. |
| quantize_model | boolean | Default: `false`. Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy. |
| quantization_bits | integer | Default: `8`. The number of bits to use for quantization if `quantize_model` is true. Common values are 8 or 4 bits. Allowed: 4, 8. |
| peft_implementation | string | Default: `QLORA`. The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options include LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA). Each method has its own trade-offs in terms of performance and resource requirements. |
| max_vram_fraction | number | Default: `0.8`. The fraction of the total VRAM to use for training. Modify this to allow longer sequences to be used. |
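The training parameters combine standard fine-tuning hyperparameters with LoRA/QLoRA settings. The sketch below mostly mirrors the documented defaults and is illustrative only; tune these values for your hardware and dataset size.

```python
# Illustrative training parameters. Most values mirror the documented
# defaults; validation_ratio is raised so validation loss is reported.
training_params = {
    "num_input_records_to_sample": "auto",  # roughly 1 epoch if equal to the dataset size
    "batch_size": 1,
    "gradient_accumulation_steps": 8,       # effective batch size = 1 * 8
    "learning_rate": 0.0005,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "lora_r": 32,
    "lora_alpha_over_r": 1.0,               # LoRA alpha = 1.0 * lora_r
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "peft_implementation": "QLORA",
    "quantization_bits": 8,
    "validation_ratio": 0.1,                # non-default: hold 10% of training data for validation
    "validation_steps": 15,
    "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}
```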
## Generation Parameters
Configuration for synthetic data generation after training, including number of records, temperature, and structured generation options.
| Parameter | Type | Description |
|---|---|---|
| num_records | integer | Default: `1000`. Number of records to generate. |
| temperature | number | Default: `0.9`. Sampling temperature. |
| repetition_penalty | number | Default: `1.0`. The value used to control the likelihood of the model repeating the same token. |
| top_p | number | Default: `1.0`. Nucleus sampling probability. |
| patience | integer | Default: `3`. Number of consecutive generations where the `invalid_fraction_threshold` is reached before stopping generation. |
| invalid_fraction_threshold | number | Default: `0.8`. The fraction of invalid records that will stop generation after the `patience` limit is reached. |
| use_structured_generation | boolean | Default: `false`. Use structured generation. |
| structured_generation_backend | string | Default: `auto`. The backend used by vLLM when `use_structured_generation=True`. Supported backends (from vLLM) are 'outlines', 'guidance', 'xgrammar', and 'lm-format-enforcer'. 'auto' lets vLLM choose the backend. Allowed: auto, xgrammar, guidance, outlines, lm-format-enforcer. |
| structured_generation_schema_method | string | Default: `regex`. The method used to generate the schema from your dataset and pass it to the generation backend. 'auto' will usually default to 'json_schema'. Use 'regex' to use our custom regex construction method, which tends to be more comprehensive than 'json_schema' at the cost of speed. Allowed: regex, json_schema. |
| enforce_timeseries_fidelity | boolean | Default: `false`. Enforce time series fidelity by enforcing the time series order, intervals, and start and end times of the records. |
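As an illustration, the following sketch requests 5,000 records with structured generation enabled so the output conforms to the schema derived from the training data. The values are examples, not recommendations.

```python
# Illustrative generation parameters: 5,000 records with structured
# generation turned on; the backend choice is left to vLLM ('auto').
generation_params = {
    "num_records": 5000,
    "temperature": 0.9,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "patience": 3,                      # stop after 3 consecutive over-threshold generations
    "invalid_fraction_threshold": 0.8,  # a generation counts against patience above 80% invalid records
    "use_structured_generation": True,
    "structured_generation_backend": "auto",
    "structured_generation_schema_method": "regex",
}
```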
## Differential Privacy Parameters
Hyperparameters for differential privacy during training using DP-SGD. Enable these for formal privacy guarantees.
| Parameter | Type | Description |
|---|---|---|
| dp_enabled | boolean | Default: `false`. Enable differentially private training with DP-SGD. |
| epsilon | number | Default: `8.0`. Target for epsilon when training completes. |
| delta | string \| number | Default: `auto`. Probability of accidentally leaking information. Setting to 'auto' uses a delta of 1/n^1.2, where n is the number of training records. |
| per_sample_max_grad_norm | number | Default: `1.0`. Maximum L2 norm of per-sample gradients. |
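The sketch below enables DP-SGD with the default privacy budget; it is illustrative only. Note that when differential privacy is enabled, the data parameter `max_sequences_per_example` resolves to 1 under its 'auto' setting so that each example's contribution is bounded.

```python
# Illustrative differential privacy parameters targeting epsilon = 8.
privacy_params = {
    "dp_enabled": True,
    "epsilon": 8.0,                   # target privacy budget at the end of training
    "delta": "auto",                  # resolves to 1 / n^1.2 for n training records
    "per_sample_max_grad_norm": 1.0,  # clip per-sample gradients to this L2 norm
}
```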
## Evaluation Parameters
Configuration for synthetic data quality and privacy assessment, including MIA, AIA, and PII replay detection.
| Parameter | Type | Description |
|---|---|---|
| mia_enabled | boolean | Default: `true`. Enable membership inference attack evaluation. |
| aia_enabled | boolean | Default: `true`. Enable attribute inference attack evaluation. |
| sqs_report_columns | integer | Default: `250`. |
| sqs_report_rows | integer | Default: `5000`. |
| mandatory_columns | integer | - |
| enabled | boolean | Default: `true`. Enable evaluation. |
| quasi_identifier_count | integer | Default: `3`. Number of quasi-identifiers to sample. |
| pii_replay_enabled | boolean | Default: `true`. Enable PII Replay detection. |
| pii_replay_entities | string[] | List of entities for PII Replay. If not provided, default entities will be used. |
| pii_replay_columns | string[] | List of columns for PII Replay. If not provided, only entities will be used. |
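For example, an evaluation configuration that keeps the privacy attacks enabled and focuses PII replay detection on a few entity types and columns might look like the sketch below. The entity and column names are placeholders for your own data, not defaults.

```python
# Illustrative evaluation parameters: keep MIA/AIA enabled and restrict
# PII replay detection to a few entities and columns (placeholders).
evaluation_params = {
    "enabled": True,
    "mia_enabled": True,
    "aia_enabled": True,
    "quasi_identifier_count": 3,
    "pii_replay_enabled": True,
    "pii_replay_entities": ["email", "phone_number"],  # placeholder entity names
    "pii_replay_columns": ["email", "phone"],          # placeholder column names
}
```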
## PII Replacement Configuration
Configuration for PII detection and replacement. See PII Replacement for conceptual documentation.
| Parameter | Type | Description |
|---|---|---|
| globals | object | Global config options. |
| steps * | object[] | List of transform steps to perform on the input. |
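Structurally, the PII replacement configuration is a `globals` object plus a required list of `steps`. The sketch below shows only that shape; the contents of `globals` and of each step are described on the PII Replacement page, and the empty objects here are placeholders rather than a validated configuration.

```python
# Minimal sketch of the replace_pii structure (shape only).
# The contents of `globals` and each step follow the PII Replacement
# documentation; the empty objects below are placeholders.
replace_pii_params = {
    "globals": {},   # global config options shared by all steps
    "steps": [       # required: transform steps applied to the input, in order
        {},          # each step is an object; see the PII Replacement page
    ],
}
```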
## Example Configuration
Here’s an example showing a complete job configuration using the Python SDK:
```python
import os

import pandas as pd

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder

# Placeholders
df: pd.DataFrame = pd.DataFrame()

client = NeMoMicroservices(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080")
)

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_train(
        num_input_records_to_sample=10000,
        learning_rate=0.0005,
        batch_size=1,
    )
    .with_generate(
        num_records=5000,
        temperature=0.9,
    )
    .with_differential_privacy(
        dp_enabled=True,
        epsilon=8.0,
    )
    .with_replace_pii()
    .synthesize()
)

job = builder.create_job(name="my-job", project="my-project")
```