Running Inference with Guardrails#

NeMo Guardrails exposes inference endpoints that applications use to run inference with guardrails. When the client application invokes these endpoints, NeMo Guardrails applies the specified safety checks to the user input and model response. You can apply any guardrail configuration within the workspace to any model within the same workspace.

Prerequisites#

Tip

If you don’t have access to GPUs, you can use NIMs hosted on build.nvidia.com. See Using an External Endpoint for setup instructions.

For the examples below, create a simple guardrail configuration that applies the self-check input and output rails.

from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url="http://localhost:8080", workspace="default")

# Model Entity to use for inference (workspace/model_name format)
BASE_MODEL = "default/nvidia-nemotron-3-nano-30b-a3b"

config_data = {
    "models": [
        {
            "type": "main",
            "engine": "nim",
        }
    ],
    "prompts": [
        {
            "task": "self_check_input",
            "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
        },
        {
            "task": "self_check_output",
            "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
        },
    ],
    "rails": {
        "input": {"flows": ["self check input"]},
        "output": {"flows": ["self check output"]},
    },
}

config = sdk.guardrail.configs.create(
    name="self-check-config",
    description="Demo self-check configuration for inference examples",
    data=config_data,
)

Guardrails Inference Endpoints#

NeMo Guardrails exposes OpenAI-compatible endpoints with an additional guardrails field in the request body:

  • /v2/workspaces/{workspace}/guardrail/chat/completions

  • /v2/workspaces/{workspace}/guardrail/completions

The guardrails field is an object that supports the following fields:

  • config_id: The name of the guardrail configuration in the workspace to apply.

  • config: Alternative to config_id; specifies an inline guardrail configuration to apply.

  • options: Additional guardrail options that control which rails are applied and what information is returned. See Guardrails Options for configurable fields.

  • return_choice: If true, the guardrail data is returned as a choice in the choices array with the guardrails_data role. Defaults to false. This field is helpful when you use third-party clients to make requests to the NeMo Guardrails microservice. These clients typically don’t forward back additional response fields that are not part of the OpenAI response format.

By default, if your model references a Model Entity (i.e. workspace/model_name), inference requests are automatically routed through Inference Gateway.

Optionally, if you require a direct connection to a specific endpoint for a model, you can explicitly set parameters.base_url. As a fallback, NeMo Guardrails uses the NIM_ENDPOINT_URL environment variable.
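
For example, a model entry in the guardrail configuration can point directly at a specific endpoint. The following is a minimal sketch; the URL is a placeholder, not a real service address.

direct_endpoint_model = {
    "type": "main",
    "engine": "nim",
    "parameters": {
        # Hypothetical endpoint URL; replace with your model's actual address
        "base_url": "http://my-nim-service:8000/v1",
    },
}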

Chat Completions#

The /v2/workspaces/{workspace}/guardrail/chat/completions endpoint is an OpenAI-compatible endpoint for chat completions.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "What is the company policy on vacation time?"}
    ],
    guardrails={
        "config_id": "self-check-config",  # Reference config by name
    },
    max_tokens=200,
    stream=False,
)

print(f"Response: {response.choices[0].message.content}")

Completions#

The /v2/workspaces/{workspace}/guardrail/completions endpoint is an OpenAI-compatible endpoint for completions.

Note

The https://integrate.api.nvidia.com/v1 endpoint provided by the NVIDIA API Catalog does not support the completions endpoint.

response = sdk.guardrail.completions.create(
    model=BASE_MODEL,
    prompt="Briefly list 3 fun summer activities",
    guardrails={
        "config_id": "self-check-config",
    },
    max_tokens=100,
    stream=False,
)

print(f"Response: {response.choices[0].text}")

Inline Configuration#

Instead of referencing a stored config, you can pass the configuration inline:

inline_config = {
    "models": [
        {
            "type": "main",
            "engine": "nim",
        }
    ],
    "prompts": [
        {
            "task": "self_check_input",
            "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
        },
        {
            "task": "self_check_output",
            "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
        },
    ],
    "rails": {
        "input": {"flows": ["self check input"]},
        "output": {"flows": ["self check output"]},
    },
}

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Hello, how are you?"},
    ],
    guardrails={
        "config": inline_config,
    },
)

print(f"Response with inline config: {response.choices[0].message.content}")

Streaming Output#

Streaming reduces time-to-first-token (TTFT) by returning chunks as they’re generated. Guardrails streams safe chunks until the model finishes responding or produces an unsafe chunk.

Configuration#

Enable streaming in your config’s output rails. The streaming property supports the following fields:

| Field | Type | Purpose | Default value |
|---|---|---|---|
| enabled | boolean | Enable LLM output streaming. | False |
| chunk_size | int | Number of tokens per chunk that output rails process. | 200 |
| context_size | int | Tokens carried over between chunks for continuity. | 50 |
| stream_first | boolean | If True, tokens stream immediately before output rails are applied. | True |

rails = {
    "output": {
        "flows": ["self check output"],
        "streaming": {
            "enabled": True,
            "chunk_size": 200,
            "context_size": 50,
            "stream_first": True
        }
    }
}
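
For example, the streaming settings can be added to the self-check-config created in the Prerequisites section by updating the stored configuration. This is a minimal sketch that reuses the config_data dictionary from earlier.

config_data["rails"]["output"]["streaming"] = {
    "enabled": True,
    "chunk_size": 200,
    "context_size": 50,
    "stream_first": True,
}

updated = sdk.guardrail.configs.update(
    name="self-check-config",
    data=config_data,
)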

Performance#

The primary purpose of streaming output is to reduce the time-to-first-token (TTFT) of the LLM response. The following table shows timing results, in seconds, for 20 requests to the OpenAI gpt-4o model with and without streaming output, measured with a very basic timing script.

| Configuration | Mean TTFT | Median TTFT | Stdev | Min TTFT | Max TTFT |
|---|---|---|---|---|---|
| Streaming enabled | 0.5475 | 0.5208 | 0.1248 | 0.4241 | 0.9287 |
| Streaming disabled | 3.6834 | 3.6127 | 1.6949 | 0.4487 | 7.4227 |

The streaming enabled configuration is faster by 85.14%, on average, and has more consistent performance, as shown by the lower standard deviation, 0.1248 versus 1.6949.

The streaming configuration used the default configuration values: chunk_size: 200, context_size: 50, and stream_first: True.
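
The following is a minimal sketch of such a timing loop, adapted to the examples in this guide. It reuses the SDK client and configuration from the Prerequisites section; absolute numbers depend on your model and hardware.

import time

ttfts = []
for _ in range(20):
    start = time.perf_counter()
    stream = sdk.guardrail.chat.completions.create(
        model=BASE_MODEL,
        messages=[{"role": "user", "content": "Write a short paragraph about summer."}],
        guardrails={"config_id": "self-check-config"},
        max_tokens=200,
        stream=True,
    )
    for chunk in stream:
        # Record the time to the first content-bearing chunk
        if chunk.choices and chunk.choices[0].delta.content:
            ttfts.append(time.perf_counter() - start)
            break

print(f"Mean TTFT: {sum(ttfts) / len(ttfts):.4f} s")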

Streaming Chat Completions#

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    guardrails={
        "config_id": "self-check-config",
    },
    max_tokens=200,
    stream=True,  # Enable streaming
)

print("Streaming response:")
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Blocked Content Detection#

The service applies guardrail checks on chunks of tokens as they are streamed from the model. When content is blocked during streaming, you’ll receive an error response:

{
  "error": {
    "message": "Blocked by self check output rails.",
    "type": "guardrails_violation",
    "param": "self check output",
    "code": "content_blocked"
  }
}
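
The following is a minimal sketch of detecting this error while consuming the stream over raw HTTP. It assumes the microservice is reachable at http://localhost:8080, uses the default workspace, and that the error arrives as a JSON payload in the event stream.

import json
import requests

url = "http://localhost:8080/v2/workspaces/default/guardrail/chat/completions"
body = {
    "model": BASE_MODEL,
    "messages": [{"role": "user", "content": "Explain machine learning in simple terms."}],
    "guardrails": {"config_id": "self-check-config"},
    "stream": True,
}

with requests.post(url, json=body, stream=True) as resp:
    for line in resp.iter_lines():
        # Server-sent events arrive as lines prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            print(f"\nBlocked: {event['error']['message']}")
            break
        delta = event["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)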

Guardrails Options#

The guardrails.options field controls which rails are applied and what information is returned. The following fields can be configured:

{
  "options": {
    "rails": {
      "input": true,
      "output": true,
      "dialog": true,
      "retrieval": true,
      "tool_input": true,
      "tool_output": true
    },
    "log": {
      "activated_rails": true,
      "llm_calls": false
    },
    "llm_params": {
      "temperature": 0.7
    },
    "llm_output": false,
    "output_vars": ["relevant_chunks"],
  }
}
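
For example, the following request disables the output rails and asks for information about the activated rails. This is a minimal sketch using the configuration created earlier.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
    guardrails={
        "config_id": "self-check-config",
        "options": {
            "rails": {"input": True, "output": False},  # apply input rails only
            "log": {"activated_rails": True},  # include activated-rail details
        },
    },
)

print(response.model_dump_json(indent=2))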

Top-level options#

The guardrails.options object supports the following top-level fields. The rails and log fields are objects described in the Rails options and Log options sections that follow.

| Field | Type | Description | Default value |
|---|---|---|---|
| llm_params | object | Additional parameters to pass to the main model, for example temperature or max_tokens. | None |
| llm_output | boolean | Whether to include custom model output in the response. | false |
| output_vars | boolean or array | Context variables to return. Set to true for all, or provide a list of variable names. | None |

Rails options#

The guardrails.options.rails object supports the following fields:

| Field | Type | Description | Default value |
|---|---|---|---|
| input | boolean or array | Set to false to disable input rails. To enable only specific rails, provide a list of rail names. | true |
| output | boolean or array | Set to false to disable output rails. To enable only specific rails, provide a list of rail names. | true |
| dialog | boolean | Set to false to disable dialog rails. | true |
| retrieval | boolean or array | Set to false to disable retrieval rails. To enable only specific rails, provide a list of rail names. | true |
| tool_input | boolean | Set to false to disable tool input rails. | true |
| tool_output | boolean | Set to false to disable tool output rails. | true |
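
For example, to run only the self check input rail and disable output rails entirely, the rails options could look like the following minimal sketch.

guardrails_payload = {
    "config_id": "self-check-config",
    "options": {
        "rails": {
            "input": ["self check input"],  # run only this input rail
            "output": False,  # disable all output rails
        }
    },
}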

Log options#

The guardrails.options.log object supports the following fields:

| Field | Type | Description | Default value |
|---|---|---|---|
| activated_rails | boolean | Whether to include information about which rails were activated. | false |
| llm_calls | boolean | Whether to include details about LLM calls (prompts, completions, token usage). | false |
| internal_events | boolean | Whether to include the array of internally generated events. | false |
| colang_history | boolean | Whether to include the conversation history in Colang format. | false |

Return as Choice#

For clients that don’t handle extra response fields, you can configure the request to return the guardrail data as a choice in the choices list, with the role field set to guardrails_data. The default value is false.

This field is helpful when you use third-party clients to make requests to the NeMo Guardrails microservice. These clients typically don’t forward back additional response fields that are not part of the OpenAI response format.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
    guardrails={
        "config_id": "self-check-config",
        "return_choice": True,
    },
)
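
With return_choice set to true, the guardrail data can then be read back from the choices list, as in the following minimal sketch. The exact contents of the guardrails_data message depend on the options you request.

for choice in response.choices:
    if choice.message.role == "guardrails_data":
        print("Guardrail data:", choice.message.content)
    else:
        print("Assistant:", choice.message.content)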

Custom HTTP Headers#

You can optionally provide custom headers to propagate in requests to a model. A custom header must start with the x- or X- prefix.

There are two ways to specify custom headers:

  • At configuration time: headers are specified in the guardrails configuration. These headers are used by default in every inference or check request that uses the configuration.

  • At request time: headers are specified when making an inference request.

Note

If you define a custom header with the same name (case insensitive) in both the request and the guardrail configuration, the request header value takes precedence over the default value set in the guardrail configuration.

Specify Custom Headers in Guardrail Configuration#

In the guardrail configuration, use the parameters.default_headers field to populate default headers to use in requests to the given model. The key is the header name, and the value is the default value.

updated_config = sdk.guardrail.configs.update(
    name="self-check-config",
    data={
        "models": [
            {
                "type": "main",
                "engine": "nim",
                "parameters": {
                    # Optional default headers. These can be overridden at request-time
                    "default_headers": {
                        "X-Custom-Header": "custom-header-value",
                    },
                },
            }
        ],
        "prompts": [
            {
                "task": "self_check_input",
                "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
            },
            {
                "task": "self_check_output",
                "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
            },
        ],
        "rails": {
            "input": {"flows": ["self check input"]},
            "output": {"flows": ["self check output"]},
        },
    },
)

print(updated_config.model_dump_json(indent=2))

Specify Custom Headers at Request Time#

Chat Completions#

Add custom headers when making a chat completion request.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Briefly list 3 fun summer activities"}
    ],
    guardrails={
        "config_id": "self-check-config",
    },
    extra_headers={
        "X-Custom-Header": "updated-custom-header-value",
    },
    max_tokens=100,
    stream=False,
)

print(response.model_dump_json(indent=2))
Example Output
{
  "id": "chatcmpl-6e6ee35f-87be-4372-8f3d-f4f0c61f51db",
  "object": "chat.completion",
  "model": "default/nvidia-nemotron-3-nano-30b-a3b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here are three summer-time ideas you can try right away:\n 1. Beach volleyball\n 2. DIY frozen-fruit popsicles\n 3. Paddleboarding"
      },
      "finish_reason": "stop"
    }
  ]
}

Completions#

Add custom headers when making a completion request.

response = sdk.guardrail.completions.create(
    model=BASE_MODEL,
    prompt="Briefly list 3 fun summer activities",
    guardrails={
        "config_id": "self-check-config",
    },
    extra_headers={
        "X-Custom-Header": "updated-custom-header-value",
    },
    max_tokens=100,
    stream=False,
)

print(response.model_dump_json(indent=2))