Running Inference with Guardrails#

NeMo Guardrails exposes inference endpoints that applications use to run inference with guardrails. When the client application invokes these endpoints, NeMo Guardrails applies the specified safety checks to the user input and model response. You can apply any guardrail configuration within the workspace to any model within the same workspace.

Prerequisites#

Tip

If you don’t have access to GPUs, you can use NIMs hosted on build.nvidia.com. See Using an External Endpoint for setup instructions.

For the examples below, create a simple guardrail configuration that applies the self-check input and output rails.

from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url="http://localhost:8080", workspace="default")

# Model Entity to use for inference (workspace/model_name format)
BASE_MODEL = "default/nvidia-nemotron-3-nano-30b-a3b"

config_data = {
    "models": [
        {
            "type": "main",
            "engine": "nim",
        }
    ],
    "prompts": [
        {
            "task": "self_check_input",
            "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
        },
        {
            "task": "self_check_output",
            "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
        },
    ],
    "rails": {
        "input": {"flows": ["self check input"]},
        "output": {"flows": ["self check output"]},
    },
}

config = sdk.guardrail.configs.create(
    name="self-check-config",
    description="Demo self-check configuration for inference examples",
    data=config_data,
)

Guardrails Inference Endpoints#

NeMo Guardrails exposes OpenAI-compatible endpoints with an additional guardrails field in the request body:

  • /v2/workspaces/{workspace}/guardrail/chat/completions

  • /v2/workspaces/{workspace}/guardrail/completions

The guardrails field is an object that supports the following fields:

  • config_id: The name of the guardrail configuration in the workspace to apply.

  • config: Alternative to config_id; specifies an inline guardrail configuration to apply.

  • options: Additional guardrail options that control which rails are applied and what information is returned. See Guardrails Options for configurable fields.

  • return_choice: If true, the guardrail data is returned as a choice in the choices array with the guardrails_data role. Defaults to false. This field is helpful when you use third-party clients to make requests to the NeMo Guardrails microservice. These clients typically don’t forward back additional response fields that are not part of the OpenAI response format.

By default, if your model references a Model Entity (i.e. workspace/model_name), inference requests are automatically routed through Inference Gateway.

Optionally, if you require a direct connection to a specific endpoint for a model, you can explicitly set parameters.base_url. As a fallback, NeMo Guardrails uses the NIM_ENDPOINT_URL environment variable.
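
For example, a model entry in the guardrail configuration can point directly at a specific endpoint. The following is a minimal sketch; the URL is a placeholder, not a real service address.

direct_endpoint_model = {
    "type": "main",
    "engine": "nim",
    "parameters": {
        # Hypothetical endpoint URL; replace with your model's actual address
        "base_url": "http://my-nim-service:8000/v1",
    },
}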

Chat Completions#

The /v2/workspaces/{workspace}/guardrail/chat/completions endpoint is an OpenAI-compatible endpoint for chat completions.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "What is the company policy on vacation time?"}
    ],
    guardrails={
        "config_id": "self-check-config",  # Reference config by name
    },
    max_tokens=200,
    stream=False,
)

print(f"Response: {response.choices[0].message.content}")

Completions#

The /v2/workspaces/{workspace}/guardrail/completions endpoint is an OpenAI-compatible endpoint for completions.

Note

The https://integrate.api.nvidia.com/v1 endpoint provided by the NVIDIA API Catalog does not support the completions endpoint.

response = sdk.guardrail.completions.create(
    model=BASE_MODEL,
    prompt="Briefly list 3 fun summer activities",
    guardrails={
        "config_id": "self-check-config",
    },
    max_tokens=100,
    stream=False,
)

print(f"Response: {response.choices[0].text}")

Inline Configuration#

Instead of referencing a stored config, you can pass the configuration inline:

inline_config = {
    "models": [
        {
            "type": "main",
            "engine": "nim",
        }
    ],
    "prompts": [
        {
            "task": "self_check_input",
            "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
        },
        {
            "task": "self_check_output",
            "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
        },
    ],
    "rails": {
        "input": {"flows": ["self check input"]},
        "output": {"flows": ["self check output"]},
    },
}

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Hello, how are you?"},
    ],
    guardrails={
        "config": inline_config,
    },
)

print(f"Response with inline config: {response.choices[0].message.content}")

Streaming Output#

Streaming reduces time-to-first-token (TTFT) by returning chunks as they’re generated. Guardrails streams safe chunks until the model finishes responding or produces an unsafe chunk.

Configuration#

Enable streaming in your config’s output rails. The streaming property supports the following fields:

| Field | Type | Purpose | Default value |
|---|---|---|---|
| enabled | boolean | Enable LLM output streaming. | False |
| chunk_size | int | Number of tokens per chunk that output rails process. | 200 |
| context_size | int | Tokens carried over between chunks for continuity. | 50 |
| stream_first | boolean | If True, tokens stream immediately before output rails are applied. | True |

rails = {
    "output": {
        "flows": ["self check output"],
        "streaming": {
            "enabled": True,
            "chunk_size": 200,
            "context_size": 50,
            "stream_first": True
        }
    }
}
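
For example, the streaming settings can be added to the self-check-config created in the Prerequisites section by updating the stored configuration. This is a minimal sketch that reuses the config_data dictionary from earlier.

config_data["rails"]["output"]["streaming"] = {
    "enabled": True,
    "chunk_size": 200,
    "context_size": 50,
    "stream_first": True,
}

updated = sdk.guardrail.configs.update(
    name="self-check-config",
    data=config_data,
)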

Performance#

The primary purpose of streaming output is to reduce the time-to-first-token (TTFT) of the LLM response. The following table shows timing results, in seconds, for 20 requests to the OpenAI gpt-4o model with and without streaming output, measured with a very basic timing script.

| Configuration | Mean TTFT | Median TTFT | Stdev | Min TTFT | Max TTFT |
|---|---|---|---|---|---|
| Streaming enabled | 0.5475 | 0.5208 | 0.1248 | 0.4241 | 0.9287 |
| Streaming disabled | 3.6834 | 3.6127 | 1.6949 | 0.4487 | 7.4227 |

The streaming enabled configuration is faster by 85.14%, on average, and has more consistent performance, as shown by the lower standard deviation, 0.1248 versus 1.6949.

The streaming configuration used the default configuration values: chunk_size: 200, context_size: 50, and stream_first: True.
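
The following is a minimal sketch of such a timing loop, adapted to the examples in this guide. It reuses the SDK client and configuration from the Prerequisites section; absolute numbers depend on your model and hardware.

import time

ttfts = []
for _ in range(20):
    start = time.perf_counter()
    stream = sdk.guardrail.chat.completions.create(
        model=BASE_MODEL,
        messages=[{"role": "user", "content": "Write a short paragraph about summer."}],
        guardrails={"config_id": "self-check-config"},
        max_tokens=200,
        stream=True,
    )
    for chunk in stream:
        # Record the time to the first content-bearing chunk
        if chunk.choices and chunk.choices[0].delta.content:
            ttfts.append(time.perf_counter() - start)
            break

print(f"Mean TTFT: {sum(ttfts) / len(ttfts):.4f} s")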

Streaming Chat Completions#

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    guardrails={
        "config_id": "self-check-config",
    },
    max_tokens=200,
    stream=True,  # Enable streaming
)

print("Streaming response:")
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Blocked Content Detection#

The service applies guardrail checks on chunks of tokens as they are streamed from the model. When content is blocked during streaming, you’ll receive an error response:

{
  "error": {
    "message": "Blocked by self check output rails.",
    "type": "guardrails_violation",
    "param": "self check output",
    "code": "content_blocked"
  }
}
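
The following is a minimal sketch of detecting this error while consuming the stream over raw HTTP. It assumes the microservice is reachable at http://localhost:8080, uses the default workspace, and that the error arrives as a JSON payload in the event stream.

import json
import requests

url = "http://localhost:8080/v2/workspaces/default/guardrail/chat/completions"
body = {
    "model": BASE_MODEL,
    "messages": [{"role": "user", "content": "Explain machine learning in simple terms."}],
    "guardrails": {"config_id": "self-check-config"},
    "stream": True,
}

with requests.post(url, json=body, stream=True) as resp:
    for line in resp.iter_lines():
        # Server-sent events arrive as lines prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            print(f"\nBlocked: {event['error']['message']}")
            break
        delta = event["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)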

Guardrails Options#

The guardrails.options field controls which rails are applied and what information is returned. The following fields can be configured:

{
  "options": {
    "rails": {
      "input": true,
      "output": true,
      "dialog": true,
      "retrieval": true,
      "tool_input": true,
      "tool_output": true
    },
    "log": {
      "activated_rails": true,
      "llm_calls": false
    },
    "llm_params": {
      "temperature": 0.7
    },
    "llm_output": false,
    "output_vars": ["relevant_chunks"],
  }
}
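
For example, the following request disables the output rails and asks for information about the activated rails. This is a minimal sketch using the configuration created earlier.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
    guardrails={
        "config_id": "self-check-config",
        "options": {
            "rails": {"input": True, "output": False},  # apply input rails only
            "log": {"activated_rails": True},  # include activated-rail details
        },
    },
)

print(response.model_dump_json(indent=2))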

Top-level options#

The guardrails.options object supports the following top-level fields. The rails and log fields are objects described in the Rails options and Log options sections that follow.

| Field | Type | Description | Default value |
|---|---|---|---|
| llm_params | object | Additional parameters to pass to the main model, for example temperature or max_tokens. | None |
| llm_output | boolean | Whether to include custom model output in the response. | false |
| output_vars | boolean or array | Context variables to return. Set to true for all, or provide a list of variable names. | None |

Rails options#

The guardrails.options.rails object supports the following fields:

| Field | Type | Description | Default value |
|---|---|---|---|
| input | boolean or array | Set to false to disable input rails. To enable only specific rails, provide a list of rail names. | true |
| output | boolean or array | Set to false to disable output rails. To enable only specific rails, provide a list of rail names. | true |
| dialog | boolean | Set to false to disable dialog rails. | true |
| retrieval | boolean or array | Set to false to disable retrieval rails. To enable only specific rails, provide a list of rail names. | true |
| tool_input | boolean | Set to false to disable tool input rails. | true |
| tool_output | boolean | Set to false to disable tool output rails. | true |
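
For example, to run only the self check input rail and disable output rails entirely, the rails options could look like the following minimal sketch.

guardrails_payload = {
    "config_id": "self-check-config",
    "options": {
        "rails": {
            "input": ["self check input"],  # run only this input rail
            "output": False,  # disable all output rails
        }
    },
}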

Log options#

The guardrails.options.log object supports the following fields:

| Field | Type | Description | Default value |
|---|---|---|---|
| activated_rails | boolean | Whether to include information about which rails were activated. | false |
| llm_calls | boolean | Whether to include details about LLM calls (prompts, completions, token usage). | false |
| internal_events | boolean | Whether to include the array of internally generated events. | false |
| colang_history | boolean | Whether to include the conversation history in Colang format. | false |

Return as Choice#

For clients that don’t handle extra response fields, you can configure the request to return the guardrail data as a choice in the choices list, with the role field set to guardrails_data. The default value is false.

This field is helpful when you use third-party clients to make requests to the NeMo Guardrails microservice. These clients typically don’t forward back additional response fields that are not part of the OpenAI response format.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
    guardrails={
        "config_id": "self-check-config",
        "return_choice": True,
    },
)
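
With return_choice set to true, the guardrail data can then be read back from the choices list, as in the following minimal sketch. The exact contents of the guardrails_data message depend on the options you request.

for choice in response.choices:
    if choice.message.role == "guardrails_data":
        print("Guardrail data:", choice.message.content)
    else:
        print("Assistant:", choice.message.content)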

Custom HTTP Headers#

You can optionally provide custom headers to propagate in requests to a model. A custom header must start with the x- or X- prefix.

There are two ways to specify custom headers:

  • At configuration time: headers are specified in the guardrails configuration. These headers are used by default in every inference or check request that uses the configuration.

  • At request time: headers are specified when making an inference request.

Note

If you define a custom header with the same name (case insensitive) in both the request and the guardrail configuration, the request header value takes precedence over the default value set in the guardrail configuration.

Specify Custom Headers in Guardrail Configuration#

In the guardrail configuration, use the parameters.default_headers field to populate default headers to use in requests to the given model. The key is the header name, and the value is the default value.

updated_config = sdk.guardrail.configs.update(
    name="self-check-config",
    data={
        "models": [
            {
                "type": "main",
                "engine": "nim",
                "parameters": {
                    # Optional default headers. These can be overridden at request-time
                    "default_headers": {
                        "X-Custom-Header": "custom-header-value",
                    },
                },
            }
        ],
        "prompts": [
            {
                "task": "self_check_input",
                "content": 'Your task is to check if the user message below complies with company policy.\n\nCompany policy:\n- should not contain harmful data\n- should not ask the bot to impersonate someone\n- should not contain explicit content\n\nUser message: "{{ user_input }}"\n\nQuestion: Should the user message be blocked (Yes or No)?\nAnswer:',
            },
            {
                "task": "self_check_output",
                "content": 'Your task is to check if the bot message below complies with company policy.\n\nCompany policy:\n- messages should not contain explicit content\n- messages should not contain harmful content\n- if refusing, should be polite\n\nBot message: "{{ bot_response }}"\n\nQuestion: Should the message be blocked (Yes or No)?\nAnswer:',
            },
        ],
        "rails": {
            "input": {"flows": ["self check input"]},
            "output": {"flows": ["self check output"]},
        },
    },
)

print(updated_config.model_dump_json(indent=2))

Specify Custom Headers at Request Time#

Chat Completions#

Add custom headers when making a chat completion request.

response = sdk.guardrail.chat.completions.create(
    model=BASE_MODEL,
    messages=[
        {"role": "user", "content": "Briefly list 3 fun summer activities"}
    ],
    guardrails={
        "config_id": "self-check-config",
    },
    extra_headers={
        "X-Custom-Header": "updated-custom-header-value",
    },
    max_tokens=100,
    stream=False,
)

print(response.model_dump_json(indent=2))
Example Output
{
  "id": "chatcmpl-6e6ee35f-87be-4372-8f3d-f4f0c61f51db",
  "object": "chat.completion",
  "model": "default/nvidia-nemotron-3-nano-30b-a3b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here are three summer-time ideas you can try right away:\n 1. Beach volleyball\n 2. DIY frozen-fruit popsicles\n 3. Paddleboarding"
      },
      "finish_reason": "stop"
    }
  ]
}

Completions#

Add custom headers when making a completion request.

response = sdk.guardrail.completions.create(
    model=BASE_MODEL,
    prompt="Briefly list 3 fun summer activities",
    guardrails={
        "config_id": "self-check-config",
    },
    extra_headers={
        "X-Custom-Header": "updated-custom-header-value",
    },
    max_tokens=100,
    stream=False,
)

print(response.model_dump_json(indent=2))