
Run Inference

Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.

Note

This tutorial assumes you have a model deployed or external provider registered. See Deploy Models to set up inference endpoints.

Configure the CLI (if not already done):

nmp config set --base-url "$NMP_BASE_URL" --workspace default

Then initialize the Python SDK client:

import os
from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(
    base_url=os.environ["NMP_BASE_URL"],
    workspace="default"
)

Route by Model Entity

Route requests using the model entity name. The gateway selects an available provider automatically.

Find available model entities:

nmp models list

Or with the SDK:

for model in sdk.models.list():
    print(model.name)

Run inference by passing the model entity name:

# Model entities are auto-discovered from deployments
# Use the model entity name from 'nmp models list'
nmp chat meta-llama-3-2-1b-instruct "Hello!" --max-tokens 100

Or with the SDK:

response = sdk.inference.gateway.model.post(
    "v1/chat/completions",
    name="meta-llama-3-2-1b-instruct",  # Model entity name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
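The exact return type of the gateway post helper depends on the SDK version; assuming it relays a standard OpenAI-style chat completion payload (dict-like), you can read the reply along these lines:

# Sketch: assumes a dict-like OpenAI chat completion payload;
# adjust if the SDK returns a typed response object instead.
print(response["choices"][0]["message"]["content"])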

Route by Provider

Route to a specific provider instance. Use provider routing for A/B testing or to target a specific deployment.

Find available providers:

nmp inference providers list

# List models available on a provider if its API is OpenAI-compatible
nmp inference gateway provider get v1/models --name llama-3-2-1b-deployment

Or with the SDK:

for provider in sdk.inference.providers.list():
    print(f"{provider.name}: {provider.status}")

# List models available on a provider if its API is OpenAI-compatible
models = sdk.inference.gateway.provider.get(
    "v1/models",
    name="llama-3-2-1b-deployment"
)
print(models)
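Because this route proxies the provider's OpenAI-compatible API, the payload should follow the OpenAI /v1/models schema. A minimal sketch, assuming a dict-like response:

# Sketch: assumes the OpenAI /v1/models shape: {"object": "list", "data": [{"id": ...}, ...]}
for entry in models["data"]:
    print(entry["id"])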

Send an inference request using provider routing:

# Provider name matches deployment name for auto-created providers
nmp chat meta/llama-3.2-1b-instruct "Hello!" \
    --provider llama-3-2-1b-deployment \
    --max-tokens 100

Or with the SDK:

response = sdk.inference.gateway.provider.post(
    "v1/chat/completions",
    name="llama-3-2-1b-deployment",  # Provider name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
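Since provider routing pins each request to one deployment, a quick A/B comparison is just the same request posted to two providers. A sketch; the second provider name here is hypothetical:

# Hypothetical provider names; substitute names from 'nmp inference providers list'
for provider_name in ["llama-3-2-1b-deployment", "llama-3-2-1b-canary"]:
    response = sdk.inference.gateway.provider.post(
        "v1/chat/completions",
        name=provider_name,
        body={
            "model": "meta/llama-3.2-1b-instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 100
        }
    )
    print(provider_name, response)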

Route using OpenAI SDK

Use the OpenAI-compatible endpoint as a drop-in replacement for the OpenAI SDK. The model field uses the format {workspace}/{model_entity}.

List Available Models

List the model entities that are reachable through the Inference Gateway:

models = sdk.inference.gateway.openai.v1.models.list()
for model in models.data:
    print(model.id)

Using SDK

You can make requests to the OpenAI-compatible Inference Gateway route with the NeMo Microservices SDK.

response = sdk.inference.gateway.openai.post(
    "v1/chat/completions",
    body={
        "model": "default/meta-llama-3-2-1b-instruct",  # workspace/model-entity
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)

Using OpenAI Python SDK

The Models service provides a convenient helper method, sdk.models.get_openai_client(), that returns an OpenAI SDK client configured for the current workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for the Inference Gateway. See SDK Helper Methods for more information.

# Get pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()

response = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
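The helper returns a standard OpenAI client, so the response is a regular ChatCompletion object:

# Read the assistant's reply from the standard OpenAI response object
print(response.choices[0].message.content)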

Streaming

The OpenAI-compatible Inference Gateway endpoint also supports streaming.

openai_client = sdk.models.get_openai_client()

stream = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    # Some chunks (for example, a final usage chunk) may carry no choices
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
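To keep the full reply while streaming, accumulate the deltas as they arrive:

full_text = []
stream = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    # Collect each delta while printing it incrementally
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="")
        full_text.append(delta)

print()
haiku = "".join(full_text)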