Run Inference#
Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.
Note
This tutorial assumes you have a model deployed or an external provider registered. See Deploy Models to set up inference endpoints.
# Configure CLI (if not already done)
nmp config set --base-url "$NMP_BASE_URL" --workspace default
import os
from nemo_microservices import NeMoMicroservices
sdk = NeMoMicroservices(
    base_url=os.environ["NMP_BASE_URL"],
    workspace="default"
)
Route by Model Entity#
Route requests using the model entity name. The gateway selects an available provider automatically.
Find available model entities:
nmp models list
for model in sdk.models.list():
    print(model.name)
Run inference by passing the model entity name.
# Model entities are auto-discovered from deployments
# Use the model entity name from 'nmp models list'
nmp chat meta-llama-3-2-1b-instruct "Hello!" --max-tokens 100
# Model entities are auto-discovered from deployments
response = sdk.inference.gateway.model.post(
    "v1/chat/completions",
    name="meta-llama-3-2-1b-instruct",  # Model entity name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
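The request body follows the OpenAI chat completions schema, so the reply text appears under choices[0].message.content in the response. A minimal sketch of reading it, assuming the gateway call returns the parsed JSON as a dict (adjust the access if your SDK version returns a typed object instead):
# Assumption: the gateway returns the OpenAI-style completion payload as a dict.
# If your SDK version returns a typed object, read the same fields as attributes.
reply = response["choices"][0]["message"]["content"]
print(reply)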
Route by Provider#
Route to a specific provider instance. Use this approach for A/B testing or for targeting a specific deployment.
Find available providers:
nmp inference providers list
# List models available on a provider if its API is OpenAI compliant
nmp inference gateway provider get v1/models --name llama-3-2-1b-deployment
for provider in sdk.inference.providers.list():
    print(f"{provider.name}: {provider.status}")
# List models available on a provider if its API is OpenAI compliant
models = sdk.inference.gateway.provider.get(
    "v1/models",
    name="llama-3-2-1b-deployment"
)
print(models)
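If the provider speaks the OpenAI API, the payload above should follow the OpenAI /v1/models list schema. A hedged sketch of iterating it, assuming the SDK hands back the parsed JSON as a dict:
# Assumption: 'models' is the parsed OpenAI-style /v1/models response,
# i.e. a dict with a "data" list of model entries.
for entry in models.get("data", []):
    print(entry["id"])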
Send an inference request using provider routing.
# Provider name matches deployment name for auto-created providers
nmp chat meta/llama-3.2-1b-instruct "Hello!" \
--provider llama-3-2-1b-deployment \
--max-tokens 100
# Provider name matches deployment name for auto-created providers
response = sdk.inference.gateway.provider.post(
    "v1/chat/completions",
    name="llama-3-2-1b-deployment",  # Provider name
    body={
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
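If you switch between model-entity and provider routing frequently, for example while A/B testing a new deployment, a small wrapper keeps the request body in one place. This is an illustrative helper built only from the gateway calls shown above; it is not part of the SDK:
# Illustrative helper (not part of the SDK): route by provider when one is
# given, otherwise let the gateway pick a provider for the model entity.
def gateway_chat(sdk, prompt, *, entity="meta-llama-3-2-1b-instruct",
                 provider=None, model="meta/llama-3.2-1b-instruct",
                 max_tokens=100):
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if provider is not None:
        return sdk.inference.gateway.provider.post(
            "v1/chat/completions", name=provider, body=body
        )
    return sdk.inference.gateway.model.post(
        "v1/chat/completions", name=entity, body=body
    )

# Example usage: force the request onto a specific deployment
response = gateway_chat(sdk, "Hello!", provider="llama-3-2-1b-deployment")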
Route using OpenAI SDK#
Use the OpenAI-compatible endpoint as a drop-in replacement for the OpenAI SDK. The model field uses the format {workspace}/{model_entity}.
List Available Models#
List the model entities that are reachable through the Inference Gateway.
models = sdk.inference.gateway.openai.v1.models.list()
for model in models.data:
    print(model.id)
Using SDK#
You can make requests to the OpenAI-compatible Inference Gateway route with the NeMo Microservices SDK.
response = sdk.inference.gateway.openai.post(
    "v1/chat/completions",
    body={
        "model": "default/meta-llama-3-2-1b-instruct",  # workspace/model-entity
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }
)
Using OpenAI Python SDK#
The Models service provides a convenient helper method, sdk.models.get_openai_client(), that generates an OpenAI SDK client for the configured workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for the Inference Gateway. See SDK Helper Methods for more information.
# Get pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()
response = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)
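The OpenAI Python SDK returns a typed ChatCompletion object, so the generated text is available directly as an attribute:
# Read the generated text from the ChatCompletion object.
print(response.choices[0].message.content)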
Streaming#
The OpenAI-compatible Inference Gateway endpoint also supports streaming.
openai_client = sdk.models.get_openai_client()
stream = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
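As an alternative to printing chunks directly, you can also collect the streamed deltas so the full response is available afterwards. A stream can only be iterated once, so request a fresh one first:
# Request a new stream; the previous one has already been consumed.
stream = openai_client.chat.completions.create(
    model="default/meta-llama-3-2-1b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding."}],
    max_tokens=100,
    stream=True
)

# Accumulate the streamed deltas into one string while printing them.
parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        parts.append(delta)
        print(delta, end="", flush=True)
print()
full_text = "".join(parts)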