About Models and Inference#
The NeMo platform provides APIs for deploying models, registering external providers, and routing inference requests through a unified gateway.
```mermaid
flowchart TB
  subgraph Self-Hosted
    MDC[ModelDeploymentConfig] --> MD[ModelDeployment]
    MD --> MP1[ModelProvider]
  end
  subgraph External
    API[External API] --> MP2[ModelProvider]
  end
  MP1 --> IG[Inference Gateway]
  MP2 --> IG
  IG --> Client
```
Model Registry#
The Models service manages model entities, deployment configurations, and deployments.
Core Objects#
ModelDeploymentConfig — A versioned blueprint for deploying a NIM container. Specifies GPU count, container image, model name, and optional settings like LoRA support or custom environment variables. Configs are reusable. You can create multiple deployments from the same config, and updating a config creates a new version without affecting existing deployments.
ModelDeployment — A running instance of a NIM container based on a ModelDeploymentConfig. Deployments progress through lifecycle states (CREATED → PENDING → READY or FAILED) as the container is pulled, started, and initialized. When a deployment reaches the READY state, a ModelProvider is automatically created.
Model — A registered model within the platform, referencing a specific model like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The model may be hosted locally via a NIM, with weights stored in FileService or HuggingFace, or it may be an external model made available by a hosted provider (NVIDIA Build, OpenAI, etc.). All Models are served via a ModelProvider.
ModelProvider — A routable inference host. The provider may be manually registered for external APIs (NVIDIA Build, OpenAI, etc.) or auto-created by the Models service for ModelDeployments. All inference requests route through a ModelProvider, which serves one or more Models.
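Putting the objects together, a typical self-hosted flow is to create a ModelDeploymentConfig, start a ModelDeployment from it, and let the platform register the ModelProvider and Model entities once the deployment is ready. The sketch below illustrates that flow with the Python SDK; the resource paths and field names (deployment.configs, deployment.model_deployments, nim_deployment, gpu) are assumptions for illustration, so check the SDK Reference for the authoritative signatures.

```python
import os
from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# Hypothetical sketch: the resource and field names below are assumptions,
# not the authoritative SDK surface -- see the SDK Reference.
config = sdk.deployment.configs.create(        # reusable, versioned blueprint
    name="llama-3-2-1b-config",
    workspace="default",
    model="meta/llama-3.2-1b-instruct",
    nim_deployment={"image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct", "gpu": 1},
)

deployment = sdk.deployment.model_deployments.create(  # running NIM based on the config
    name="llama-3-2-1b",
    workspace="default",
    config=config.name,
)
# Once the deployment reaches READY, a ModelProvider and Model entities
# are created automatically (see Deployment Lifecycle below).
```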
Deployment Lifecycle#
```mermaid
stateDiagram-v2
  [*] --> CREATED
  CREATED --> PENDING
  PENDING --> READY
  PENDING --> FAILED
  READY --> DELETING
  FAILED --> DELETING
  DELETING --> DELETED
  DELETED --> [*]
```
When a deployment reaches READY:
- A ModelProvider is auto-created, pointing to the NIM's internal service URL
- Models are discovered from the NIM's `/v1/models` endpoint
- Model entities are created for each discovered model
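Because the provider and its models only exist after the deployment reaches READY, clients typically poll the deployment before sending traffic. A minimal polling sketch follows; the retrieve and list calls, their parameters, and the status attribute are assumptions rather than confirmed SDK signatures.

```python
import os
import time
from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# Hypothetical sketch: method names and fields are assumptions -- see the SDK Reference.
while True:
    deployment = sdk.deployment.model_deployments.retrieve("llama-3-2-1b", workspace="default")
    if deployment.status in ("READY", "FAILED"):
        break
    time.sleep(30)  # container pull and initialization can take several minutes

if deployment.status == "READY":
    # The auto-created ModelProvider shares the deployment's name, and the Model
    # entities discovered from the NIM's /v1/models endpoint are now registered.
    for model in sdk.models.list(workspace="default"):
        print(model.name)
```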
Model Providers#
Auto-created providers: Created automatically when a ModelDeployment becomes ready. Named the same as the deployment.
```mermaid
flowchart TB
  subgraph Auto-Created
    MD[ModelDeployment] --> MP1[ModelProvider]
    MP1 --> NIM[Self-hosted NIM]
  end
```
Manual providers: Created by users for external inference endpoints. The API key must be stored in the Secrets service first; a registration sketch follows the diagram below.
```mermaid
flowchart TB
  subgraph User-Created
    User --> MP2[ModelProvider]
    MP2 --> External[External API]
  end
```
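To register a manual provider, store the external API key in the Secrets service and then create a ModelProvider that references the secret and the external endpoint's base URL. The sketch below shows the shape of that flow; the secrets and providers resources and their fields are assumptions, so consult the API Reference for the actual request bodies.

```python
import os
from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# Hypothetical sketch: resource and field names are assumptions, not the
# authoritative API -- see the API Reference and SDK Reference.
# 1. Store the external API key in the Secrets service.
secret = sdk.secrets.create(
    name="nvidia-build-api-key",
    workspace="default",
    value=os.environ["NVIDIA_BUILD_API_KEY"],
)

# 2. Register a ModelProvider that points at the external endpoint
#    and references the stored secret.
provider = sdk.models.providers.create(
    name="nvidia-build",
    workspace="default",
    base_url="https://integrate.api.nvidia.com",
    api_key_secret=secret.name,
)
```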
Inference Gateway#
The Inference Gateway is a Layer 7 reverse proxy that provides unified access to all inference endpoints. It supports three routing patterns:
Routing Patterns#
| Pattern | Endpoint | Use Case |
|---|---|---|
| Model Entity | `/v2/workspaces/{workspace}/inference/gateway/model/{model}/-/{path}` | Route by model name |
| Provider | `/v2/workspaces/{workspace}/inference/gateway/provider/{provider}/-/{path}` | Route to specific provider (A/B testing) |
| OpenAI | `/v2/workspaces/{workspace}/inference/gateway/openai/-/{path}` | OpenAI SDK compatibility (model in body) |
All patterns use /-/ as a separator. Everything after /-/ is forwarded to the backend unchanged.
Path Examples#
```text
# Model entity routing
/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions

# Provider routing
/v2/workspaces/default/inference/gateway/provider/my-deployment/-/v1/chat/completions

# OpenAI routing (model specified in request body as "workspace/model-entity")
/v2/workspaces/default/inference/gateway/openai/-/v1/chat/completions
```
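Since everything after /-/ is forwarded unchanged, the standard openai Python client can talk to the gateway by pointing its base_url at the OpenAI route shown above. A minimal sketch, assuming NMP_BASE_URL and NMP_API_KEY environment variables and a llama-3-2-1b model entity in the default workspace:

```python
import os
from openai import OpenAI

# Minimal sketch: NMP_BASE_URL, NMP_API_KEY, and the llama-3-2-1b model entity
# are assumptions for illustration; adjust them to your environment.
client = OpenAI(
    base_url=os.environ["NMP_BASE_URL"]
    + "/v2/workspaces/default/inference/gateway/openai/-/v1",
    api_key=os.environ.get("NMP_API_KEY", "not-used"),
)

response = client.chat.completions.create(
    model="default/llama-3-2-1b",  # "workspace/model-entity" form, per the OpenAI routing pattern
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```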
SDK Helper Methods#
The SDK provides convenience methods for OpenAI compatibility:
```python
import os
from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# Get a pre-configured OpenAI client that targets the Inference Gateway
openai_client = sdk.models.get_openai_client()

# Get base URLs for the different routing patterns
sdk.models.get_openai_route_base_url()
sdk.models.get_model_entity_route_openai_url(entity)        # entity: an existing Model entity
sdk.models.get_provider_route_openai_url(provider)          # provider: an existing ModelProvider
sdk.models.get_provider_route_openai_url_for_deployment(deployment)  # deployment: an existing ModelDeployment
```
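Continuing the snippet above, the pre-configured client behaves like any other OpenAI client; the default/llama-3-2-1b model name is an assumption that follows the workspace/model-entity convention of the OpenAI route.

```python
# Continues the snippet above: openai_client targets the gateway's OpenAI route.
completion = openai_client.chat.completions.create(
    model="default/llama-3-2-1b",  # assumption: an existing model entity in the default workspace
    messages=[{"role": "user", "content": "Summarize the deployment lifecycle states."}],
)
print(completion.choices[0].message.content)
```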
API Reference#
For complete API details, see the API Reference and SDK Reference.