About Models and Inference#

The NeMo platform provides APIs for deploying models, registering external providers, and routing inference requests through a unified gateway.

flowchart TB
    subgraph Self-Hosted
        MDC[ModelDeploymentConfig] --> MD[ModelDeployment]
        MD --> MP1[ModelProvider]
    end

    subgraph External
        API[External API] --> MP2[ModelProvider]
    end

    MP1 --> IG[Inference Gateway]
    MP2 --> IG
    IG --> Client

Model Registry#

The Models service manages model entities, deployment configurations, and deployments.

Core Objects#

ModelDeploymentConfig — A versioned blueprint for deploying a NIM container. Specifies GPU count, container image, model name, and optional settings like LoRA support or custom environment variables. Configs are reusable. You can create multiple deployments from the same config, and updating a config creates a new version without affecting existing deployments.
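
For illustration, a config for a small chat model might look like the sketch below. The payload shape and field names (nim_deployment, image_name, gpu, additional_envs) are assumptions for this example rather than the authoritative schema; see the API Reference for the exact fields.

# A hedged sketch of a deployment config payload. Field names below are
# illustrative assumptions, not the authoritative schema.
deployment_config = {
    "name": "llama-3-2-1b-config",          # reusable, versioned config name
    "model": "meta/llama-3.2-1b-instruct",  # model the NIM container will serve
    "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",  # NIM container image
        "image_tag": "latest",
        "gpu": 1,                           # GPU count requested for the deployment
        "additional_envs": {"NIM_LOG_LEVEL": "INFO"},  # optional environment variables
    },
}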

ModelDeployment — A running instance of a NIM container based on a ModelDeploymentConfig. Deployments progress through lifecycle states (CREATED → PENDING → READY, or FAILED) as the container is pulled, started, and initialized. When a deployment reaches the READY state, a ModelProvider is automatically created.

Model — A registered model within the platform, referencing a specific model like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16. The model may be hosted locally via a NIM, with weights stored in FileService or HuggingFace, or it may be an external model made available by a hosted provider (NVIDIA Build, OpenAI, etc.). All Models are served via a ModelProvider.

ModelProvider — A routable inference host. The provider may be manually registered for external APIs (NVIDIA Build, OpenAI, etc.) or auto-created by Models Service for ModelDeployments. All inference requests route through a ModelProvider which serves one or more Models.

Deployment Lifecycle#

stateDiagram-v2
    [*] --> CREATED
    CREATED --> PENDING
    PENDING --> READY
    PENDING --> FAILED
    READY --> DELETING
    FAILED --> DELETING
    DELETING --> DELETED
    DELETED --> [*]

When a deployment reaches READY:

  1. ModelProvider is auto-created pointing to the NIM’s internal service URL

  2. Models are discovered from the NIM’s /v1/models endpoint

  3. Model entities are created for each discovered model
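
As a rough sketch, you can wait for this to happen by polling the deployment status. The resource accessor and attribute names used here are assumptions for illustration only; check the SDK Reference for the exact methods.

import os
import time

from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# NOTE: `sdk.deployment.model_deployments.retrieve` and the `status` attribute
# are assumptions for this sketch; the real SDK surface may differ.
name, namespace = "my-deployment", "default"
deployment = sdk.deployment.model_deployments.retrieve(name, namespace=namespace)
while deployment.status not in ("READY", "FAILED"):
    time.sleep(15)  # wait while the container is pulled, started, and initialized
    deployment = sdk.deployment.model_deployments.retrieve(name, namespace=namespace)

if deployment.status == "READY":
    # At this point a ModelProvider named after the deployment exists, and the
    # models discovered from the NIM's /v1/models endpoint are registered.
    print(f"{namespace}/{name} is ready")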


Model Providers#

Auto-created providers: Created automatically when a ModelDeployment becomes READY; the provider is named after the deployment.

flowchart TB
    subgraph Auto-Created
        MD[ModelDeployment] --> MP1[ModelProvider]
        MP1 --> NIM[Self-hosted NIM]
    end

Manual providers: Created by users for external inference endpoints. Requires storing the API key in the Secrets service first.
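
As a rough, hedged sketch of that flow: the REST paths and payload fields below are illustrative assumptions, not the authoritative API, so consult the API Reference for the exact routes and schemas.

import os

import requests

base_url = os.environ["NMP_BASE_URL"]

# 1. Store the external API key in the Secrets service first.
#    The path "/v1/secrets" and the payload shape are hypothetical.
requests.post(
    f"{base_url}/v1/secrets",
    json={"name": "openai-api-key", "data": {"api_key": os.environ["OPENAI_API_KEY"]}},
).raise_for_status()

# 2. Register a ModelProvider pointing at the external endpoint, referencing
#    the stored secret for authentication. Path and fields are hypothetical.
requests.post(
    f"{base_url}/v1/model-providers",
    json={
        "name": "openai",
        "endpoint": "https://api.openai.com/v1",
        "api_key_secret": "openai-api-key",
    },
).raise_for_status()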

flowchart TB
    subgraph User-Created
        User --> MP2[ModelProvider]
        MP2 --> External[External API]
    end

Inference Gateway#

The Inference Gateway is a Layer 7 reverse proxy that provides unified access to all inference endpoints. It supports three routing patterns:

Routing Patterns#

| Pattern | Endpoint | Use Case |
| --- | --- | --- |
| Model Entity | .../gateway/model/{name}/-/* | Route by model name |
| Provider | .../gateway/provider/{name}/-/* | Route to specific provider (A/B testing) |
| OpenAI | .../gateway/openai/-/* | OpenAI SDK compatibility (model in body) |

All patterns use /-/ as a separator. Everything after /-/ is forwarded to the backend unchanged.

Path Examples#

# Model entity routing
/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions

# Provider routing
/v2/workspaces/default/inference/gateway/provider/my-deployment/-/v1/chat/completions

# OpenAI routing (model specified in request body as "workspace/model-entity")
/v2/workspaces/default/inference/gateway/openai/-/v1/chat/completions
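
For example, a direct call to the model-entity route looks like the following sketch. The entity name llama-3-2-1b is an example, and whether the backend also expects a model field in the body depends on the NIM, so treat that part as an assumption.

import os

import requests

base_url = os.environ["NMP_BASE_URL"]

# Everything after /-/ is forwarded to the backend unchanged, so the body is a
# standard OpenAI-style chat completions request.
response = requests.post(
    f"{base_url}/v2/workspaces/default/inference/gateway/model/llama-3-2-1b/-/v1/chat/completions",
    json={
        "model": "llama-3-2-1b",  # assumption: the backend NIM may expect a model field
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    },
)
print(response.json())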

SDK Helper Methods#

The SDK provides convenience methods for OpenAI compatibility:

import os

from nemo_microservices import NeMoMicroservices

sdk = NeMoMicroservices(base_url=os.environ["NMP_BASE_URL"])

# Get a pre-configured OpenAI client
openai_client = sdk.models.get_openai_client()

# Get base URLs for the different routing patterns.
# Here `entity`, `provider`, and `deployment` are previously retrieved
# Model, ModelProvider, and ModelDeployment objects.
sdk.models.get_openai_route_base_url()
sdk.models.get_model_entity_route_openai_url(entity)
sdk.models.get_provider_route_openai_url(provider)
sdk.models.get_provider_route_openai_url_for_deployment(deployment)
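
Continuing from the snippet above, the pre-configured client can be used with the OpenAI routing pattern, where the model is addressed as "workspace/model-entity" in the request body; default/llama-3-2-1b is an example name, not a required value.

# The client presumably targets the gateway's OpenAI route, so only the
# model name and messages are needed.
completion = openai_client.chat.completions.create(
    model="default/llama-3-2-1b",  # example "workspace/model-entity" name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)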


API Reference#

For complete API details, see the API Reference and SDK Reference.