{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "(tutorial-run-inference)=\n", "\n", "# Run Inference\n", "\n", "Route inference requests through the gateway using model entity routing, provider routing, or OpenAI-compatible routing.\n", "\n", ":::{note}\n", "This tutorial assumes you have a model deployed or external provider registered. See [Deploy Models](deploy-models.md) to set up inference endpoints.\n", ":::\n", "\n", "```{include} ../../_snippets/tutorials/cli-sdk-setup.md\n", "```\n", "\n", "---\n", "\n", "## Route by Model Entity\n", "\n", "Route requests using the model entity name. The gateway selects an available provider automatically.\n", "\n", "Find available model entities:\n", "\n", "::::{tab-set}\n", "\n", ":::{tab-item} CLI\n", ":sync: cli\n", "" ] }, { "cell_type": "code", "metadata": { "language": "bash", "vscode": { "languageId": "shellscript" } }, "source": [ "nmp models list" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", ":::{tab-item} Python SDK\n", ":sync: python-sdk\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "for model in sdk.models.list():\n", " print(model.name)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "::::\n", "\n", "Run inference by passing the Model Entity.\n", "\n", "::::{tab-set}\n", "\n", ":::{tab-item} CLI\n", ":sync: cli\n", "" ] }, { "cell_type": "code", "metadata": { "language": "bash", "vscode": { "languageId": "shellscript" } }, "source": [ "# Model entities are auto-discovered from deployments\n", "# Use the model entity name from 'nmp models list'\n", "nmp chat meta-llama-3-2-1b-instruct \"Hello!\" --max-tokens 100" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", ":::{tab-item} Python SDK\n", ":sync: python-sdk\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Model entities are auto-discovered from deployments\n", "response = sdk.inference.gateway.model.post(\n", " \"v1/chat/completions\",\n", " name=\"meta-llama-3-2-1b-instruct\", # Model entity name\n", " body={\n", " \"model\": \"meta/llama-3.2-1b-instruct\",\n", " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n", " \"max_tokens\": 100\n", " }\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "::::\n", "\n", "---\n", "\n", "## Route by Provider\n", "\n", "Route to a specific provider instance. Use for A/B testing or targeting specific deployments.\n", "\n", "Find available providers:\n", "\n", "::::{tab-set}\n", "\n", ":::{tab-item} CLI\n", ":sync: cli\n", "" ] }, { "cell_type": "code", "metadata": { "language": "bash", "vscode": { "languageId": "shellscript" } }, "source": [ "nmp inference providers list\n", "\n", "# List models available on a provider if its API is OpenAI compliant\n", "nmp inference gateway provider get v1/models --name llama-3-2-1b-deployment" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", ":::{tab-item} Python SDK\n", ":sync: python-sdk\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "for provider in sdk.inference.providers.list():\n", " print(f\"{provider.name}: {provider.status}\")\n", "\n", "# List models available on a provider if its API is OpenAI compliant\n", "models = sdk.inference.gateway.provider.get(\n", " \"v1/models\",\n", " name=\"llama-3-2-1b-deployment\"\n", ")\n", "print(models)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "::::\n", "\n", "Pass inference request using provider routing.\n", "\n", "::::{tab-set}\n", "\n", ":::{tab-item} CLI\n", ":sync: cli\n", "" ] }, { "cell_type": "code", "metadata": { "language": "bash", "vscode": { "languageId": "shellscript" } }, "source": [ "# Provider name matches deployment name for auto-created providers\n", "nmp chat meta/llama-3.2-1b-instruct \"Hello!\" \\\n", " --provider llama-3-2-1b-deployment \\\n", " --max-tokens 100" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", ":::{tab-item} Python SDK\n", ":sync: python-sdk\n", "" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Provider name matches deployment name for auto-created providers\n", "response = sdk.inference.gateway.provider.post(\n", " \"v1/chat/completions\",\n", " name=\"llama-3-2-1b-deployment\", # Provider name\n", " body={\n", " \"model\": \"meta/llama-3.2-1b-instruct\",\n", " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n", " \"max_tokens\": 100\n", " }\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::\n", "\n", "::::\n", "\n", "---\n", "\n", "## Route using OpenAI SDK\n", "\n", "Use the OpenAI-compatible endpoint for drop-in SDK replacement. The model field uses format `{workspace}/{model_entity}`.\n", "\n", "### List Available Models\n", "\n", "List available model entities that are reachable via Inference Gateway." ] }, { "cell_type": "code", "metadata": {}, "source": [ "models = sdk.inference.gateway.openai.v1.models.list()\n", "for model in models.data:\n", " print(model.id)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using SDK\n", "\n", "You can make requests to the OpenAI compatible Inference Gateway route with the NeMo Microservices SDK." ] }, { "cell_type": "code", "metadata": {}, "source": [ "response = sdk.inference.gateway.openai.post(\n", " \"v1/chat/completions\",\n", " body={\n", " \"model\": \"default/meta-llama-3-2-1b-instruct\", # workspace/model-entity\n", " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n", " \"max_tokens\": 100\n", " }\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using OpenAI Python SDK\n", "\n", "Models service provides a convenient helper method `sdk.models.get_openai_client()` that can generate an OpenAI SDK client for the configured workspace. The SDK also offers helper methods for generating OpenAI-compatible URL strings for Inference Gateway. See [SDK Helper Methods](../about.md#sdk-helper-methods) for more info." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get pre-configured OpenAI client\n", "openai_client = sdk.models.get_openai_client()\n", "\n", "response = openai_client.chat.completions.create(\n", " model=\"default/meta-llama-3-2-1b-instruct\",\n", " messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n", " max_tokens=100\n", ")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming\n", "\n", "The OpenAI compatible Inference Gateway endpoint also supports streaming." ] }, { "cell_type": "code", "metadata": {}, "source": [ "openai_client = sdk.models.get_openai_client()\n", "\n", "stream = openai_client.chat.completions.create(\n", " model=\"default/meta-llama-3-2-1b-instruct\",\n", " messages=[{\"role\": \"user\", \"content\": \"Write a haiku about coding.\"}],\n", " max_tokens=100,\n", " stream=True\n", ")\n", "\n", "for chunk in stream:\n", " if chunk.choices[0].delta.content:\n", " print(chunk.choices[0].delta.content, end=\"\")" ], "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }