Introduction
vLLM is a high-throughput, open-source inference and serving engine for large language models (LLMs). It provides fast, memory-efficient inference using GPU optimizations such as PagedAttention and continuous batching, making it well suited for GPU-based workloads.
In this tutorial, we will show how to run LLMs with vLLM entirely on your local machine and expose them through a secure public API. This approach lets you run models with GPU acceleration, keep local execution speed, and retain full control over your environment without relying on cloud inference services.
Clarifai Local Runners make this process simple. You can serve AI models or agents directly from your laptop, workstation, or internal server through a secure public API. You don't need to upload your model or manage infrastructure. The Local Runner routes API requests to your machine, executes them locally, and returns the results to the client, while all computation stays on your hardware.
Let's look at how to set that up.
Running Models Locally with vLLM
The vLLM toolkit in the Clarifai CLI lets you initialize, configure, and run models with vLLM locally while exposing them through a secure public API. You can test, integrate, and iterate directly from your machine without standing up any infrastructure.
Step 1: Prerequisites
Install the Clarifai CLI.
vLLM supports models from the Hugging Face Hub. If you're using private repositories, you'll also need a Hugging Face access token.
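A typical setup looks like this (the clarifai login step and the HF_TOKEN variable are assumptions about how you choose to authenticate; any equivalent method works):

```bash
# Install (or upgrade) the Clarifai CLI and Python SDK
pip install --upgrade clarifai

# Authenticate with your Clarifai Personal Access Token (PAT)
clarifai login

# Optional: export a Hugging Face token for private or gated repositories
export HF_TOKEN=your_huggingface_token
```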
Step 2: Initialize a Model
Use the Clarifai CLI to scaffold a vLLM-based model directory. This prepares all the required files for local execution and integration with Clarifai.
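For example (the flag below reflects recent CLI versions; run clarifai model init --help to confirm the options on your install):

```bash
# Scaffold a new vLLM-backed model directory
clarifai model init my-vllm-model --toolkit vllm
```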
If you want to work with a specific model, use the --model-name flag:
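```bash
# The Hugging Face checkpoint shown here is only an example; substitute the model you want to serve
clarifai model init my-vllm-model --toolkit vllm --model-name Qwen/Qwen2.5-1.5B-Instruct
```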
Note: Some models are large and require significant memory. Make sure your machine meets the model's requirements.
After initialization, the generated folder structure looks like this:
model.py – Contains the logic that runs the vLLM server locally and handles inference.
config.yaml – Defines metadata, runtime, checkpoints, and compute settings.
requirements.txt – Lists the Python dependencies.
Step 3: Customize model.py
The scaffold includes a VLLMModel class extending OpenAIModelClass. It defines how your Local Runner interacts with vLLM's OpenAI-compatible server.
Key methods:
load_model() – Launches vLLM's local runtime, loads the checkpoints, and connects to the OpenAI-compatible API endpoint.
predict() – Handles single-prompt inference with optional parameters like max_tokens, temperature, and top_p. Returns the complete response.
generate() – Streams generated tokens in real time for interactive outputs.
You can use these implementations as-is or customize them to fit your preferred request/response structures.
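For instance, a customized predict() inside the generated class might look roughly like this (self.client and self.model_name are placeholders for whatever handles the scaffold's load_model() actually sets up; match them to your generated file):

```python
def predict(self, prompt: str, max_tokens: int = 512,
            temperature: float = 0.7, top_p: float = 0.95) -> str:
    # Forward the prompt to the local vLLM OpenAI-compatible server
    response = self.client.chat.completions.create(
        model=self.model_name,  # checkpoint served by vLLM
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content
```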
Step 4: Configure config.yaml
The config.yaml file defines the model identity, runtime, checkpoints, and compute metadata. A representative layout looks like the sketch below (field names and values are illustrative; keep the ones your scaffold generated):
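```yaml
model:
  id: my-vllm-model
  user_id: YOUR_USER_ID
  app_id: YOUR_APP_ID
  model_type_id: text-to-text

checkpoints:
  type: huggingface
  repo_id: Qwen/Qwen2.5-1.5B-Instruct
  hf_token: YOUR_HF_TOKEN          # only needed for private or gated repos

inference_compute_info:            # optional for local execution (see note below)
  cpu_limit: "2"
  cpu_memory: 8Gi
  num_accelerators: 1
  accelerator_type: ["NVIDIA-*"]
  accelerator_memory: 24Gi
```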
Note: For local execution, inference_compute_info is optional; the model runs entirely on your machine using local CPU/GPU resources. If you later deploy on Clarifai's dedicated compute, you can use it to specify accelerators and resource limits.
Step 5: Start the Local Runner
Start a Local Runner that connects to the vLLM runtime (the command below reflects recent CLI versions; run clarifai model --help if yours differs):
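```bash
# Run from inside the generated model directory
clarifai model local-runner
```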
If any configuration is missing, the CLI will prompt you to define it. After startup, you'll receive a public Clarifai URL for your model. Requests sent to that endpoint are routed securely to your machine, executed through vLLM, and returned to the client.
Step 6: Run Inference with the Local Runner
Once your model is running locally and exposed via the Clarifai Local Runner, you can send inference requests using the OpenAI-compatible API or the Clarifai SDK.
OpenAI-Compatible API
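A minimal sketch using the openai Python client; the base URL follows Clarifai's OpenAI-compatible endpoint, and the model value is a placeholder for the model URL your Local Runner prints at startup:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible endpoint
    api_key="YOUR_CLARIFAI_PAT",
)

response = client.chat.completions.create(
    model="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/my-vllm-model",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```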
Clarifai Python SDK
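A corresponding sketch with the Clarifai Python SDK; the model URL is a placeholder, and the keyword arguments mirror the predict() signature defined in model.py (adjust them if you customized it):

```python
from clarifai.client.model import Model

model = Model(
    url="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/my-vllm-model",
    pat="YOUR_CLARIFAI_PAT",
)

# predict() maps to the method defined in model.py
result = model.predict(prompt="Summarize continuous batching in one paragraph.", max_tokens=256)
print(result)
```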
You can also experiment with the generate() method for real-time streaming.
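A streaming sketch under the same assumptions, iterating over whatever chunks generate() yields:

```python
# Print tokens as they arrive from generate()
for chunk in model.generate(prompt="Write a haiku about local inference."):
    print(chunk, end="", flush=True)
```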
Conclusion
Local Runners give you full control over where your models execute, without sacrificing integration, security, or flexibility. You can prototype, test, and serve real workloads on your own hardware while Clarifai handles routing, authentication, and the public endpoint.
You can try Local Runners for free with the Free Tier, or upgrade to the Developer Plan at $1 per month for the first year to connect up to 5 Local Runners with unlimited hours.
