Sunday, August 31, 2025

Evaluating SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B



Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the distinctive strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management through PagedAttention. It also offers broad support for quantization methods like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.

TensorRT-LLM: TensorRT-LLM is NVIDIA’s TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads grow. While it requires a bit more setup and tuning compared to the other frameworks, TensorRT-LLM represents NVIDIA’s vision for production-grade inference performance.

Framework    | Design Focus                          | Key Strengths
SGLang       | Structured generation, RadixAttention | Low latency, efficient token generation
vLLM         | Continuous batching, PagedAttention   | High throughput, supports quantization
TensorRT-LLM | TensorRT optimizations                | GPU-level efficiency, lowest latency on H100/B200

Benchmark Setup and Results

To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three main categories of performance:

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
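To make the first two definitions concrete, here is a minimal Python sketch (a hypothetical helper of ours, not part of any of the frameworks) that derives TTFT and per-token latency from the arrival times of streamed tokens:

```python
# Hypothetical sketch: compute TTFT and per-token latency from the timestamps
# at which streamed tokens arrive, measured in seconds since the request start.

def latency_metrics(token_arrival_times):
    """Return (TTFT, mean inter-token latency) for one streamed response."""
    if not token_arrival_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_arrival_times[0]  # time to first token
    if len(token_arrival_times) == 1:
        return ttft, 0.0
    # Per-token latency is the mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_arrival_times, token_arrival_times[1:])]
    return ttft, sum(gaps) / len(gaps)

# Example: first token after 0.125 s, then one token every 4 ms.
times = [0.125 + 0.004 * i for i in range(5)]
ttft, per_token = latency_metrics(times)
print(round(ttft, 3), round(per_token, 3))  # 0.125 0.004
```

In a real benchmark, the timestamps would come from a streaming client recording each chunk's arrival; the arithmetic is the same.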

Latency Results

Let’s start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here’s how the three frameworks stacked up:

Time to First Token (seconds)

Concurrency | vLLM  | SGLang | TensorRT-LLM
1           | 0.053 | 0.125  | 0.177
10          | 1.91  | 1.155  | 2.496
50          | 7.546 | 3.08   | 4.14
100         | 1.87  | 8.99   | 15.467

Per-Token Latency (seconds)

Concurrency | vLLM  | SGLang | TensorRT-LLM
1           | 0.005 | 0.004  | 0.004
10          | 0.011 | 0.01   | 0.009
50          | 0.021 | 0.015  | 0.018
100         | 0.019 | 0.021  | 0.049

What this shows:

  • vLLM was the fastest to generate the first token at the lowest and highest concurrency levels, with strong scaling characteristics.
  • SGLang had the most stable per-token latency, consistently around 4–21 ms across different loads.
  • TensorRT-LLM showed the slowest time to first token but maintained competitive per-token performance at lower concurrency levels.
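To see how the two latency numbers combine, here is a back-of-the-envelope model (our own simplification, ignoring queuing and load effects): end-to-end latency for an N-token response is roughly TTFT plus N − 1 inter-token gaps. Using the concurrency-1 figures from the tables above:

```python
# Rough end-to-end latency model: TTFT plus (N - 1) inter-token gaps.
# The (ttft, per_token) pairs below are the concurrency-1 results above.

def e2e_latency(ttft_s, per_token_s, n_tokens):
    """Estimated wall time to stream an n_tokens response, in seconds."""
    return ttft_s + (n_tokens - 1) * per_token_s

frameworks = {
    "vLLM":         (0.053, 0.005),
    "SGLang":       (0.125, 0.004),
    "TensorRT-LLM": (0.177, 0.004),
}

# Estimated wall time for a 500-token response at concurrency 1.
for name, (ttft, per_tok) in frameworks.items():
    print(f"{name}: {e2e_latency(ttft, per_tok, 500):.2f} s")
```

The estimate shows that for long responses the per-token term dominates: despite vLLM's faster first token at concurrency 1, SGLang's 4 ms inter-token rate makes it the quickest end-to-end on a 500-token generation in this simple model.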

Throughput Results

When it comes to serving multiple requests, throughput is the number to watch. Here’s how the three frameworks performed as concurrency increased:

Total Throughput (tokens/second)

Concurrency | vLLM    | SGLang  | TensorRT-LLM
1           | 187.15  | 230.96  | 242.79
10          | 863.15  | 988.18  | 867.21
50          | 2211.85 | 3108.75 | 2162.95
100         | 4741.62 | 3221.84 | 1942.64

One of the most striking findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed the strongest performance at moderate to high concurrency (10 to 50 requests), while TensorRT-LLM delivered the best single-request throughput but weaker scaling at high concurrency.
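One way to quantify this scaling behavior is to compare measured throughput against ideal linear scaling (N concurrent requests delivering N times the single-request rate). The sketch below is our own calculation on the table's numbers, not part of the original benchmark harness:

```python
# Scaling efficiency: measured throughput at concurrency N divided by the
# ideal of N times the single-request throughput. Data from the table above.

throughput = {  # framework -> {concurrency: tokens/sec}
    "vLLM":         {1: 187.15, 10: 863.15, 50: 2211.85, 100: 4741.62},
    "SGLang":       {1: 230.96, 10: 988.18, 50: 3108.75, 100: 3221.84},
    "TensorRT-LLM": {1: 242.79, 10: 867.21, 50: 2162.95, 100: 1942.64},
}

def scaling_efficiency(tps, concurrency):
    """Fraction of ideal linear scaling retained at the given concurrency."""
    return tps[concurrency] / (concurrency * tps[1])

for name, tps in throughput.items():
    print(f"{name}: {scaling_efficiency(tps, 100):.1%} of ideal at 100 requests")
```

By this measure, vLLM retains roughly a quarter of ideal linear scaling at 100 requests, versus about 14% for SGLang and 8% for TensorRT-LLM, which matches the ranking described above.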

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time-to-first-token for single requests, throughput drop at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications, scenarios requiring consistent token generation timing.

vLLM

  • Strengths: Fast time-to-first-token, highest throughput at high concurrency, excellent scaling.

  • Weaknesses: Slightly higher per-token latency under heavy load.

  • Best For: Interactive applications, high-concurrency deployments, scenarios prioritizing fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time-to-first-token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, scenarios where hardware optimization matters more than scaling.

Conclusion

There is no single framework that outperforms across all categories. Instead, each has been optimized for different goals, and the right choice depends on workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments requiring fast responses and maximum throughput scaling.
  • Choose SGLang when moderate throughput and consistent performance are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.
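These recommendations can be condensed into a toy selection helper. The concurrency thresholds below are illustrative choices of ours, not values taken from the benchmark:

```python
# Toy decision helper encoding the recommendations above: pick a framework
# from the expected number of concurrent requests. Thresholds are
# illustrative, not prescriptive.

def pick_framework(concurrent_requests, hardware="H100"):
    if hardware == "B200":
        return "TensorRT-LLM"  # the B200 results noted later favor TensorRT-LLM
    if concurrent_requests <= 5:
        return "TensorRT-LLM"  # strongest single-request throughput
    if concurrent_requests <= 50:
        return "SGLang"        # best throughput at moderate concurrency
    return "vLLM"              # best scaling at high concurrency

print(pick_framework(1))    # TensorRT-LLM
print(pick_framework(50))   # SGLang
print(pick_framework(200))  # vLLM
```

A real deployment decision would of course also weigh quantization support, setup effort, and operational familiarity, which a one-dimensional rule like this ignores.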

The key takeaway is that choosing the right framework depends on workload type and hardware availability, rather than looking for a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA’s latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock maximum performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai’s Compute Orchestration, you can serve GPT-OSS-120B, any other open-weight model, or your own custom models from your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. The best part is that you are not locked into a single framework. You can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best performance from your hardware. Check out the documentation to learn how to upload your own models.


