Sunday, August 31, 2025

Evaluating SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B



Introduction

The ecosystem of LLM inference frameworks has been evolving quickly. As models become larger and more capable, the frameworks that power them are forced to keep pace, optimizing for everything from latency to throughput to memory efficiency. For developers, researchers, and enterprises alike, the choice of framework can dramatically affect both performance and cost.

In this blog, we bring these considerations together by comparing SGLang, vLLM, and TensorRT-LLM. We evaluate how each performs when serving GPT-OSS-120B on 2x NVIDIA H100 GPUs. The results highlight the distinctive strengths of each framework and offer practical guidance on which to choose based on your workload and hardware.

Overview of the Frameworks

SGLang: SGLang was designed around the idea of structured generation. It brings unique abstractions like RadixAttention and specialized state management that allow it to deliver low latency for interactive applications. This makes SGLang especially appealing when the workload requires precise control over outputs, such as when generating structured data formats or working with agentic workflows.

vLLM: vLLM has established itself as one of the leading open-source inference frameworks for serving large language models at scale. Its key advantage lies in throughput, powered by continuous batching and efficient memory management through PagedAttention. It also offers broad support for quantization methods like INT8, INT4, GPTQ, AWQ, and FP8, making it a versatile choice for those who need to maximize tokens per second across many concurrent requests.

TensorRT-LLM: TensorRT-LLM is NVIDIA’s TensorRT-based inference runtime, purpose-built to extract maximum performance from NVIDIA GPUs. It is deeply optimized for the Hopper and Blackwell architectures, which means it takes full advantage of hardware features in the H100 and B200. The result is higher efficiency, faster response times, and better scaling as workloads grow. While it requires a bit more setup and tuning compared to the other frameworks, TensorRT-LLM represents NVIDIA’s vision for production-grade inference performance.

Framework    | Design Focus                          | Key Strengths
SGLang       | Structured generation, RadixAttention | Low latency, efficient token generation
vLLM         | Continuous batching, PagedAttention   | High throughput, supports quantization
TensorRT-LLM | TensorRT optimizations                | GPU-level efficiency, lowest latency on H100/B200

Benchmark Setup and Results

To evaluate the three frameworks fairly, we ran GPT-OSS-120B on 2x NVIDIA H100 GPUs under a variety of conditions. GPT-OSS-120B is a large mixture-of-experts model that pushes the boundaries of open-weight performance. Its size and complexity make it a demanding benchmark, which is exactly why it is ideal for testing inference frameworks and hardware.

We measured three main categories of performance:

  • Latency – How fast the model generates the first token (TTFT) and how quickly it produces subsequent tokens.
  • Throughput – How many tokens per second can be generated under varying levels of concurrency.
  • Concurrency scaling – How well each framework holds up as the number of simultaneous requests increases.
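To make the first two definitions concrete, here is a minimal Python sketch (a hypothetical helper of ours, not part of any of the frameworks) that derives TTFT and per-token latency from the arrival times of streamed tokens:

```python
# Hypothetical sketch: compute TTFT and per-token latency from the timestamps
# at which streamed tokens arrive, measured in seconds since the request start.

def latency_metrics(token_arrival_times):
    """Return (TTFT, mean inter-token latency) for one streamed response."""
    if not token_arrival_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_arrival_times[0]  # time to first token
    if len(token_arrival_times) == 1:
        return ttft, 0.0
    # Per-token latency is the mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_arrival_times, token_arrival_times[1:])]
    return ttft, sum(gaps) / len(gaps)

# Example: first token after 0.125 s, then one token every 4 ms.
times = [0.125 + 0.004 * i for i in range(5)]
ttft, per_token = latency_metrics(times)
print(round(ttft, 3), round(per_token, 3))  # 0.125 0.004
```

In a real benchmark, the timestamps would come from a streaming client recording each chunk's arrival; the arithmetic is the same.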

Latency Results

Let’s start with latency. When you care about responsiveness, two things matter most: the time to first token and the per-token latency once decoding begins.

Here’s how the three frameworks stacked up:

Time to First Token (seconds)

Concurrency | vLLM  | SGLang | TensorRT-LLM
1           | 0.053 | 0.125  | 0.177
10          | 1.91  | 1.155  | 2.496
50          | 7.546 | 3.08   | 4.14
100         | 1.87  | 8.99   | 15.467

Per-Token Latency (seconds)

Concurrency | vLLM  | SGLang | TensorRT-LLM
1           | 0.005 | 0.004  | 0.004
10          | 0.011 | 0.01   | 0.009
50          | 0.021 | 0.015  | 0.018
100         | 0.019 | 0.021  | 0.049

What this shows:

  • vLLM was the fastest to generate the first token at the lowest and highest concurrency levels, with strong scaling characteristics.
  • SGLang had the most stable per-token latency, consistently around 4–21 ms across different loads.
  • TensorRT-LLM showed the slowest time to first token but maintained competitive per-token performance at lower concurrency levels.
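To see how the two latency numbers combine, here is a back-of-the-envelope model (our own simplification, ignoring queuing and load effects): end-to-end latency for an N-token response is roughly TTFT plus N − 1 inter-token gaps. Using the concurrency-1 figures from the tables above:

```python
# Rough end-to-end latency model: TTFT plus (N - 1) inter-token gaps.
# The (ttft, per_token) pairs below are the concurrency-1 results above.

def e2e_latency(ttft_s, per_token_s, n_tokens):
    """Estimated wall time to stream an n_tokens response, in seconds."""
    return ttft_s + (n_tokens - 1) * per_token_s

frameworks = {
    "vLLM":         (0.053, 0.005),
    "SGLang":       (0.125, 0.004),
    "TensorRT-LLM": (0.177, 0.004),
}

# Estimated wall time for a 500-token response at concurrency 1.
for name, (ttft, per_tok) in frameworks.items():
    print(f"{name}: {e2e_latency(ttft, per_tok, 500):.2f} s")
```

The estimate shows that for long responses the per-token term dominates: despite vLLM's faster first token at concurrency 1, SGLang's 4 ms inter-token rate makes it the quickest end-to-end on a 500-token generation in this simple model.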

Throughput Results

When it comes to serving multiple requests, throughput is the number to watch. Here’s how the three frameworks performed as concurrency increased:

Total Throughput (tokens/second)

Concurrency | vLLM    | SGLang  | TensorRT-LLM
1           | 187.15  | 230.96  | 242.79
10          | 863.15  | 988.18  | 867.21
50          | 2211.85 | 3108.75 | 2162.95
100         | 4741.62 | 3221.84 | 1942.64

One of the most striking findings was that vLLM achieved the highest throughput at 100 concurrent requests, reaching 4,741 tokens per second. SGLang showed the strongest performance at moderate to high concurrency (10 to 50 requests), while TensorRT-LLM delivered the best single-request throughput but weaker scaling at high concurrency.
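One way to quantify this scaling behavior is to compare measured throughput against ideal linear scaling (N concurrent requests delivering N times the single-request rate). The sketch below is our own calculation on the table's numbers, not part of the original benchmark harness:

```python
# Scaling efficiency: measured throughput at concurrency N divided by the
# ideal of N times the single-request throughput. Data from the table above.

throughput = {  # framework -> {concurrency: tokens/sec}
    "vLLM":         {1: 187.15, 10: 863.15, 50: 2211.85, 100: 4741.62},
    "SGLang":       {1: 230.96, 10: 988.18, 50: 3108.75, 100: 3221.84},
    "TensorRT-LLM": {1: 242.79, 10: 867.21, 50: 2162.95, 100: 1942.64},
}

def scaling_efficiency(tps, concurrency):
    """Fraction of ideal linear scaling retained at the given concurrency."""
    return tps[concurrency] / (concurrency * tps[1])

for name, tps in throughput.items():
    print(f"{name}: {scaling_efficiency(tps, 100):.1%} of ideal at 100 requests")
```

By this measure, vLLM retains roughly a quarter of ideal linear scaling at 100 requests, versus about 14% for SGLang and 8% for TensorRT-LLM, which matches the ranking described above.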

Framework Analysis and Recommendations

SGLang

  • Strengths: Stable per-token latency, strong throughput at moderate concurrency, good overall balance.

  • Weaknesses: Slower time-to-first-token for single requests, throughput drop at 100 concurrent requests.

  • Best For: Moderate- to high-throughput applications, scenarios requiring consistent token generation timing.

vLLM

  • Strengths: Fast time-to-first-token, highest throughput at high concurrency, excellent scaling.

  • Weaknesses: Slightly higher per-token latency under heavy load.

  • Best For: Interactive applications, high-concurrency deployments, scenarios prioritizing fast initial responses and maximum throughput scaling.

TensorRT-LLM

  • Strengths: Best single-request throughput, competitive per-token latency at low concurrency, hardware-optimized performance.

  • Weaknesses: Slowest time-to-first-token, poor scaling at high concurrency, significantly degraded per-token latency at 100 requests.

  • Best For: Single-user or low-concurrency applications, scenarios where hardware optimization matters more than scaling.

Conclusion

There is no single framework that outperforms across all categories. Instead, each has been optimized for different goals, and the right choice depends on workload and infrastructure.

  • Use vLLM for interactive applications and high-concurrency deployments requiring fast responses and maximum throughput scaling.
  • Choose SGLang when moderate throughput and consistent performance are needed.
  • Deploy TensorRT-LLM for single-user applications or when maximizing hardware efficiency at low concurrency is the priority.
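These recommendations can be condensed into a toy selection helper. The concurrency thresholds below are illustrative choices of ours, not values taken from the benchmark:

```python
# Toy decision helper encoding the recommendations above: pick a framework
# from the expected number of concurrent requests. Thresholds are
# illustrative, not prescriptive.

def pick_framework(concurrent_requests, hardware="H100"):
    if hardware == "B200":
        return "TensorRT-LLM"  # the B200 results noted later favor TensorRT-LLM
    if concurrent_requests <= 5:
        return "TensorRT-LLM"  # strongest single-request throughput
    if concurrent_requests <= 50:
        return "SGLang"        # best throughput at moderate concurrency
    return "vLLM"              # best scaling at high concurrency

print(pick_framework(1))    # TensorRT-LLM
print(pick_framework(50))   # SGLang
print(pick_framework(200))  # vLLM
```

A real deployment decision would of course also weigh quantization support, setup effort, and operational familiarity, which a one-dimensional rule like this ignores.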

The key takeaway is that choosing the right framework depends on workload type and hardware availability, rather than looking for a universal winner. Running GPT-OSS-120B on NVIDIA H100 GPUs with these optimized inference frameworks unlocks powerful options for building and deploying AI applications at scale.

It is worth noting that these performance characteristics can shift dramatically depending on your GPU hardware. We also extended the benchmarks to B200 GPUs, where TensorRT-LLM consistently outperformed both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA’s latest hardware architecture.

This highlights that framework selection is not just about software capabilities; it is equally about matching the right framework to your specific hardware to unlock maximum performance potential.

 

You can explore the full set of benchmark results here.

Bonus: Serve a Model with Your Preferred Framework

Getting started with these frameworks is straightforward. With Clarifai’s Compute Orchestration, you can serve GPT-OSS-120B, any other open-weight model, or your own custom models from your preferred inference engine, whether that is SGLang, vLLM, or TensorRT-LLM.

From setting up the runtime to deploying a production-ready API, you can quickly go from model to application. The best part is that you are not locked into a single framework. You can experiment with different runtimes and choose the one that best aligns with your performance and cost requirements.

This flexibility makes it easy to integrate cutting-edge frameworks into your workflows and ensures you are always getting the best performance from your hardware. Check out the documentation to learn how to upload your own models.


