
Benchmarking GPT-OSS Across H100s and B200s



This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.

Benchmarking GPT-OSS Across H100s and B200s

OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for strong instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.

With a Mixture of Experts (MoE) design, an extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines large scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
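As a concrete illustration, here is a minimal sketch of selecting a reasoning level at request time, assuming gpt-oss-120b is served behind an OpenAI-compatible endpoint (for example via vLLM). The base URL and model name are placeholders; gpt-oss reads the desired reasoning effort from the system prompt.

```python
# Minimal sketch: choosing a gpt-oss reasoning level per request.
# Assumes an OpenAI-compatible server (e.g., vLLM) is already serving
# gpt-oss-120b; the base_url and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # gpt-oss takes the reasoning effort from the system prompt.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan the steps to migrate a Flask app to FastAPI."},
    ],
)
print(response.choices[0].message.content)
```

Dropping the system prompt to "Reasoning: low" trades depth of reasoning for lower latency and cost.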

Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads of 50–100 requests. Key findings include:

  • Single-request speed: B200 with TensorRT-LLM delivers a 0.023s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.

  • High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency.

  • Efficiency: One B200 can replace two H100s at equal or better performance, with lower power use and less complexity.

  • Performance gains: Some workloads see up to 15x faster inference compared to a single H100.

For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.
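For context on what these numbers capture, the sketch below shows one common way to time TTFT against a streaming, OpenAI-compatible endpoint. It is illustrative only, not the harness used to produce the results above, and the base URL and model name are assumptions.

```python
# Illustrative TTFT measurement: time from sending the request to the
# arrival of the first content token on the response stream.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the MoE architecture."}],
    stream=True,
)
for chunk in stream:
    # The first non-empty delta marks the time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break
```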

If you’re looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.

Developer Plan

Last month we launched Local Runners, and the response from developers has been incredible. From AI hobbyists to production teams, many have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.
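As a rough sketch of that flow, the snippet below calls a model served by a Local Runner through a public, OpenAI-compatible Clarifai endpoint. The endpoint path, model URL, and CLARIFAI_PAT environment variable are assumptions to adapt to your own setup.

```python
# Minimal sketch: reaching a Local Runner-hosted model via a public API.
# Assumes the runner is registered with Clarifai and an OpenAI-compatible
# endpoint is available; base_url and model URL are placeholders.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],  # your personal access token
)

response = client.chat.completions.create(
    model="https://clarifai.com/your-user/your-app/models/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello from my own hardware!"}],
)
print(response.choices[0].message.content)
```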

Now, with the arrival of the latest GPT-OSS models, including gpt-oss-20b, you can run these advanced reasoning models locally with full control over your compute and the ability to deploy agentic workflows immediately.

To make it even easier, we’re introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:

Check out the Developer Plan and start running your own models locally today. If you’re ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.

Published Models

We have expanded our model library with new open-weight and specialized models that are ready to use in your workflows.

The latest additions include:

  • GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.

  • GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.

  • Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well suited for code generation, refactoring, and development automation.

You can start exploring these models directly in the Clarifai Playground or access them via the API to integrate them into your applications.
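For example, here is a minimal sketch using the Clarifai Python SDK to call one of these models. The model URL is illustrative, and the call assumes a text-in, text-out model plus a CLARIFAI_PAT environment variable holding your access token.

```python
# Illustrative call to a published model with the Clarifai Python SDK.
# The model URL is a placeholder; substitute the model you want to use.
import os

from clarifai.client.model import Model

model = Model(
    url="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    pat=os.environ["CLARIFAI_PAT"],
)

prediction = model.predict_by_bytes(
    b"Write a one-line summary of Mixture of Experts models.",
    input_type="text",
)
print(prediction.outputs[0].data.text.raw)
```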

Ollama Support

Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose these locally running models via a secure public API.
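If you want to sanity-check a locally running model before exposing it, a quick call to Ollama's local REST API (default port 11434) works; this sketch assumes you have already pulled the gpt-oss:20b model with Ollama.

```python
# Quick local check of an Ollama-served model via its built-in REST API.
# Assumes `ollama pull gpt-oss:20b` has already been run on this machine.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:20b", "prompt": "Say hello in one line.", "stream": False},
)
print(resp.json()["response"])
```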

We’ve also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.

Read our step-by-step guide on running Ollama models locally and making them accessible via API.

Playground Enhancements

You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.

We’ve also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.


Additional Updates

Python SDK:

  • Improved logging, pipeline handling, authentication, Local Runner support, and code validation.

  • Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.

Platform:

Clarifai Organizations:

Ready to start building?

With Clarifai’s Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, other open-source models, and your own custom models on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware with full control over performance, cost, and security.


