This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.
GPT-OSS-120B: Benchmarking Speed, Scale, and Cost Efficiency
Artificial Analysis has benchmarked Clarifai's Compute Orchestration with the GPT-OSS-120B model, one of the most advanced open-source large language models available today. The results position Clarifai as one of the top hardware- and GPU-agnostic engines for AI workloads where speed, flexibility, efficiency, and reliability matter most.
What the benchmark shows (P50, last 72 hours; single query, 1k-token prompt):
- High throughput: 313 output tokens per second, among the very fastest measured in this configuration.
- Low latency: 0.27s time to first token (TTFT), so responses begin streaming almost instantly.
- Compelling price/performance: positioned in the benchmark's "most attractive quadrant" (high speed + low cost).
Pricing that scales:
Clarifai offers GPT-OSS-120B at $0.09 per 1M input tokens and $0.36 per 1M output tokens. Artificial Analysis reports a blended price (3:1 input:output) of just $0.16 per 1M tokens, placing Clarifai well below the $0.26–$0.28 cluster of competitors while matching or exceeding their performance.
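For reference, the blended figure follows directly from the listed prices under the 3:1 input:output weighting: (3 × $0.09 + 1 × $0.36) / 4 ≈ $0.16 per 1M tokens.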
Below is a comparison of output speed versus price across leading providers for GPT-OSS-120B. Clarifai stands out in the "most attractive quadrant," combining high throughput with competitive pricing.
Output Speed vs. Price
This chart compares latency (time to first token) against output speed. Clarifai demonstrates one of the lowest latencies while maintaining top-tier throughput, placing it among the best-in-class providers.
Latency vs. Output Speed
Why GPT-OSS-120B Matters
As one of the leading open-source "GPT-OSS" models, GPT-OSS-120B reflects the growing demand for transparent, community-driven alternatives to closed-source LLMs. Running a model of this scale requires infrastructure that can not only deliver high speed and low latency, but also keep costs under control at production scale. That's exactly where Clarifai's Compute Orchestration makes a difference.
Why This Benchmark Matters
These results are more than numbers: they show how Clarifai has engineered every layer of the stack to optimize GPU utilization. With Compute Orchestration, multiple models can run on the same GPUs, workloads scale elastically, and enterprises can squeeze more value out of every accelerator. The payoff is fast, reliable, and cost-efficient inference that can support both experimentation and large-scale deployment.
Check the full benchmarks on Artificial Analysis here.
Here's a quick demo of how you can access the GPT-OSS-120B model in the Playground.
Local Runners
Local Runners let you develop and run models on your own hardware (laptops, workstations, edge boxes) while making them callable through Clarifai's cloud API. Clarifai handles the public URL, routing, and authentication; your model executes locally and your data stays on your machine. To callers, it behaves like any other Clarifai-hosted model.
Why teams use Local Runners
- Build where your data and tools live. Keep models close to local files, internal databases, and OS-level utilities.
- No custom networking. Start a runner and get a public URL, with no port forwarding or reverse proxies.
- Use your own compute. Bring your GPUs and custom setups; the platform still provides the API, workflows, and governance around them.
New: Ollama Toolkit (now in the CLI)
We've added an Ollama Toolkit to the Clarifai CLI so you can initialize an Ollama-backed model directory in a single command (and choose any model from the Ollama library). It pairs perfectly with Local Runners: download, run, and expose an Ollama model via a public API with minimal setup.
The CLI supports `--toolkit ollama`, plus flags like `--model-name`, `--port`, and `--context-length`, making it easy to target specific Ollama models.
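For example, scaffolding a GPT-OSS 20B project could look like the sketch below. The flags are the ones listed above; the `clarifai model init` subcommand name and the port and context-length values are assumptions for illustration.

```bash
# Sketch (subcommand name assumed): scaffold an Ollama-backed model directory.
# Port and context length are illustrative; 11434 is Ollama's default local port.
clarifai model init --toolkit ollama \
  --model-name gpt-oss:20b \
  --port 11434 \
  --context-length 8192
```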
Example workflow: run Gemma 3 270M or GPT-OSS 20B locally and serve it through a public API
1. Pick a model in Ollama.
- Gemma 3 270M (tiny, fast; 32K context): `gemma3:270m`
- GPT-OSS 20B (OpenAI open-weight, optimized for local use): `gpt-oss:20b`
2. Initialize the project with the Ollama Toolkit. Use the command above, swapping `--model-name` for your pick (e.g., `gpt-oss:20b`). This creates a new model directory structure that is compatible with the Clarifai platform. You can customize or optimize the generated model by editing the `1/model.py` file as needed.
3. Start your Local Runner. From the model directory, start the runner (see the sketch below). The runner registers with Clarifai and exposes your local model via a public URL, and the CLI prints a ready-to-run client snippet.
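A minimal sketch of this step, assuming the CLI's `local-runner` subcommand; follow the command the CLI actually prints if it differs.

```bash
# Sketch: start a Local Runner from inside the generated model directory
cd my-ollama-model        # hypothetical directory name created by the init step
clarifai model local-runner
```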
4. Call it like any Clarifai model. For example (Python SDK), see the sketch below. Behind the scenes, the API call is routed to your machine; results return to the caller over Clarifai's secure control plane.
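The snippet below is a minimal sketch: the model URL and PAT are placeholders, and the exact call is best copied from the client snippet the CLI prints when the runner starts.

```python
from clarifai.client.model import Model

# Placeholder model URL and PAT: substitute the values for the model you created
model = Model(
    url="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    pat="YOUR_CLARIFAI_PAT",
)

# The request goes to Clarifai's API and is routed to the runner on your machine
response = model.predict_by_bytes(
    b"Write a one-line summary of Local Runners.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```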
Deep dive: We published a step-by-step guide that walks through running Ollama models locally and exposing them with Local Runners. Check it out here.
Try it on the Developer Plan
You can start for free, or use the Developer Plan ($1/month for the first year), which includes up to 5 Local Runners and unlimited runner hours.
Check out the full example and setup guide in the documentation here.
Billing
We've made billing more transparent and flexible with this release. Monthly spending limits have been introduced: $100 for the Developer and Essential plans, and $500 for the Professional plan. If you need higher limits, you can reach out to our team.
We've also added a new credit card pre-authorization process. A temporary charge is applied to verify card validity and available funds: $50 for Developer, $100 for Essential, and $500 for Professional plans. The amount is automatically refunded within seven days, ensuring a seamless verification experience.
Control Center
- The Control Center gets even more flexible and informative with this update. You can now resize charts to half their original size on the configure page, making side-by-side comparisons smoother and layouts more manageable.
- Charts are smarter too: the Saved Inputs Cost chart now correctly shows the average cost for the selected period, while longer date ranges automatically display weekly aggregated data for easier readability. Empty charts display meaningful messages instead of zeros, so you always know when data isn't available.
- We've also added cross-links between compute cost and usage charts, making it simple to navigate between these views and get a complete picture of your AI infrastructure.
Additional Changes
- Python SDK: Fixed the Local Runner CLI command, updated protocol and gRPC versions, integrated secrets, corrected num_threads defaults, added stream_options validation, prevented downloading original checkpoints, improved model upload and deployment, and added user confirmation to prevent Dockerfile overwrite during uploads. Check all SDK updates here.
- Platform Updates: Added a public resource filter to quickly view Community-shared resources, improved Playground error messaging for streaming limits, and extended login session duration for Google and GitHub SSO users to seven days. Find all platform changes here.
Ready to start building?
With Local Runners, you can now serve models, MCP servers, or agents directly from your own hardware without uploading model weights or managing infrastructure. It's the fastest way to test, iterate, and securely run models from your laptop, workstation, or on-prem server. You can read the documentation to get started, or check out the blog to see how you can run Ollama models locally and expose them via a public API.