Wednesday, August 6, 2025

How It Compares to GLM‑4.5, Qwen3, DeepSeek, and Kimi K2



Introduction

OpenAI has launched gpt‑oss‑120b and gpt‑oss‑20b, a brand new collection of open‑weight reasoning models. Released under the Apache 2.0 license, these text‑only models are designed for robust instruction following, tool use, and strong reasoning, making them well‑suited for integration into advanced agentic workflows. The release reflects OpenAI's ongoing commitment to enabling innovation and encouraging collaborative safety across the AI community.

A key question is how these models compare to other leading offerings in the fast‑moving open‑ and semi‑open‑weight ecosystem. In this blog, we look at GPT‑OSS in detail and compare its capabilities with models like GLM‑4.5, Qwen3 Thinking, DeepSeek R1, and Kimi K2.

GPT‑OSS: Architecture and Core Strengths

The gpt‑oss models build on the foundations of GPT‑2 and GPT‑3, incorporating a Mixture‑of‑Experts (MoE) design to improve efficiency during both training and inference. This approach activates only a subset of parameters per token, giving the models the scale of very large systems while keeping compute costs under control.

There are two models in the family:

  • gpt‑oss‑120b: 116.8 billion total parameters, with about 5.1 billion active per token across 36 layers.

  • gpt‑oss‑20b: 20.9 billion total parameters, with 3.6 billion active per token across 24 layers.
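
The significance of these figures is easy to see with a back‑of‑the‑envelope calculation: in an MoE model, per‑token compute scales with the active parameter count, not the total. A minimal sketch, using only the numbers above:

```python
# Back-of-the-envelope: fraction of parameters active per token in each gpt-oss model.
models = {
    "gpt-oss-120b": {"total_b": 116.8, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 20.9, "active_b": 3.6},
}

for name, p in models.items():
    frac = p["active_b"] / p["total_b"]
    print(f"{name}: {frac:.1%} of parameters active per token")
# → gpt-oss-120b: 4.4% of parameters active per token
# → gpt-oss-20b: 17.2% of parameters active per token
```

In other words, the 120b model carries the knowledge capacity of a ~117B‑parameter system while paying roughly the per‑token compute of a ~5B dense model.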

Both models share several architectural choices:

  • Residual stream dimension of 2880.

  • Grouped Query Attention with 64 query heads and 8 key‑value heads.

  • Rotary position embeddings for improved contextual reasoning.

  • Extended context length of 131,072 tokens using YaRN.

To make deployment practical, OpenAI applied MXFP4 quantization to the MoE weights. This allows the 120‑billion‑parameter model to run on a single 80 GB GPU and the 20‑billion‑parameter variant to operate on hardware with as little as 16 GB of memory.
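
The arithmetic behind these memory claims can be sketched roughly. MXFP4 stores about 4.25 bits per weight (4‑bit values plus a shared 8‑bit scale per 32‑weight block); the estimate below simplistically applies that to all parameters, whereas in the actual release only the MoE weights are quantized and other tensors use higher precision, so real footprints differ somewhat:

```python
# Rough weight-storage estimate assuming MXFP4 everywhere (~4.25 bits/weight:
# 4-bit values plus one shared 8-bit scale per 32-weight block).
BITS_PER_WEIGHT = 4 + 8 / 32  # = 4.25

def approx_gb(total_params_billions: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return total_params_billions * 1e9 * BITS_PER_WEIGHT / 8 / 1e9

print(f"gpt-oss-120b: ~{approx_gb(116.8):.0f} GB")  # comfortably under 80 GB
print(f"gpt-oss-20b:  ~{approx_gb(20.9):.0f} GB")   # fits within 16 GB
```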

Another notable feature is variable reasoning effort. Developers can specify "low," "medium," or "high" reasoning levels via the system prompt, which dynamically adjusts the length of the Chain‑of‑Thought (CoT). This provides flexibility in balancing accuracy, latency, and compute cost.
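
A minimal sketch of how this could look in practice. The "Reasoning: high" system‑prompt line follows the convention described for gpt‑oss, but the exact wiring depends on your serving stack, so treat this helper as illustrative rather than a definitive API:

```python
# Illustrative helper: build a chat request that selects a reasoning effort
# level through the system prompt. The payload shape follows the common
# OpenAI-style chat-completions format.

def build_request(user_msg: str, effort: str = "medium") -> dict:
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": user_msg},
        ],
    }

req = build_request("Prove that the square root of 2 is irrational.", effort="high")
```

Raising the effort lengthens the model's CoT, trading latency and tokens for accuracy on hard problems.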

The models are also trained with built‑in support for agentic workflows, including:

  • A browsing tool for real‑time web search and retrieval.

  • A Python tool for stateful code execution in a Jupyter‑like environment.

  • Custom developer functions, enabling complex workflows that interleave reasoning, tool use, and user interaction.
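
As a sketch of what a custom developer function might look like, here is a hypothetical `get_weather` tool declared in the widely used OpenAI‑style function‑calling schema, which most gpt‑oss serving stacks accept; the function name and parameters are invented for illustration:

```python
import json

# Hypothetical tool definition in the common OpenAI-style "tools" schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # example function, not part of gpt-oss itself
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

print(json.dumps(get_weather_tool, indent=2))
```

At inference time the model can decide to emit a call to this function, your code executes it, and the result is fed back for the next reasoning step.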

GPT‑OSS in Context: Comparing Performance Across Models

The open‑model ecosystem is full of capable contenders, including GLM‑4.5, Qwen3 Thinking, DeepSeek R1, and Kimi K2, each with different strengths and trade‑offs. Comparing them with GPT‑OSS gives a clearer view of how these models perform across reasoning, coding, and agentic workflows.

Reasoning and Knowledge

On broad knowledge and reasoning tasks, GPT‑OSS delivers some of the highest scores relative to its size.

  • On MMLU‑Pro, GPT‑OSS‑120b reaches 90.0%, ahead of GLM‑4.5 (84.6%), Qwen3 Thinking (84.4%), DeepSeek R1 (85.0%), and Kimi K2 (81.1%).

  • On competition‑style math, GPT‑OSS shines. On AIME 2024 it hits 96.6% with tools, and on AIME 2025 it pushes to 97.9%, outperforming all of the others.

  • On GPQA, a PhD‑level science benchmark, GPT‑OSS‑120b achieves 80.9% with tools, comparable to GLM‑4.5 (79.1%) and Qwen3 Thinking (81.1%), and just behind DeepSeek R1 (81.0%).

What makes these numbers notable is the balance between model size and performance. GPT‑OSS‑120b is a 116.8B‑parameter model (with only 5.1B parameters active per token thanks to its Mixture‑of‑Experts design). GLM‑4.5 and Qwen3 Thinking are significantly larger full‑parameter models, which partially explains their strong tool‑use and coding results. DeepSeek R1 also leans on higher parameter counts and heavier token usage for reasoning tasks (up to 20k tokens per query), while Kimi K2 is tuned as a smaller but more specialized instruct model.

This means GPT‑OSS achieves frontier‑level reasoning scores while using fewer active parameters, making it more efficient for developers who want deep reasoning without the cost of running very large dense models.

Coding and Software Engineering

Modern AI coding benchmarks focus on a model's ability to understand large codebases, make changes, and execute multi‑step reasoning.

  • On SWE‑bench Verified, GPT‑OSS‑120b scores 62.4%, close to GLM‑4.5 (64.2%) and DeepSeek R1 (≈65.8% in agentic mode).

  • On Terminal‑Bench, GLM‑4.5 leads with 37.5%, followed by Kimi K2 at around 30%.

  • GLM‑4.5 also shows strong results in head‑to‑head agentic coding tasks, with over 50% win rates against Kimi K2 and over 80% against Qwen3, while maintaining a high success rate in tool‑based coding workflows.

Here again, model size matters. GLM‑4.5 is a much larger dense model than GPT‑OSS‑120b and Kimi K2, which gives it an edge in agentic coding workflows. But for developers who want solid code‑editing capabilities in a model that can run on a single 80 GB GPU, GPT‑OSS offers an appealing balance.

Agentic Tool Use and Function Calling

Agentic capabilities, where a model autonomously calls tools, executes functions, and solves multi‑step tasks, are increasingly important.

  • On TAU‑bench Retail, GPT‑OSS‑120b scores 67.8%, compared with GLM‑4.5's 79.7% and Kimi K2's 70.6%.

  • On BFCL‑v3, a function‑calling benchmark, GLM‑4.5 leads with 77.8%, followed by Qwen3 Thinking at 71.9% and GPT‑OSS at around 67–68%.

These results highlight a trade‑off: GLM‑4.5 dominates function calling and agentic workflows, but it does so as a significantly larger, more resource‑intensive model. GPT‑OSS delivers competitive results while staying accessible to developers who can't afford multi‑GPU clusters.

Putting It All Together

Here's a quick snapshot of how these models stack up:

| Benchmark | GPT‑OSS‑120b (High) | GLM‑4.5 | Qwen3 Thinking | DeepSeek R1 | Kimi K2 |
|---|---|---|---|---|---|
| MMLU‑Pro | 90.0% | 84.6% | 84.4% | 85.0% | 81.1% |
| AIME 2024 | 96.6% (with tools) | ~91% | ~91.4% | ~87.5% | ~69.6% |
| AIME 2025 | 97.9% (with tools) | ~92% | ~92.3% | ~87.5% | ~49.5% |
| GPQA Diamond (Science) | ~80.9% (with tools) | 79.1% | 81.1% | 81.0% | 75.1% |
| SWE‑bench Verified | 62.4% | 64.2% | ~65.8% | 65.8% (agentic) | — |
| TAU‑bench Retail | 67.8% | 79.7% | ~67.8% | ~63.9% | ~70.6% |
| BFCL‑v3 Function Calling | ~67–68% | 77.8% | 71.9% | 37.0% | — |

Key takeaways:

  • GPT‑OSS punches above its weight in reasoning and long‑form CoT tasks while using fewer active parameters.

  • GLM‑4.5 is a heavyweight dense model that excels at agentic workflows and function calling but requires far more compute.

  • DeepSeek R1 and Qwen3 offer strong hybrid reasoning performance at larger sizes, while Kimi K2 targets agentic coding workflows with smaller, more specialized setups.

Conclusion

GPT‑OSS brings frontier‑level reasoning and long‑form CoT capabilities with a smaller active‑parameter footprint than many dense models. GLM‑4.5 leads in agentic workflows and function calling but requires significantly more compute. DeepSeek R1 and Qwen3 deliver strong hybrid reasoning at larger scales, while Kimi K2 focuses on specialized coding workflows with a compact setup.

This makes GPT‑OSS a compelling balance of reasoning performance, coding ability, and deployment efficiency, well‑suited for experimentation, integration into agentic systems, or resource‑conscious production workloads.

If you want to try the GPT‑OSS‑20B model, its smaller size makes it practical to run locally on your own hardware using Ollama and expose it via a public API with Clarifai's Local Runners, giving you full control over your compute and keeping your data local. Check out the tutorial here.
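
As a quick sketch, and assuming Ollama is installed and the model is published under the tag `gpt-oss:20b` (the name used at the time of writing), running the model locally takes two commands:

```shell
# Download the 20B model weights, then start an interactive local session
ollama pull gpt-oss:20b
ollama run gpt-oss:20b "Summarize mixture-of-experts routing in two sentences."
```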

If you want to test the full‑scale GPT‑OSS‑120B model, you can try it directly in the playground here.


