
NVIDIA Researchers Propose Reinforcement Learning Pretraining (RLP): Reinforcement as a Pretraining Objective for Building Reasoning During Pretraining


NVIDIA AI has introduced Reinforcement Learning Pretraining (RLP), a training objective that injects reinforcement learning into the pretraining stage rather than deferring it to post-training. The core idea is simple and testable: treat a short chain-of-thought (CoT) as an action sampled before next-token prediction, and reward it by the information gain it provides on the observed next token, measured against a no-think EMA baseline. This produces a verifier-free, dense, position-wise reward that can be applied to ordinary text streams at pretraining scale.

https://github.com/NVlabs/RLP/blob/main/pdf/RLP_Reinforcement_as_a_Pretraining_Objective.pdf

Mechanism: Information-Gain Rewards with an EMA Counterfactual

RLP uses a single network (shared parameters) to (1) sample a CoT policy $\pi_\theta(c_t \mid x_{<t})$ and then (2) score the next token $p_\theta(x_t \mid x_{<t}, c_t)$. A slowly updated EMA teacher $p_\phi(x_t \mid x_{<t})$ provides a no-think counterfactual. The per-token reward is the log-likelihood ratio

$$r(c_t) = \log p_\theta(x_t \mid x_{<t}, c_t) - \log p_\phi(x_t \mid x_{<t}),$$

computed under teacher forcing. Training updates only the thought tokens, using a clipped surrogate with per-token importance ratios and group-relative advantages (multiple sampled thoughts per context reduce variance). The objective maximizes expected information gain; theoretical results connect the expected reward to reductions in cross-entropy and bound it via marginalization over thoughts.
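For concreteness, here is a minimal sketch of how this information-gain reward could be computed, assuming a Hugging Face-style causal LM interface that returns `.logits`; the function and variable names are illustrative and are not taken from NVIDIA's released code.

```python
# Minimal sketch of RLP's information-gain reward (illustrative; not NVIDIA's released code).
# Assumes a Hugging Face-style causal LM; shapes are simplified to a single sequence.
import torch
import torch.nn.functional as F

def token_logprobs(logits, ids, start):
    """Log-probability of ids[:, start:] under the logits that predict them (teacher forcing)."""
    logprobs = F.log_softmax(logits[:, start - 1:-1, :], dim=-1)
    return logprobs.gather(-1, ids[:, start:].unsqueeze(-1)).squeeze(-1)

@torch.no_grad()
def information_gain_rewards(policy_model, ema_teacher, context_ids, thought_ids, next_ids):
    """Score a sampled thought by how much it improves next-token log-likelihood.

    context_ids: prefix tokens x_{<t}         (1, T_ctx)
    thought_ids: sampled chain-of-thought c_t (1, T_cot)
    next_ids:    observed next tokens x_t     (1, T_next), scored under teacher forcing
    """
    # log p_theta(x_t | x_{<t}, c_t): the shared policy network conditioned on context + thought.
    with_thought = torch.cat([context_ids, thought_ids, next_ids], dim=-1)
    start_think = context_ids.size(-1) + thought_ids.size(-1)
    logp_think = token_logprobs(policy_model(with_thought).logits, with_thought, start_think)

    # log p_phi(x_t | x_{<t}): the slowly updated EMA teacher sees the same context, no thought.
    no_thought = torch.cat([context_ids, next_ids], dim=-1)
    logp_nothink = token_logprobs(ema_teacher(no_thought).logits, no_thought, context_ids.size(-1))

    # Dense, verifier-free, position-wise reward: r(c_t) = log p_theta(...) - log p_phi(...).
    return logp_think - logp_nothink
```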

Why this matters technically: unlike prior "reinforcement pretraining" variants that rely on sparse, binary correctness signals or proxy filters, RLP's dense, verifier-free reward attaches position-wise credit wherever thinking improves prediction, enabling updates at every token position over entire web-scale corpora without external verifiers or curated answer keys.
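To make the position-wise credit explicit (a restatement of the objective above, not the paper's exact derivation), the expected reward at position $t$ compares per-token log-likelihoods:

$$\mathbb{E}_{c_t \sim \pi_\theta(\cdot \mid x_{<t})}\big[r(c_t)\big] \;=\; \mathbb{E}_{c_t}\big[\log p_\theta(x_t \mid x_{<t}, c_t)\big] \;-\; \log p_\phi(x_t \mid x_{<t}).$$

This quantity is positive exactly where thinking raises the likelihood of the observed token, and by Jensen's inequality it is upper-bounded by the log-likelihood gain of the thought-marginalized predictor, $\log \mathbb{E}_{c_t}\big[p_\theta(x_t \mid x_{<t}, c_t)\big] - \log p_\phi(x_t \mid x_{<t})$, which is one way to read the stated link between expected reward, cross-entropy reduction, and marginalization over thoughts.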

Understanding the Results

Qwen3-1.7B-Base: Pretraining with RLP improved the overall math+science average by ~19% vs. the base model and ~17% vs. compute-matched continuous pretraining (CPT). After identical post-training (SFT + RLVR) across all variants, the RLP-initialized model retained a ~7–8% relative advantage, with the largest gains on reasoning-heavy benchmarks (AIME25, MMLU-Pro).

Nemotron-Nano-12B v2: Applying RLP to a 12B hybrid Mamba-Transformer checkpoint lifted the overall average from 42.81% to 61.32%, with an absolute +23% gain on scientific reasoning, even though the RLP run used ~200B fewer tokens (training for 19.8T vs. 20T tokens; RLP applied for 250M tokens). This highlights data efficiency and architecture-agnostic behavior.


RPT comparison: Under matched data and compute with Omni-MATH-style settings, RLP outperformed RPT on math, science, and overall averages, attributed to RLP's continuous information-gain reward versus RPT's sparse binary signal and entropy-filtered tokens.


Positioning vs. Post-Training RL and Data Curation

Reinforcement Learning Pretraining (RLP) is orthogonal to post-training pipelines (SFT, RLVR) and shows compounding improvements after standard alignment. Because the reward is computed from model log-evidence rather than external verifiers, it scales to domain-agnostic corpora (web crawl, academic text, textbooks) and SFT-style reasoning corpora, avoiding the brittleness of narrow curated datasets. In compute-matched comparisons (including CPT run on 35× more tokens to match FLOPs), RLP still led on overall averages, suggesting the improvements derive from objective design, not budget.

Key Takeaways

  • RLP makes reasoning a pretraining objective: sample a chain-of-thought before next-token prediction and reward it by information gain over a no-think EMA baseline.
  • Verifier-free, dense, position-wise signal: works on ordinary text streams without external graders, enabling scalable pretraining updates on every token.
  • Qwen3-1.7B results: +19% vs. Base and +17% vs. compute-matched CPT during pretraining; with identical SFT+RLVR, RLP retains ~7–8% gains (largest on AIME25, MMLU-Pro).
  • Nemotron-Nano-12B v2: overall average rises 42.81% → 61.32% (+18.51 pp; ~35–43% rel.) and +23 points on scientific reasoning, using ~200B fewer NTP tokens.
  • Training details that matter: update gradients only on thought tokens with a clipped surrogate and group-relative advantages; more rollouts (≈16) and longer thought lengths (≈2048) help; token-level KL anchoring offers no benefit. A sketch of this update step follows the list.
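The following is a minimal sketch of that update step under stated assumptions: a group-relative advantage over G sampled thoughts per context (with mean/std normalization, which is one common choice), a PPO-style clipped surrogate with per-token importance ratios, and a mask so that only thought tokens receive gradient. Function names and the exact normalization are illustrative, not taken from the paper's code.

```python
# Illustrative RLP-style update step (sketch; not NVIDIA's released implementation).
import torch

def rlp_policy_loss(logp_new, logp_old, rewards, thought_mask, clip_eps=0.2):
    """
    logp_new:     (G, T) token log-probs under the current policy
    logp_old:     (G, T) token log-probs under the policy that sampled the thoughts
    rewards:      (G,)   reward per sampled thought, e.g. aggregated information gain r(c_t)
    thought_mask: (G, T) 1.0 on chain-of-thought tokens, 0.0 elsewhere
    """
    # Group-relative advantage: each of the G thoughts sampled for the same context is
    # compared against the group mean (multiple thoughts per context reduce variance).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)  # broadcast over token positions

    # Per-token importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Only thought tokens contribute gradient; next-token scoring stays teacher-forced.
    per_token = -torch.minimum(unclipped, clipped) * thought_mask
    return per_token.sum() / thought_mask.sum().clamp(min=1.0)
```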

Conclusion

RLP reframes pretraining to directly reward "think-before-predict" behavior using a verifier-free, information-gain signal, yielding reasoning gains that persist through identical SFT+RLVR and extend across architectures (Qwen3-1.7B, Nemotron-Nano-12B v2). The method's objective, which contrasts CoT-conditioned likelihood against a no-think EMA baseline, integrates cleanly into large-scale pipelines without curated verifiers, making it a practical upgrade to next-token pretraining rather than a post-training add-on.


Check out the Paper, Code, and Project Page.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
