Authors: Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel
Paper: https://arxiv.org/abs/2605.06732v2
Code: N/A
Model: N/A
TL;DR
WHAT was done? The authors provide a theoretical and empirical framework that decomposes the return error in model-based reinforcement learning into independent dynamics and reward components. By applying power-law scaling to these distinct error sources, they derive a closed-form solution for optimally allocating a fixed data budget between environment transitions and reward annotations.
WHY it matters? In modern paradigms like Reinforcement Learning from Human Feedback (RLHF) and robotics, reward labels (human preferences or expert evaluations) are significantly more expensive than raw environmental state transitions. This work replaces heuristic hyperparameter tuning with a mathematically rigorous strategy for data collection, proving that the different scaling behaviors of dynamics and reward models dictate fundamentally asymmetric budget allocations.
Executive summary: For research leaders and infrastructure engineers building large-scale world models, this paper delivers a critical mathematical insight: reward models learn much faster than dynamics models. Consequently, data acquisition pipelines should allocate far more of their budget to transition data than to reward annotations. Furthermore, the analysis demonstrates that under a fixed budget, purchasing a large volume of cheap, noisy reward labels is often mathematically superior to procuring a small batch of expensive, high-fidelity labels, provided the noise is zero-mean.
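The noisy-label claim can be illustrated with a deliberately simplified sketch. It assumes the reward model only has to estimate a single mean reward from independent labels with zero-mean noise, so the variance of the estimate is noise_std^2 / n. The function name, costs, and noise levels below are hypothetical illustrations, not values from the paper.

```python
def mean_estimate_variance(budget, cost_per_label, noise_std):
    """Variance of a sample-mean reward estimate under a fixed budget.

    Assumes independent labels with zero-mean noise (illustrative only).
    """
    n_labels = budget / cost_per_label  # how many labels the budget buys
    return noise_std ** 2 / n_labels    # variance of the mean shrinks as 1/n


BUDGET = 1000.0

# Hypothetical cheap, noisy labels: cost 1 each, noise std 2.0.
cheap_var = mean_estimate_variance(BUDGET, cost_per_label=1.0, noise_std=2.0)

# Hypothetical expensive, clean labels: cost 20 each, noise std 0.5.
expensive_var = mean_estimate_variance(BUDGET, cost_per_label=20.0, noise_std=0.5)

print(f"cheap/noisy variance:     {cheap_var:.4f}")
print(f"expensive/clean variance: {expensive_var:.4f}")
```

In this toy setting the comparison reduces to noise_std^2 times cost_per_label, so cheap labels win only when that product is smaller. This is consistent with the summary's "often" rather than "always".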
Details
The Bottleneck of Coupled World Models
The “training in imagination” paradigm, popularized by latent world models like the Dreamer family, optimizes policies entirely within a learned simulation of the environment. Historically, bounds derived from the simulation lemma have either coupled the dynamics and reward errors or assumed access to ground-truth rewards, as in the foundational Lipschitz continuity work by Asadi et al. This theoretical coupling masks a severe practical bottleneck. In reality, learning the physics of an environment (dynamics) and learning the objective function (rewards) are heterogeneous tasks with vastly different sample complexities and procurement costs. By treating the reward model as an independent, controllable source of error with its own dedicated sample budget, this paper establishes a structural foundation for resource allocation in environments where human or expert feedback is the primary financial constraint.
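To make the allocation question concrete, here is a minimal sketch under stated assumptions. It assumes the return error is the sum of two power-law terms, A * n_dyn^(-alpha) + B * n_rew^(-beta), and finds the best budget split by grid search. The functional form, every constant and exponent, and the function names are hypothetical. The grid search stands in for the paper's closed-form solution, which is not reproduced here.

```python
def total_error(n_dyn, n_rew, A=1.0, alpha=0.3, B=1.0, beta=0.6):
    """Hypothetical additive error model: dynamics term plus reward term."""
    return A * n_dyn ** -alpha + B * n_rew ** -beta


def best_split(budget, cost_dyn, cost_rew, steps=10_000, **params):
    """Grid-search the fraction of the budget spent on transition data.

    Returns (error, fraction_on_dynamics, n_dyn, n_rew) for the best split.
    """
    best = None
    for i in range(1, steps):  # skip 0 and 1 so both sample counts stay positive
        frac = i / steps
        n_dyn = frac * budget / cost_dyn
        n_rew = (1.0 - frac) * budget / cost_rew
        err = total_error(n_dyn, n_rew, **params)
        if best is None or err < best[0]:
            best = (err, frac, n_dyn, n_rew)
    return best


# Equal per-sample costs; the reward exponent (0.6) is larger than the
# dynamics exponent (0.3), i.e. the reward model is assumed to learn faster.
err, frac, n_dyn, n_rew = best_split(budget=1e6, cost_dyn=1.0, cost_rew=1.0)
even_err = total_error(5e5, 5e5)

print(f"fraction of budget on transitions: {frac:.3f}")
print(f"error at best split: {err:.5f} vs even split: {even_err:.5f}")
```

Under these assumed exponents the search places most of the budget on transitions, which matches the asymmetric allocation described above. Raising cost_rew relative to cost_dyn shows how label price changes the split.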


