Pre-training under infinite compute

Beyond Scaling Laws: Pre-training in a Data-Scarce, Compute-Rich Future

Sep 23, 2025

Authors: Konwoo Kim, Suhas Kotha, Percy Liang, Tatsunori Hashimoto
Paper: https://arxiv.org/abs/2509.14786
Code: https://github.com/marin-community/marin/tree/suhas/data-efficiency

TL;DR

WHAT was done? This paper re-evaluates language model pre-training for a future where compute is effectively unlimited but high-quality data is scarce. The authors show that the standard recipes (scaling parameter count or the number of epochs) inevitably overfit in this regime. They propose evaluating methods by the asymptote of their scaling laws, that is, the best performance a method can reach with infinite compute. Through extensive experiments, they find that two classical techniques, when properly tuned, yield better asymptotes: (1) aggressive regularization, with weight decay up to 30x higher than standard practice, which makes loss decrease monotonically with model size, and (2) ensembling independently trained models, which reaches a lower loss asymptote than scaling a single model. Their best "joint scaling recipe" combines both and achieves a 5.17x data efficiency improvement. Finally, they show these gains can be distilled into smaller, inference-efficient models, and that self-distillation, surprisingly, lets a model outperform its identical teacher.
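To make the asymptote idea concrete, here is a minimal sketch of how one might estimate it. It assumes losses follow a saturating power law L(N) = A * N^(-alpha) + E, where N is whatever is being scaled and E is the asymptote. That functional form, the function name `fit_power_law_with_asymptote`, and the synthetic numbers are illustrative assumptions, not the paper's code or results.

```python
import math

def fit_power_law_with_asymptote(sizes, losses, grid_step=0.01):
    """Fit L(N) = A * N**(-alpha) + E by grid search over E.

    For a fixed candidate E, log(L - E) is linear in log(N), so A and
    alpha follow from ordinary least squares. Returns (A, alpha, E).
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    best = None
    # Candidate asymptotes must stay strictly below the smallest observed loss.
    for i in range(int(min(losses) / grid_step)):
        e = i * grid_step
        ys = [math.log(l - e) for l in losses]
        mean_y = sum(ys) / len(ys)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        intercept = mean_y - slope * mean_x
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), -slope, e)
    _, a, alpha, e = best
    return a, alpha, e

# Synthetic, noise-free data (made up for illustration): A=2.0, alpha=0.5, E=3.0.
sizes = [1, 2, 4, 8, 16]
losses = [2.0 * n ** -0.5 + 3.0 for n in sizes]

A, alpha, E = fit_power_law_with_asymptote(sizes, losses)
print(f"estimated asymptote E = {E:.2f}, exponent alpha = {alpha:.2f}")
```

Under this framing, two recipes are compared by their fitted E rather than by their loss at any one finite scale. Real runs are noisy, so in practice a proper nonlinear fit with uncertainty estimates would be preferable to this grid search.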

WHY it matters? This work provides a crucial roadmap for the next phase of LLM development, where progress will be constrained by data, not compute. It challenges the "bigger-is-better" philosophy for single models, demonstrating that algorithmic improvements can unlock significant performance and data efficiency. The proposed asymptote-based evaluation offers a more robust metric for comparing training strategies in a compute-abundant world. The findings suggest that investing compute in training ensembles and then distilling them into smaller models is a more effective strategy than training a single monolithic model. The surprising success of self-distillation also provides a powerful, data-driven method for synthetic data augmentation, offering a path to better leverage existing datasets.
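One general reason ensembling can help is Jensen's inequality: the log-loss of an averaged prediction is never worse than the average log-loss of the members. The sketch below illustrates this on a single token with three hypothetical models. It assumes the ensemble averages probabilities, which may differ from the paper's exact procedure, and the numbers are invented.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under one distribution."""
    return -math.log(probs[target])

def ensemble_average(member_probs):
    """Average the members' probability vectors elementwise (assumed scheme)."""
    k = len(member_probs)
    vocab_size = len(member_probs[0])
    return [sum(p[j] for p in member_probs) / k for j in range(vocab_size)]

# Hypothetical next-token distributions from three independently trained models.
members = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.55, 0.15, 0.30],
]
target = 0  # index of the true next token

individual_losses = [cross_entropy(p, target) for p in members]
mean_individual_loss = sum(individual_losses) / len(individual_losses)
ensemble_loss = cross_entropy(ensemble_average(members), target)

print(f"mean member loss: {mean_individual_loss:.4f}")
print(f"ensemble loss:    {ensemble_loss:.4f}")
```

The inequality only guarantees that the ensemble beats the average member, not the best one on every token. The paper's stronger claim, that ensembles reach a lower asymptote than a single scaled model, is an empirical finding that this toy example does not establish.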

Details

The Dawn of a New Pre-training Era

The landscape of large language model (LLM) development has long been governed by scaling laws that balance compute, data, and model size. However, we are rapidly approaching a paradigm shift. As the growth of available compute continues to vastly outpace the creation of new, high-quality web text, the fundamental constraint on AI progress is shifting from computation to data. This paper from Stanford University asks a timely and critical question: "How should one approach pre-training when constrained by data and unconstrained by compute?"

The authors' investigation reveals that the standard playbook is flawed and that revisiting classical machine learning techniques through this new lens unlocks substantial gains in data efficiency and ultimate model performance.
