Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun
Paper: https://www.arxiv.org/abs/2507.10524
Code: https://github.com/raymin0223/mixture_of_recursions
TL;DR
What was done? The authors introduce Mixture-of-Recursions (MoR), a novel Transformer framework that unifies two major efficiency paradigms: parameter sharing and adaptive computation. MoR reuses a shared block of layers across multiple recursion steps for parameter efficiency. Crucially, it employs lightweight routers to dynamically assign a different number of recursion steps (i.e., computational depth) to each individual token based on its complexity. This adaptive "thinking" is complemented by integrated, memory-efficient Key-Value (KV) caching strategies that selectively cache or share KV pairs based on the routing decisions, reducing memory footprint and prefill latency.
Why it matters? MoR establishes a new Pareto frontier for language model efficiency. By intelligently allocating computation, it achieves superior or comparable performance to much larger vanilla models while using significantly fewer unique parameters and less training compute. For example, an MoR model with 167M parameters outperforms a 315M-parameter vanilla baseline when trained with the same FLOPs budget. This work demonstrates that large-model quality is attainable without incurring large-model costs, making advanced AI more accessible for training and deployment. Furthermore, the adaptive depth mechanism provides a structural basis for "latent reasoning," paving the way for more intrinsically efficient and capable models.
Details
The pursuit of more capable Large Language Models (LLMs) has long been synonymous with scaling up. But this paradigm comes with prohibitive costs, limiting access to frontier models. To address these costs, research into efficient LLMs has largely followed two parallel tracks, as detailed in surveys like Efficient Transformers: A Survey (https://arxiv.org/abs/2009.06732): parameter sharing, which reduces model size by reusing weights, and adaptive computation, which allocates compute only where needed (see my series on adaptive computation time in neural networks). The recent "Mixture-of-Depths" paper (https://arxiv.org/abs/2404.02258) showed the power of the latter. Now, a new paper introduces Mixture-of-Recursions (MoR), a compelling framework that merges these two tracks into a single, cohesive architecture, demonstrating that we can have the best of both worlds.
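To make the idea concrete, here is a minimal PyTorch sketch of the core mechanism summarized in the TL;DR: a single shared Transformer block applied recursively, with a lightweight linear router assigning each token its own recursion depth. This is my own simplified illustration, not the authors' implementation; the class names (`MixtureOfRecursions`, `SharedRecursionBlock`), the greedy argmax routing, and the dense recomputation with masking are assumptions made for clarity, and the paper's trainable routing objectives and recursion-aware KV caching strategies are omitted.

```python
import torch
import torch.nn as nn

class SharedRecursionBlock(nn.Module):
    """One Transformer block whose weights are reused at every recursion step."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal self-attention (mask out future positions) + MLP, pre-norm residuals.
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

class MixtureOfRecursions(nn.Module):
    """Hypothetical token-choice-style sketch: a linear router assigns each token a
    recursion depth in {1, ..., max_recursions}; at step r, only tokens whose
    assigned depth exceeds r are passed through the shared block again."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, max_recursions: int = 3):
        super().__init__()
        self.block = SharedRecursionBlock(d_model, n_heads)
        self.router = nn.Linear(d_model, max_recursions)  # logits over depths 1..R
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token hidden states.
        depth_logits = self.router(x)              # (B, T, R)
        depths = depth_logits.argmax(dim=-1) + 1   # per-token depth in 1..R
        # NOTE: argmax routing is non-differentiable; the paper trains routers with
        # differentiable schemes, which this sketch does not reproduce.
        for r in range(self.max_recursions):
            active = depths > r                    # tokens that still recurse
            if not active.any():
                break
            # For simplicity, recompute the full block and keep updates only for
            # active tokens; the real method restricts computation (and KV caching)
            # to active tokens, which is where the savings come from.
            updated = self.block(x)
            x = torch.where(active.unsqueeze(-1), updated, x)
        return x

# Usage: route a batch of token embeddings through adaptive recursion depths.
model = MixtureOfRecursions()
tokens = torch.randn(2, 16, 256)
print(model(tokens).shape)  # torch.Size([2, 16, 256])
```

In the actual model, tokens that stop recursing early are dropped from deeper recursion steps, and their KV pairs are selectively cached or shared according to the routing decisions, which is what delivers the memory and prefill-latency gains described above.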