Tapered Language Models
Authors: Reza Bayat, Ali Behrouz, Aaron Courville
Paper: https://arxiv.org/abs/2606.23670
Code: N/A
Model: N/A
TL;DR
WHAT was done? The authors introduce Tapered Language Models (TLMs), a simple, architecture-agnostic design principle that challenges the industry-standard uniform-depth parameter allocation. Under a strictly preserved parameter and FLOP budget, the system monotonically decays the feed-forward network (FFN/MLP) intermediate dimension across layers from input to output using a smooth half-cosine schedule. This results in front-loading representational capacity where feature extraction demands it most, while slimming down later layers that primarily refine existing representations.
WHY it matters? This work exposes a massive, zero-overhead design lever that has been hidden in plain sight since the original Transformer. By simply reallocating existing parameters along the depth axis, researchers and developers can systematically lower validation perplexity and elevate downstream commonsense reasoning capabilities across multiple distinct model families and scales. It offers a “free lunch” performance upgrade without adding training or inference latencies, parameter bloat, or complex dynamic routing schemes.
Another recent example of a transformer with irregular width, the ><former, was here.
Details
The Homogeneity Bottleneck
For nearly a decade, the field of deep learning has operated under an unexamined default inherited from Attention Is All You Need: the homogeneous stack of identical blocks. Whether scaling standard transformers, recurrent-memory systems, or state-space models, the parameter allocation remains strictly uniform across depth. Every layer receives the exact same capacity regardless of its position in the network.
This architectural symmetry directly conflicts with a mounting body of evidence from mechanistic interpretability and post-training analysis. Early-exit architectures, layer-pruning techniques like The Unreasonable Ineffectiveness of the Deeper Layers, and studies such as The Remarkable Robustness of LLMs have repeatedly shown that deeper layers are highly redundant. This paper acts upon this profound structural asymmetry, establishing a clean, parameter-preserving method to redistribute capacity from the later refining stages of the network to the computationally demanding early layers.


