ArXivIQ

Group Representational Position Encoding

Unifying RoPE, ALiBi, and FoX.

Jan 18, 2026

Authors: Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Paper: https://arxiv.org/abs/2512.07805
Code: https://github.com/model-architectures/GRAPE

TL;DR

WHAT was done? The authors introduce GRAPE (Group Representational Position Encoding), a unified framework that derives positional encodings from group actions. By formalizing positions as elements of a Lie group acting on the token representation space, GRAPE unifies two distinct families: multiplicative rotations (recovering RoPE via SO(d)) and additive biases (recovering ALiBi and Forgetting Transformer via unipotent actions in GL(d+k)).
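To make the multiplicative side concrete, here is a minimal NumPy sketch (illustrative, not the paper's code; `rope_matrix` and all names here are hypothetical) of RoPE as a group action: each position maps to a block-diagonal rotation in SO(d), positions compose additively, and the attention logit therefore depends only on the relative offset.

```python
import numpy as np

def rope_matrix(pos, d, base=10000.0):
    """Block-diagonal SO(d) rotation for position `pos` (standard RoPE)."""
    assert d % 2 == 0
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D plane
    R = np.zeros((d, d))
    for i, w in enumerate(freqs):
        c, s = np.cos(pos * w), np.sin(pos * w)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d, (m, n) = 8, (11, 4)
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Group homomorphism: positions compose additively under the action ...
assert np.allclose(rope_matrix(m, d) @ rope_matrix(n, d), rope_matrix(m + n, d))
# ... so the logit <R_m q, R_n k> depends only on the relative offset n - m.
assert np.isclose((rope_matrix(m, d) @ q) @ (rope_matrix(n, d) @ k),
                  q @ (rope_matrix(n - m, d) @ k))
```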

WHY does it matter? This work moves positional encoding from heuristic design to rigorous algebraic structure. It shows that widely used methods like RoPE and ALiBi are special cases of a broader generator formulation. Crucially, it introduces efficient closed-form matrix exponentials for learnable subspaces (allowing non-commuting rotations) and proves that “forgetting” mechanisms in long-context modeling are mathematically equivalent to additive group actions, offering a principled path for designing next-generation context-aware architectures.
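The paper's learnable parametrization is not reproduced here, but the flavor of a closed-form matrix exponential can be shown with the classical rank-2 case: a skew-symmetric generator B = u vᵀ − v uᵀ built from an orthonormal pair (u, v) satisfies B² = −(u uᵀ + v vᵀ) on span{u, v}, so the exponential series collapses to a Rodrigues-style formula and no general-purpose `expm` is needed. A hedged sketch (`expm_rank2_skew` is a hypothetical helper; SciPy appears only as a reference check):

```python
import numpy as np
from scipy.linalg import expm  # reference implementation, for checking only

def expm_rank2_skew(t, u, v):
    """Closed-form exp(t * (u v^T - v u^T)) for an orthonormal pair (u, v).

    Rotates by angle t in the plane spanned by u and v, identity elsewhere.
    """
    plane = np.outer(u, u) + np.outer(v, v)  # projector onto span{u, v}
    B = np.outer(u, v) - np.outer(v, u)      # rank-2 skew generator
    return np.eye(len(u)) + np.sin(t) * B + (np.cos(t) - 1.0) * plane

rng = np.random.default_rng(0)
d = 16
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v -= (v @ u) * u; v /= np.linalg.norm(v)

B = np.outer(u, v) - np.outer(v, u)
assert np.allclose(expm_rank2_skew(0.7, u, v), expm(0.7 * B))
```

Two such generators with different planes generally do not commute, which is the sense in which learnable subspaces yield non-commuting rotations while each exponential stays cheap.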

Details

The Geometry of Sequence Modeling

In the current landscape of Large Language Models, positional encoding (PE) is split between two design families. On one side, multiplicative mechanisms like RoPE treat position as a rotation in the complex plane, preserving norms and offering stability. On the other, additive mechanisms like ALiBi inject linear distance-based biases directly into the attention logits to improve length extrapolation. More recently, dynamic “forgetting” mechanisms like FoX apply data-dependent decay to attention scores to manage very long contexts. The bottleneck is that these methods are treated as distinct engineering heuristics rather than as derivations of a single underlying principle. This fragmentation makes it hard to design architectures that combine the benefits of both families (the orthogonality of RoPE and the decay behavior of additive biases) without relying on ad-hoc combinations.
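To see how an additive bias can be a group action at all, consider the following numeric sketch (dimensions, slope, and the specific embedding are illustrative choices, not the paper's exact GL(d+k) construction): queries and keys are lifted into d+2 dimensions and positions act through a unipotent matrix I + tG with nilpotent G; the augmented inner product then reproduces the causal ALiBi logit qᵀk − λ(m − n) exactly.

```python
import numpy as np

d, lam = 8, 0.25                 # head dim and ALiBi slope (illustrative)
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)
m, n = 9, 4                      # query and key positions (n <= m, causal)

# Nilpotent generator: shears coordinate d by lam times coordinate d+1.
# G @ G == 0, so exp(t * G) = I + t * G exactly (a unipotent matrix).
G = np.zeros((d + 2, d + 2))
G[d, d + 1] = lam
U = lambda t: np.eye(d + 2) + t * G

q_aug = np.concatenate([q, [1.0, 0.0]])  # extra coords receive the bias
k_aug = np.concatenate([k, [0.0, 1.0]])  # extra coords emit the position

q_pos = U(-m).T @ q_aug                  # query at position m
k_pos = U(n) @ k_aug                     # key at position n

# q_pos . k_pos = q_aug^T U(n - m) k_aug: relative, and equal to ALiBi.
assert np.isclose(q_pos @ k_pos, q @ k - lam * (m - n))
```

Per the paper's claim summarized in the TL;DR, FoX-style forgetting fits the same additive picture, with the fixed slope replaced by learned, data-dependent decay.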
