ArXivIQ

Learning without training: The implicit dynamics of in-context learning

Aug 01, 2025

Authors: Benoit Dherin, Hanna Mazzawi, Michael Wunder, Michael Munn, Javier Gonzalvo
Paper: https://arxiv.org/abs/2507.16003

TL;DR

WHAT was done? The authors propose a theoretical framework that explains in-context learning (ICL) as an implicit, on-the-fly weight modification process. They introduce the concept of a "contextual block" (a generalization of a transformer block) and demonstrate mathematically that the context provided in a prompt is implicitly transformed into a low-rank update to the weight matrix of the subsequent MLP layer. This means the model isn't just retrieving information; it's dynamically re-parameterizing itself during inference. The paper provides an explicit formula for this rank-1 weight update and shows that the sequential processing of context tokens resembles a stochastic gradient descent optimization on the MLP weights.
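
To make the core identity concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper): the shift the context induces in the attention output is absorbed into an outer-product correction to the MLP's first weight matrix, so that running the original weights on the context-aware activation matches running the updated weights on the context-free one. The vectors `a_x` and `a_cx`, standing in for the attention output without and with context, and the exact form of `dW` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 32
W = rng.normal(size=(d_out, d_in))          # first weight matrix of the MLP

# Stand-ins for the attention output of the query token
# without context, A(x), and with the context C prepended, A(C, x).
a_x = rng.normal(size=d_in)
a_cx = a_x + rng.normal(scale=0.1, size=d_in)

delta = a_cx - a_x                          # shift induced by the context

# Rank-1 (outer-product) update implied by the context.
dW = np.outer(W @ delta, a_x) / (a_x @ a_x)

def mlp(weights, h):
    # Toy MLP layer; the identity survives the nonlinearity because the
    # update already matches the pre-activations exactly.
    return np.maximum(weights @ h, 0.0)

# Original weights on the context-aware activation
# == updated weights on the context-free activation.
assert np.allclose(mlp(W, a_cx), mlp(W + dW, a_x))
print(np.linalg.matrix_rank(dW))            # -> 1
```

The update is rank 1 because it is an outer product of two vectors; feeding the context tokens in one by one produces a sequence of such updates, which is the process the paper compares to stochastic gradient descent on the MLP weights.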

WHY it matters? This work provides a compelling and more general mechanistic explanation for the "magic" of ICL, moving it from a black-box emergent property to a quantifiable, architectural dynamic. Unlike prior work that often relied on simplified toy models (e.g., linear attention), this framework is designed to be closer to real-world transformers. It creates a powerful link between ICL and parameter-efficient fine-tuning (PEFT) methods like LoRA, suggesting that ICL might be an implicit form of the same low-rank adaptation. This insight offers a new lens for interpretability, could lead to more principled prompt engineering, and provides a foundational theory for how models "learn without training."
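
For a side-by-side view (my own juxtaposition, using the illustrative notation from the sketch above rather than the paper's exact symbols, with U and V standing for LoRA's usual low-rank factors to avoid clashing with the attention output A(x)): LoRA adds an explicitly trained low-rank factor to a weight matrix, whereas the update implied by the context is a rank-1 outer product computed on the fly from the prompt, with no gradient step:

$$
\Delta W_{\text{LoRA}} = U V^{\top} \ \ (\text{trained},\ \operatorname{rank} \le r), \qquad
\Delta W(C) = \frac{\big(W\,\delta(C,x)\big)\,A(x)^{\top}}{\lVert A(x)\rVert^{2}} \ \ (\text{implicit},\ \operatorname{rank} 1),
$$

where $\delta(C,x) = A(C,x) - A(x)$ is the shift the context $C$ induces in the attention output for the query $x$.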

Details

Large Language Models (LLMs) possess a remarkable ability to learn new tasks from a few examples provided in the prompt—a phenomenon known as in-context learning (ICL). This capability, which emerges at inference time without any explicit weight updates, has been a central mystery in AI. Researchers have long conjectured that LLMs are performing some kind of implicit optimization, and a significant line of theoretical work has sought to show that this happens via implicit gradient descent. However, these foundational studies, such as those by von Oswald et al. (https://proceedings.mlr.press/v202/von-oswald23a.html, which has a couple of sequels) and Akyürek et al. (https://arxiv.org/abs/2211.15661), often had to rely on simplified models with narrow assumptions, such as linear self-attention, to make the mathematics tractable. This paper offers a more general explanation, proposing that ICL is a direct consequence of how transformer blocks process contextual information.
