
Less is More: Recursive Reasoning with Tiny Networks

In recent years much of the spotlight in AI has been on big models: large language models (LLMs) with billions or even trillions of parameters, trained on massive data sets, achieving superhuman performance in many tasks. But the question remains: is scale everything? Can thoughtful architecture and training design yield strong performance even with far fewer parameters?

That is precisely the question the paper Less is More: Recursive Reasoning with Tiny Networks (Jolicoeur-Martineau, 2025) addresses. It shows that a very small neural network, only ~7 million parameters, can compete with or even exceed large models on challenging “reasoning” benchmarks (structured tasks like puzzles) by using a form of recursive reasoning. The key idea: you don’t necessarily need more layers or more width; instead, you can ask a small model to think again and refine its answer, rather than simply predict once.

Specifically, the work benchmarks itself on tasks such as the ARC-AGI benchmark (the Abstraction and Reasoning Corpus, designed to probe general, fluid reasoning), as well as Sudoku and Maze tasks, which are deliberately chosen because they require pattern recognition and structured reasoning rather than plain language modelling. The results are noteworthy.

The paper thus delivers a compelling message: less can be more, if the architecture encourages iterative refinement, self-correction, and generalization rather than straight memorisation.

[Figure: TRM architecture overview]

The conceptual core: recursive refinement

At a conceptual level, the paper contrasts two styles of reasoning architectures: producing the answer in a single forward pass, versus recursive refinement, in which the model maintains a latent reasoning state $z_t$ alongside a current answer $y_t$ and repeatedly updates both:

$$z_{t+1} = f(z_t, y_t, x), \qquad y_{t+1} = g(z_{t+1}, y_t)$$

After $T$ iterations we output $y_T$. Training uses supervision at multiple intermediate steps (deep supervision), so that each intermediate $y_t$ is encouraged to be closer to the target. This lets the network learn how to refine its answer rather than simply what the answer is.
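The control flow fits in a few lines. Below is an illustrative Python sketch, not the authors' code: f, g, and loss_fn stand in for the learned update functions and the per-step loss.

```python
def recursive_refine(x, y, z, f, g, loss_fn, y_star, T):
    """Illustrative sketch of recursive refinement with deep supervision.

    f updates the latent reasoning state, g updates the answer.
    A loss is accumulated at every intermediate step, not just the last one.
    """
    total_loss = 0.0
    for _ in range(T):
        z = f(z, y, x)                    # refine the latent reasoning state
        y = g(z, y)                       # refine the current answer
        total_loss += loss_fn(y, y_star)  # supervise this intermediate answer
    return y, total_loss / T
```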

The paper builds on earlier work called the Hierarchical Reasoning Model (HRM), which used two small networks recursing at different frequencies (a “low-level” module and a “high-level” module) in a biologically-inspired dual-timing loop.

But the novel contribution here is the Tiny Recursive Model (TRM), which strips the architecture down to a single network (two layers) that simply loops, refining its latent reasoning state and the current output. No dual modules, no fixed-point approximations, no separation of frequencies. It is elegantly minimalist, and yet the experiments show this suffices (and even outperforms more complex designs).

Mathematical formulation

Inputs and variables. Here $x$ denotes the (embedded) input, e.g. the puzzle grid; $y_t$ the current answer after $t$ refinement steps; $z_t$ a latent reasoning state carried across steps; and $\theta$ the shared parameters of the tiny network.

Update rules. For $t = 0, 1, \dots, T-1$:

$$z_{t+1} = f(z_t, y_t, x; \theta), \qquad y_{t+1} = g(z_{t+1}, y_t; \theta)$$

Here $f$ and $g$ are parameterized (the same small network may implement both roles). In code, the authors simplify by combining them into one “tiny network” that takes $(x, y_t, z_t)$ and outputs a refined $(z_{t+1}, y_{t+1})$.
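As a sketch of what such a combined network might look like, here is a hedged PyTorch illustration, not the authors' implementation: the flat feature vectors, dimensions, and the simple MLP block are assumptions made for brevity (the real model operates on grid/token representations).

```python
import torch
import torch.nn as nn

class TinyRecursiveNet(nn.Module):
    """Illustrative sketch: one small shared network that refines (z, y) given x."""

    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        # A single 2-layer block, reused at every recursion step.
        # Not the paper's exact architecture.
        self.block = nn.Sequential(
            nn.Linear(3 * dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 2 * dim),
        )

    def forward(self, x, y, z):
        # Concatenate input, current answer, and latent state, then split
        # the output into the refined latent state and refined answer.
        h = self.block(torch.cat([x, y, z], dim=-1))
        z_new, y_new = h.chunk(2, dim=-1)
        return z_new, y_new
```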

Loss function (deep supervision). Rather than enforcing a loss only on the final output $y_T$, training pushes every intermediate $y_t$ towards the target $y^\star$. For example:

$$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{CE}(y_t, y^\star) + \lambda\, \mathcal{L}_{\rm halt}(q_t)$$

where $\mathrm{CE}(\cdot,\cdot)$ is the cross-entropy loss, and $\mathcal{L}_{\rm halt}$ is a small additional term for a halting head: a learned scalar logit $q_t$ at each iteration that predicts “should I stop refining?” This lets the model skip further loops if a correct answer is already reached.
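One way such an objective could be computed is sketched below. This is again an illustration under assumptions: decode_logits, halt_head, and the choice of halting target (whether the current prediction is already correct) are not necessarily the paper's exact recipe.

```python
import torch.nn.functional as F

def deep_supervision_loss(net, decode_logits, halt_head, x, y, z, y_star, T, lam=0.5):
    """Illustrative sketch: average CE over all intermediate answers plus a halting term.

    decode_logits maps the answer state y to per-cell class logits;
    halt_head maps the (pooled) latent state z to a single "stop refining?" logit.
    """
    ce_total, halt_total = 0.0, 0.0
    for _ in range(T):
        z, y = net(x, y, z)
        logits = decode_logits(y)                       # (batch, cells, num_classes)
        ce_total += F.cross_entropy(logits.flatten(0, 1), y_star.flatten())
        # Assumed halting target: 1 if the current prediction is already correct.
        correct = (logits.argmax(-1) == y_star).all(dim=-1).float()   # (batch,)
        q = halt_head(z.mean(dim=1)).squeeze(-1)                      # (batch,)
        halt_total += F.binary_cross_entropy_with_logits(q, correct)
    return (ce_total + lam * halt_total) / T
```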

Architecture choices

Recursion vs. parameter-expansion trade-off. A key mathematical insight: recursion allows a small network to behave like a much deeper one, effectively “unrolling” many layers in time rather than stacking them in space. For example, iterating a 2-layer network $T$ times is analogous to a $2T$-layer deep network, except that each “layer” shares parameters (the same small net is reused) and training enforces intermediate correctness. This parameter sharing reduces the risk of overfitting and allows training on far fewer examples.
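The effect of sharing is easy to quantify: iterating one small block $T$ times adds effective depth without adding parameters, while stacking $2T$ distinct layers multiplies the parameter count. A quick illustrative comparison, assuming simple linear blocks rather than the paper's architecture:

```python
import torch.nn as nn

dim, T = 512, 8

# One shared 2-layer block, reused T times (depth in time).
shared_block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

# An unrolled 2T-layer network with separate parameters per layer (depth in space).
unrolled = nn.Sequential(*[
    layer for _ in range(T)
    for layer in (nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
])

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"shared block (applied {T}x): {count(shared_block):,} parameters")
print(f"unrolled {2*T}-layer network: {count(unrolled):,} parameters")
# The shared block reaches comparable effective depth with roughly 1/T the parameters.
```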

Experimental highlights

With only about 7 million parameters, TRM is reported to match or exceed far larger models on the ARC-AGI, Sudoku, and Maze benchmarks described above, and to outperform the more complex dual-network HRM despite using a single, smaller network.

Conclusion

The Tiny Recursive Model transforms reasoning into an iterative, self-corrective process. Instead of producing a single answer, the network repeatedly refines its prediction, fixing earlier errors and improving with each pass. Because the same small network is reused at every step, it achieves strong generalization with very few parameters, avoiding overfitting on limited data. Deep supervision teaches the model not only to find the correct answer but also to learn how to reach it, while recursion provides effective depth without the cost of a larger architecture. The key insight is that simplicity, in the form of a single small recursive loop, is more effective than complex, multi-network designs.

Still, the approach remains confined to structured, grid-based problems like Sudoku or ARC. Its scalability to open-ended reasoning or larger datasets is untested, and recursive training increases memory demands. The halting mechanism and theoretical foundations also need refinement. Comparisons with large language models should be interpreted cautiously, as they address different tasks.

This line of work suggests that future AI systems could emphasize algorithmic structure over size. Recursive refinement and parameter sharing may lead to lightweight, deployable reasoning models and inspire new ways for large models to “think iteratively” rather than only forward once.

