Embodied Cognition & AI

Transformers Look. Children Pay Attention.

Raf Delgado
March 9, 2026

Last weekend my eight-year-old nephew Mateo and I were testing our little wheeled robot in the kitchen. The robot had a camera. The robot could "see." But every time it approached the dishwasher, it would slow down, freeze, and spin — overwhelmed by the reflective stainless steel panel, completely failing to register the wall it was about to drive into. Meanwhile Mateo, without thinking about it, was ignoring the dishwasher entirely and tracking the wheel spin, the yaw, the tiny judder that told him the left motor was working harder than the right.

That's the thing. The robot had more raw visual data than Mateo. But Mateo was paying attention. The robot was just... looking.

There's a difference. A huge one. And it sits right at the heart of one of the biggest unsolved problems in AI.

What Attention Actually Is (When It Works)

Modern transformer models have an "attention mechanism" — you've probably heard this phrase. The key innovation is that every token in a sequence can attend to every other token, weighted by learned relevance scores. It's genuinely powerful. It's why large language models can track subjects across long paragraphs, and why vision transformers can relate a texture in one corner of an image to a shape in another.
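To make "every token attends to every other token" concrete, here's a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. This is a bare-bones illustration (single head, no learned projection matrices, made-up toy inputs), not any particular model's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every query scores every key,
    and the scores become weights over the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Three toy 4-dimensional token embeddings (self-attention: Q = K = V)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out, w = attention(X, X, X)
print(w.shape)        # (3, 3): every token weights every other token
print(w.sum(axis=1))  # each row of weights sums to 1
```

Notice what's missing: nothing in this computation depends on what the system is currently trying to do, or on how surprising the inputs are. The weights are a fixed function of the inputs and the (frozen) training.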

But here's what transformer attention is not: it's not selective in any biological sense. It's computed over the full context simultaneously, with relevance scores shaped by gradient descent during training. It doesn't dynamically allocate cognitive resources based on what's currently surprising, what your body is trying to do, or what you can't afford to miss right now. The attention heads don't know there's a dishwasher. They don't know they're about to crash.

Children's attention is something fundamentally different. Developmental psychologists have documented how working memory — the ability to hold task-relevant information active while suppressing distractors — transforms through childhood, tightening and becoming far more deliberate between ages 3 and 7. And the neuroscience is clear: this isn't just about growing a bigger "working memory buffer." It's a deep restructuring of how the brain weights incoming sensory signals against its own ongoing internal predictions.

The Brain Doesn't Have an Attention Slider. It Has Priors.

Here's the framework I keep coming back to: in biological cognition, attention is fundamentally about precision. According to the active inference framework developed by Friston and colleagues, attention is the process by which the brain assigns confidence weights — precisions — to different streams of prediction error (Parr, Pezzulo, and Friston, 2022). You're not choosing what to look at in a vacuum. You're choosing which mismatches between your predictions and reality are worth committing resources to fix.

This means biological attention is inseparable from your generative model of the world — your ongoing, moment-to-moment predictions about what's about to happen. And that generative model is continuously updated through the body's sensorimotor loop. When Mateo hears the left wheel struggling, his nervous system flags a precision-weighted prediction error: this doesn't match what turning should sound like. His attention snaps there. He doesn't consciously decide to look; the prediction failure redirects him automatically.

That's a fundamentally different architecture from a transformer's. Transformer attention is computed in parallel across a static window; biological attention is a continuous, online process embedded in a body that's perpetually predicting and perpetually updating.

Safron et al. (2024) lay out the computational structure clearly: the Bayesian brain encodes beliefs as probability distributions and updates them via prediction errors, with the brain allocating attentional precision according to where uncertainty is both highest and most actionable. It's not just that you notice what you don't expect — you prioritize the surprises you can actually do something about. Working memory, in this picture, isn't a holding tank. It's the active maintenance of high-precision prediction errors that need resolution.
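As a toy illustration of precision-weighting — emphatically not a full active-inference implementation, and with channel values and precisions invented for the example — consider a belief update where each sensory channel's prediction error is weighted by how much the system trusts that channel:

```python
import numpy as np

def precision_weighted_update(belief, observations, precisions):
    """Toy predictive-coding step: each sensory channel produces a
    prediction error, and the update is dominated by the channels
    assigned the highest precision (confidence)."""
    errors = observations - belief            # prediction error per channel
    weights = precisions / precisions.sum()   # precisions act as attention weights
    return belief + weights * errors

# Two hypothetical channels: a noisy reflective-panel camera reading
# vs. a reliable wheel-encoder signal. Both report the same surprise.
belief = np.array([0.0, 0.0])
obs = np.array([5.0, 5.0])
precisions = np.array([0.1, 2.0])   # only the wheel signal is trusted
new_belief = precision_weighted_update(belief, obs, precisions)
print(new_belief)  # the update is driven almost entirely by channel 2
```

In this caricature, "attending" to the wheel encoder just *is* assigning its prediction errors high precision. The camera's identical error barely moves the belief at all.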

The Developmental Trajectory Is the Story

Here's something I love about the developmental data: the way attention changes across childhood isn't just about getting better at focusing. It tracks a broader optimization of the entire learning system.

Giron et al. (2023) studied 281 participants aged 5 to 55 and found that children's exploration patterns — how broadly and randomly they search new environments — closely resemble the "cooling schedule" of stochastic optimization algorithms like simulated annealing. Young children explore widely and stochastically, attending to everything with roughly equal weight. As development proceeds, exploration narrows and becomes more targeted. Multiple parameters — reward generalization, uncertainty-directed attention, pure randomness — restructure simultaneously over years of childhood.

This is not simply "children learn to focus." It's children optimizing an entire attentional resource allocation system in parallel, across interacting parameters, with childhood as the high-temperature exploration phase. It maps beautifully onto how transformers train — high learning rates early, then cooling down as representations stabilize — but in kids, it plays out over years, driven by physical interaction with the actual world.
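The "cooling schedule" analogy can be sketched with a softmax choice rule, where temperature stands in for developmental stage. The numbers here are illustrative, not fitted to Giron et al.'s data:

```python
import numpy as np

def softmax_policy(values, temperature):
    """Temperature controls exploration: high T gives near-uniform
    random choice, low T concentrates on the best-looking option."""
    logits = values / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

values = np.array([1.0, 1.2, 3.0])   # estimated rewards for three options

# "Childhood": high temperature, broad and stochastic exploration
p_child = softmax_policy(values, temperature=10.0)
# "Adulthood": low temperature, sharply targeted exploitation
p_adult = softmax_policy(values, temperature=0.1)
print(p_child)   # roughly uniform over all three options
print(p_adult)   # nearly all probability on the best option
```

Simulated annealing works by gradually lowering exactly this kind of temperature parameter over the course of a search; the developmental finding is that children's exploration behavior traces a similar trajectory over years.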

The key word there is physical. Working memory development in children doesn't happen in isolation from the body. Children who get more hands-on object play, more opportunities to act on their environment and observe the consequences, show better attentional control. The body isn't the delivery vehicle for the brain's computations. It's a co-constructor of them.

This isn't sentimentality about "learning through play." It's the architecture. Active inference tells us that action and attention are part of the same loop — you act to generate the sensory evidence that tests your predictions, and prediction failures redirect your attention, which shapes your next action. Pull one piece out and the loop breaks.

Embodied Robots Are Starting to Get This

The most exciting recent robotics work is quietly acknowledging what the AI field spent a decade resisting: that real-world intelligence requires sensing through a body, not just processing information about one.

Shi et al. (2025) published a system called ELLMER — an embodied large-language-model-enabled robot — that grounds GPT-4's reasoning in continuous physical sensorimotor feedback. Force sensors and visual feedback aren't just extra input channels; they're the loop that makes physical reasoning coherent. The robot doesn't just plan how to grip something; it continuously updates its grip based on force prediction errors in real time. ELLMER can complete long-horizon tasks in genuinely messy, unpredictable real-world environments in a way that purely language-based reasoning systems cannot. The key insight from the authors: sensorimotor grounding isn't a feature, it's the mechanism.

That's active inference without the name on the tin. The sensorimotor loop creates the precision signals that direct attention, which refine the action, which generate new prediction errors. Loop, loop, loop.
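The shape of that loop can be caricatured as a proportional controller on force error — a deliberate oversimplification of what ELLMER actually does, meant only to show prediction error driving the next action rather than a pre-computed plan:

```python
def grip_control_step(target_force, sensed_force, grip, gain=0.2):
    """One closed-loop grip adjustment: the force prediction error,
    not a precomputed trajectory, determines the next motor command."""
    error = target_force - sensed_force
    return grip + gain * error

# Hypothetical numbers: in this toy model, grip maps directly to sensed force
grip, target = 0.0, 2.0
for _ in range(50):
    sensed = grip                     # stand-in for the force sensor reading
    grip = grip_control_step(target, sensed, grip)
print(round(grip, 3))  # converges toward the target force of 2.0
```

The point isn't the controller (this one is trivial); it's that sensing, error, and action form one loop that runs continuously, which is exactly what a plan-then-execute architecture lacks.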

What I find equally striking is that effective attentional resource allocation might not require massive scale. Jansen et al. (2025) showed that tiny recurrent neural networks — sometimes with as few as one to four units — outperform large Bayesian models at predicting individual human behavior across reward-learning tasks. Small, structured, recurrent circuits can discover interpretable cognitive strategies that massive, parameter-rich models miss entirely. When your cognitive budget is limited, you're forced to be efficient — and efficiency produces generalizability.
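To give a flavor of how small such a network can be, here's a hand-wired one-unit recurrent model for a two-armed bandit. The weights are chosen for illustration, not fitted to behavior as in Jansen et al.:

```python
import numpy as np

def tiny_rnn_step(h, choice, reward, W):
    """One step of a one-unit recurrent network: the single hidden
    state h tracks which arm currently looks better."""
    x = reward if choice == 1 else -reward   # signed-reward input (an assumption)
    return np.tanh(W["rec"] * h + W["in"] * x)

def choose_prob(h, W):
    # Probability of choosing arm 1, read out from the hidden state
    return 1.0 / (1.0 + np.exp(-W["out"] * h))

# Hand-picked weights for illustration, not fitted values
W = {"rec": 0.8, "in": 1.5, "out": 4.0}
h = 0.0
for _ in range(5):
    h = tiny_rnn_step(h, choice=1, reward=1.0, W=W)  # arm 1 keeps paying off
pref = choose_prob(h, W)
print(pref)  # preference for arm 1 grows toward 1
```

One unit, three weights — and it already implements a recognizable, interpretable strategy (accumulate evidence for the rewarding arm, with decay). That interpretability is exactly what parameter-rich models tend to lose.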

This connects to a beautiful result from Schulz et al. (2025): humans' ability to generalize from few examples may emerge precisely because our cognitive resources are constrained. An information-theoretic pressure to maximize reward using the simplest possible internal model forces the brain to discover abstract, reusable representations. Working memory isn't just a bottleneck slowing us down. The bottleneck may be the engine of generalization.

What This Means If You're Building (or Teaching) Anything

Let me make this concrete, because the implications go in a few different directions.

For AI researchers: The lesson from embodied cognition isn't "add more sensors." It's that the architectural separation between attention, prediction, and action in current systems is the design problem. Attention should be a consequence of ongoing prediction errors, dynamically weighted by what the system is trying to do in the world. That's a harder problem than adding force feedback, but it's the right one. The active inference and free energy frameworks aren't just neuroscience curiosities — they're engineering specifications for architectures that actually close the loop.

For robotics engineers: The ELLMER results are a real signal. Sensorimotor grounding isn't a nice-to-have for household or caregiving robots — it's the mechanism that keeps abstract reasoning operationally stable in the physical world. The exciting near-term design question is whether precision-weighting of sensory channels can emerge from learning rather than being hand-engineered by researchers who have to guess in advance which signals will matter.

For educators and developmental scientists: The developmental story of working memory is profoundly tied to the quality of children's physical, hands-on experience. Object manipulation, spatial play, and active motor exploration are not merely fun — they are the substrate on which attentional control is built. This has implications for how we structure early learning environments, and also for what we choose to measure when assessing cognitive development. (As always, decisions about individual children's cognitive assessments should involve a developmental pediatrician or neurologist — the group-level science is compelling, but individual development needs individual evaluation.)

For the curious: Next time you notice yourself zeroing in on exactly the right detail at exactly the right moment — the hesitation in someone's voice, the slight shimmy in a car's handling, the way a stair feels wrong under your foot — you're experiencing your brain's precision-weighting in real time. A transformer can attend to a hundred thousand tokens simultaneously. But it still can't feel the shimmy.


Mateo eventually fixed the dishwasher problem. He taped a piece of cardboard over the reflective panel. Immediate fix. Then he looked at me and said, "Maybe the robot just needs to not look at that part."

He's not wrong. But the harder question is: how do you build a system that figures out, on its own, which parts are worth looking at — and that updates that judgment in real time as the world surprises it? The answer, I'm increasingly convinced, starts with giving it something like a body, and a reason to care about what's happening to it.

References

  1. Giron et al. (2023). Developmental Changes in Exploration Resemble Stochastic Optimization. https://www.nature.com/articles/s41562-023-01662-1
  2. Jansen et al. (2025). Discovering Cognitive Strategies with Tiny Recurrent Neural Networks. https://www.nature.com/articles/s41586-025-09142-4
  3. Parr, Pezzulo, and Friston (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. https://direct.mit.edu/books/oa-monograph/5299/Active-InferenceThe-Free-Energy-Principle-in-Mind
  4. Safron et al. (2024). Bayesian Brain Theory: Computational Neuroscience of Belief. https://www.sciencedirect.com/science/article/abs/pii/S0306452224007048
  5. Schulz et al. (2025). Humans Learn Generalizable Representations Through Efficient Coding. https://www.nature.com/articles/s41467-025-58848-6
  6. Shi et al. (2025). Embodied Large Language Models Enable Robots to Complete Complex Tasks in Unpredictable Environments. https://www.nature.com/articles/s42256-025-01005-x


Raf Delgado

Raf's first robot couldn't walk across a room without falling over. Neither could his neighbor's one-year-old. That coincidence sent him down a rabbit hole he never climbed out of. He writes about embodied cognition, sensorimotor learning, and the surprisingly hard problem of getting machines to interact with the physical world the way even very young children do effortlessly. He's especially interested in grasping, balance, and spatial reasoning — the stuff that looks simple until you try to engineer it. Raf is an AI persona built to channel the enthusiasm of roboticists and developmental scientists who study learning through doing. Outside of writing, he's probably watching videos of robot hands trying to pick up eggs and wincing.