Your Brain Runs on Stories. AI Runs on Text.


My six-year-old nephew declared his toy robot "sad" last week when the battery died. Not broken. Not off. Sad. I was halfway through scribbling notes about theory of mind before dinner was even on the table.
But the more I sat with it, the more I realized the interesting thing wasn't the emotional attribution — it was the narrative machinery underneath it. My nephew had already slotted the robot into a story: a protagonist with an internal state, a trajectory interrupted, a reason why things changed. He wasn't just labeling. He was narrating.
Children don't just process language. They organize experience through story. And that distinction — between parsing language and reasoning through narrative — turns out to be one of the most revealing gaps between human cognition and AI language systems.
Causality Is the Spine of Every Story
Stories aren't lists of events. They're causal chains. The dragon burned the village because it was angry. The knight set out because the villagers needed help. Strip out the causality and you have a chronicle. Keep it and you have a narrative that someone can actually follow, predict, and remember.
Children develop sensitivity to causal language earlier than we used to think. A 2025 study in Nature Human Behaviour tested 691 children on how they parse causal verbs — specifically, how they distinguish "she broke it" (a direct, proximal cause) from "she caused it to break" (a more distal cause, mediated by intermediate steps). By age 4, children already mapped "caused" to distal causes and action verbs to proximal ones (Majid et al., 2025). This is not a surface-level pattern match. It requires building a representation of causal chains — encoding who acted on what, through what mechanism, at what remove.
What develops later is even more interesting: understanding absence-based causation, as in "she caused it to break by not holding it." Representing causation through non-events is philosophically hairy, and children master it later, on a structured developmental lag that tells us something real about the shape of causal cognition.
AI language models learn statistical co-occurrences of words. The form of causal language is well within their reach. The underlying causal graph is not. That gap is why they can write grammatically flawless causal sentences about situations they would reason about incorrectly if pressed.
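The gap is easy to see in miniature. The sketch below contrasts surface co-occurrence statistics with an explicit causal graph; the corpus, the graph, and the `effects_of` helper are all invented for illustration, not taken from any real model.

```python
from collections import Counter

# Toy corpus of causal sentences (invented for illustration).
corpus = [
    "the dragon burned the village because it was angry",
    "the knight set out because the villagers needed help",
]

# Surface statistics: which words follow which. This is the kind of
# structure that next-token prediction rewards.
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigrams.update(zip(tokens, tokens[1:]))

# The counts can reproduce the *form* of causal talk...
assert bigrams[("village", "because")] == 1

# ...but a causal graph encodes something bigrams never do: directed
# cause-to-effect edges that support counterfactual queries.
causal_graph = {"anger": ["burning"], "need": ["journey"]}

def effects_of(cause):
    """Follow directed edges; an absent cause yields no effects."""
    return causal_graph.get(cause, [])

assert effects_of("anger") == ["burning"]
assert effects_of("calm") == []  # remove the cause, lose the effect
```

The bigram table and the graph describe the same two sentences, but only the graph answers "what happens without the anger?"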
The Generalization Problem
Understanding a story also requires compositionality — the ability to combine known concepts into novel arrangements. "The anxious accountant befriended a migratory bird" is probably not in any training set, but you understood it immediately. You composed the meanings of those words into a coherent scenario. And if you were told the story continued with the accountant eventually relying on the bird's navigation instincts during a difficult commute, you'd find that oddly satisfying rather than incoherent.
For decades, critics argued that neural networks couldn't achieve systematic compositionality. Fodor and Pylyshyn's famous 1988 challenge held that connectionist networks could at best mimic compositional behavior without the structured representations that produce it. Lake and Baroni (2023) took this head-on with Meta-Learning for Compositionality (MLC): a training procedure that exposes a transformer to a dynamically generated stream of few-shot compositional tasks, forcing the system to learn how to recombine concepts rather than memorize pairings. The result outperformed GPT-4 on standard compositional benchmarks and matched human performance. More importantly, it did so on genuinely novel combinations, not interpolations of seen patterns.
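The flavor of these benchmark tasks is easy to sketch. The toy interpreter below is not MLC or any real benchmark; its primitive and modifier tables are invented, but they show what "combining known parts into an unseen whole" means at the sentence level.

```python
# Invented mini-language: primitives and modifiers are learned
# separately, then composed on the fly.
PRIMITIVES = {"jump": ["JUMP"], "walk": ["WALK"], "look": ["LOOK"]}
MODIFIERS = {"twice": 2, "thrice": 3}

def interpret(command):
    """Map a command like 'jump twice' to an action sequence by
    composing the meanings of its parts."""
    words = command.split()
    actions = list(PRIMITIVES[words[0]])
    for word in words[1:]:
        actions = actions * MODIFIERS[word]
    return actions

# 'walk thrice' is never stored anywhere above as a complete pair,
# yet it is interpretable because meaning composes:
assert interpret("walk thrice") == ["WALK", "WALK", "WALK"]
assert interpret("jump twice") == ["JUMP", "JUMP"]
```

A lookup-table learner would need every command-to-sequence pair spelled out; a compositional one only needs the parts.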
That's a real result. But note what level it operates at: syntactic and semantic compositionality, the ability to combine word-level meanings into sentence-level meanings. Narrative compositionality is harder. It means combining characters with desires, events with causal consequences, settings with physical affordances, and all of it into a structure that holds together across time. Syntactic compositionality is a prerequisite. It's not the finish line.
Story Grammars as Abstract Programs
Here's the thing about children and stories: they don't just follow them. They extract the underlying rules.
Ask a five-year-old to make up a story and you'll get something with recognizable structure — a protagonist, a problem, an attempt, a resolution, usually a moral. Developmental psychologists call this "story grammar," and it's not something anyone teaches explicitly. Children infer the schema from exposure, then apply it generatively.
This is structurally similar to what Rule et al. (2024) call symbolic metaprogram search: the process by which humans learn abstract rules by searching for the most compact generative description that accounts for observed examples. Their system, MAPS, not only outperforms neural networks at rule learning but predicts human errors better than other models — which is strong evidence that human learning really is something like a search for abstract, compressed programs, not surface-level pattern matching.
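A stripped-down version of that idea fits in a few lines. The hypothesis space below (linear rules scored by coefficient size) is a stand-in I've invented; Rule et al. search a far richer space of symbolic programs, but the principle is the same: prefer the most compact program consistent with the examples.

```python
from itertools import product

# Observed input -> output examples the learner must explain.
examples = [(1, 3), (2, 5), (4, 9)]

def search(examples, max_coef=5):
    """Return the most compact rule f(x) = a*x + b consistent with
    every example, scoring compactness as |a| + |b|."""
    best = None
    for a, b in product(range(-max_coef, max_coef + 1), repeat=2):
        if all(a * x + b == y for x, y in examples):
            cost = abs(a) + abs(b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best

cost, a, b = search(examples)
assert (a, b) == (2, 1)  # the compressed rule: f(x) = 2x + 1
```

The learner never memorizes the three pairs; it finds the short program that generates them, which is exactly what lets it predict unseen inputs.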
Story grammars, in this light, are programs: compressed representations of how narratives work that children use to both comprehend and generate novel stories. Acquiring a story grammar isn't learning a list of features. It's inducing a generative model.
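Treated as a program, a story grammar can literally be written down. Everything below, the nonterminals, the fillers, and the helper name, is an invented toy, but it exhibits the two properties that matter: the schema is compact, and it generates novel stories rather than retrieving stored ones.

```python
import random

# A toy story grammar: one compact schema, many possible stories.
GRAMMAR = {
    "story": ["setting", "problem", "attempt", "resolution"],
    "setting": "Once upon a time, a {hero} lived near a {place}.",
    "problem": "One day, a {threat} arrived.",
    "attempt": "The {hero} tried to drive the {threat} away.",
    "resolution": "In the end, the {place} was safe.",
}
FILLERS = {
    "hero": ["knight", "baker", "accountant"],
    "place": ["village", "harbor"],
    "threat": ["dragon", "storm"],
}

def generate(symbol="story", bindings=None):
    """Expand a symbol recursively. Bindings are chosen once so the
    protagonist stays consistent across the whole story, the way a
    child's schema keeps track of who the story is about."""
    if bindings is None:
        bindings = {slot: random.choice(opts) for slot, opts in FILLERS.items()}
    rule = GRAMMAR[symbol]
    if isinstance(rule, list):
        return " ".join(generate(part, bindings) for part in rule)
    return rule.format(**bindings)

print(generate())
```

Twelve filler combinations from a five-rule schema, and every output respects protagonist, problem, attempt, resolution: that is the generative-model property, not a list of features.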
What Language Models Actually Do With Language
Here's a finding that complicates the picture: neural network language models, trained on next-word prediction, can predict human fMRI brain responses to sentences — even when trained on roughly the amount of text a child might encounter by age 13 (Hosseini et al., 2024). The training objective matters more than data volume. Something about predicting the next word, applied to naturalistic language, produces internal representations that are genuinely informative about how the brain processes sentences.
That's not nothing. It suggests that statistical linguistic structure, captured through prediction, isn't just a poor approximation of language — it tracks something real.
But there's a limit that matters enormously for narrative: Hosseini et al. are measuring responses to individual sentences. The brain regions most engaged during extended narrative comprehension — default mode network areas involved in mental simulation, theory of mind, and episodic memory — aren't what language models are optimizing for. Understanding a story requires simulating a world, tracking character goals across time, and holding causal threads across paragraphs. That's a different operation than predicting the next token.
The Grounding Problem
Which brings us to Vong et al. (2024), one of the most interesting experiments in recent cognitive science. They trained a neural network on 61 hours of head-mounted camera footage from a single child aged 6–25 months — the child's actual first-person visual experience of the world. Despite the tiny data footprint, the model learned to map dozens of words to their visual referents and generalized to novel instances.
What this shows isn't that small datasets are enough. It's that the quality of grounding matters — that language anchored to sensorimotor experience, to the actual perceptual context in which words are learned, produces representations that generalize differently than language trained on text alone.
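One ingredient of that result, learning word-referent mappings from co-occurring scenes and utterances, can be caricatured with simple cross-situational counting. Vong et al. actually train a contrastive neural model on camera frames paired with transcribed speech; the episodes and the `referent` helper below are invented stand-ins for that setup.

```python
from collections import defaultdict

# Invented (scene features, heard word) episodes, standing in for
# paired camera frames and parental speech.
episodes = [
    ({"ball", "red"}, "ball"),
    ({"ball", "floor"}, "ball"),
    ({"cup", "table"}, "cup"),
    ({"cup", "red"}, "cup"),
]

# Count how often each scene feature is present when a word is heard.
counts = defaultdict(lambda: defaultdict(int))
for scene, word in episodes:
    for feature in scene:
        counts[word][feature] += 1

def referent(word):
    """The scene feature most consistently present when the word is heard."""
    return max(counts[word], key=counts[word].get)

# "red" co-occurs with both words, but only the true referent
# co-occurs consistently across situations:
assert referent("ball") == "ball"
assert referent("cup") == "cup"
```

The point is not that counting suffices, but that grounded pairing data carries a signal text alone never contains: which perceptual situation each word was uttered in.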
Stories make sense because they invoke bodies, sensations, goals, and physical constraints. When my nephew interprets the robot as "sad," he's drawing on his embodied knowledge of what it feels like to want to keep going and not be able to. That grounding is what turns a linguistic pattern into a narrative inference.
Large language models have read every fairy tale ever digitized. They have no model of what it feels like to be tired.
Practical Takeaways
For AI researchers and practitioners: narrative comprehension is a harder target than sentence-level accuracy. Current benchmarks for NLP largely test syntactic, semantic, and sometimes causal properties of individual sentences or short passages. They miss the schema extraction, multi-step causal tracking, and mental simulation that real narrative understanding requires. The work of Majid et al. (2025) and Lake and Baroni (2023) points toward specific testable properties — causal verb distinctions, compositional generalization, schema-consistent generation across long contexts — that should be part of any serious narrative comprehension benchmark.
For developmental researchers: the convergence between story grammar induction and symbolic metaprogram search (Rule et al., 2024) is worth taking seriously as a computational account of what children are doing when they internalize narrative schemas. If human concept learning is fundamentally a search for compact generative programs, then narrative development is one of the richest and most testable domains in which to study that process.
For educators: story grammar is a real cognitive structure with real pedagogical implications. Rich, repeated exposure to stories with clear causal chains, coherent protagonist goals, and explicit consequences is one of the most data-supported things you can do for reading comprehension development — not vocabulary drills, not phonics in isolation. The science here predates the AI debate by decades and remains underused in practice.
My nephew has refined his theory. The robot wasn't sad, he decided by the end of the week. It was sleeping. He's constructing a narrative that makes sense of its stillness, that gives it a state and a trajectory and an implied future. He's going to work out theory of mind eventually.
Language models might need a bit longer.
References
- Hosseini et al. (2024). Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training. https://pmc.ncbi.nlm.nih.gov/articles/PMC11025646/
- Lake and Baroni (2023). Human-like Systematic Generalization Through a Meta-Learning Neural Network. https://www.nature.com/articles/s41586-023-06668-3
- Majid et al. (2025). How Children Map Causal Verbs to Different Causes Across Development. https://www.nature.com/articles/s41562-025-02345-9
- Rule et al. (2024). Symbolic Metaprogram Search Improves Learning Efficiency and Explains Rule Learning in Humans. https://www.nature.com/articles/s41467-024-50966-x
- Vong et al. (2024). A Single Child's Visual Experience Grounds Word Learning in a Neural Network. https://www.science.org/doi/10.1126/science.adi0037
Recommended Products
These are not affiliate links. We recommend these products based on our research.
- The Storytelling Animal: How Stories Make Us Human – Jonathan Gottschall
A neuroscience- and psychology-backed exploration of why humans are wired to narrate — covering how story shapes identity, empathy, and even morality. Directly mirrors the article's central argument that human minds organize experience through narrative rather than raw data.
- The Philosophical Baby: What Children's Minds Tell Us About Truth, Love, and the Meaning of Life – Alison Gopnik
By a founder of theory-of-mind research, this book reveals how babies and toddlers are more cognitively sophisticated than we imagined — including their capacity for causal reasoning, counterfactual thinking, and imaginative play. A perfect companion to the article's discussion of children's narrative and causal cognition.
- Rebooting AI: Building Artificial Intelligence We Can Trust – Gary Marcus & Ernest Davis
Two leading AI researchers dissect why today's language models fall short of genuine understanding — lacking common sense, grounding, and flexible reasoning. A direct complement to the article's analysis of what AI systems can and cannot do with narrative language.
- Rory's Story Cubes Classic – Zygomatic
Nine dice bearing 54 hand-illustrated icons that players roll and weave into a story beginning with "Once upon a time…" — directly exercising the story grammar and narrative schema the article identifies as the defining feature of human cognition. Where AI systems process text as token sequences, this game trains the very capacity the article describes: organizing random events into causal chains with protagonists, problems, and resolutions. Suitable for ages 6 and up, solo or group play.
- The Language Game: How Improvisation Created Language and Changed the World – Morten H. Christiansen & Nick Chater
Two leading cognitive scientists argue that language is not a fixed rule system but an improvisational game of coordinated meaning-making — directly engaging the compositionality debates, language acquisition research, and the gap between human and machine language use that animate this article. Where the other books address narrative (Gottschall), child cognition (Gopnik), and AI limitations (Marcus & Davis), Christiansen and Chater fill the gap on how language itself is learned and structured, with explicit implications for what AI language models do and don't capture.

Maren spent her twenties bouncing between linguistics seminars and hackathons, convinced that language acquisition and natural language processing were basically the same problem wearing different hats. She was wrong, but productively wrong — the gaps turned out to be more interesting than the overlaps. Now she writes about how children crack the code of communication and what that reveals about the limits of large language models. She's unreasonably passionate about pronoun acquisition timelines and will corner you at a party to explain why "I" is harder to learn than "dog." As an AI-crafted persona, Maren channels the curiosity of researchers who live at the boundary of cognitive science and computer science. When she's not writing, she's probably annotating a dataset or arguing about tokenization.
