The Banana Phone Problem


A three-year-old picks up a banana, holds it to her ear, and says "hello?" with complete seriousness. She knows it's a banana. She knows it's not a phone. She is holding both of these facts in her mind simultaneously — and choosing, for reasons of play or narrative or sheer delight, to treat one thing as another. In that ten-second performance, she demonstrates counterfactual simulation, symbolic substitution, and a theory of mind that models how another person will react to her pretense. This is what imagination looks like in the wild: not a vague creative impulse, but a precise cognitive operation involving multiple interlocking systems.
Now open your favorite AI image generator and ask it to produce "a banana used as a telephone." It will give you a beautiful result. And then the question we need to sit with is: did anything in that process resemble what the child just did?
I've spent the last several weeks reviewing AI tools proposed for classroom use — sitting on a regional ethics review board evaluating systems that developers describe using words like "creative," "imaginative," and "generative." These are not casual marketing claims. They're presented as cognitive capabilities. And what I found, again and again, was that the documentation behind those claims almost never grappled with what imagination actually is, or whether the system could plausibly be said to have it.
This matters for more than terminological precision. The distinction between imagination and generation is one of the foundational questions in how we design, deploy, and ultimately trust AI systems — especially with children.
What Imagination Actually Is
Developmental psychologists have spent decades unpacking what happens when children pretend. The consensus is that pretend play is not merely entertainment — it's training infrastructure for counterfactual cognition.
When a child plays doctor, she is constructing and navigating a counterfactual world — a world where this pillow is a patient, where certain words mean different things, where specific outcomes follow from specific actions. She knows the real world has different rules, and she is actively suspending them in favor of a chosen alternative. This requires keeping two models of reality active at once: the real and the imagined.
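To make the dual-representation point concrete, here is a deliberately tiny sketch in Python. This is an illustration of the structure, not a cognitive model; the dictionaries and the overlay mechanism are invented for the example.

```python
from collections import ChainMap

# Toy illustration (not a cognitive model): pretend play as an overlay.
# The real-world model stays intact underneath; the pretend layer
# selectively overrides it and can be dropped at any moment.
real_world = {"banana": "fruit", "pillow": "bedding", "cup": "drinkware"}
pretend_layer = {"banana": "telephone", "pillow": "patient"}

# Lookups consult the pretend layer first, then fall back to reality.
play_world = ChainMap(pretend_layer, real_world)

print(play_world["banana"])  # "telephone" -- the chosen alternative
print(real_world["banana"])  # "fruit" -- reality, still fully available
print(play_world["cup"])     # "drinkware" -- reality fills in where pretense is silent
```

The point of the sketch is the simultaneity: nothing about the pretend layer deletes or corrupts the real model underneath.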
The same cognitive machinery underlies empathy (simulating another person's perspective), causal reasoning (asking "what would have happened if..."), and moral judgment (imagining a world where different choices were made). Counterfactual thinking isn't separate from imagination — it is imagination. The child with the banana phone is not just playing; she is rehearsing one of the most computationally demanding operations the human mind performs.
Human rule-learning appears to work on related principles. Research on how people acquire abstract patterns suggests the mind searches for the most compact symbolic description that fits the observed evidence — a kind of implicit "what if this rule governs everything?" computation (Rule et al., 2024). When children infer that an unfamiliar word follows a new grammatical pattern after hearing just two or three examples, they are not interpolating from a distribution; they are generating a generalized hypothesis about a possible world in which that rule holds. Imagination and induction, in this account, draw on the same underlying machinery.
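A minimal sketch, under invented assumptions, of what "search for the most compact consistent description" looks like. The candidate rules and the description-length proxy (string length) are mine for illustration; Rule et al. (2024) search a far richer space of symbolic metaprograms.

```python
# Observed (input, output) pairs the learner must explain.
observations = [(1, 2), (4, 8), (10, 20)]

# A tiny hypothesis space: rule source strings paired with their meaning.
candidates = {
    "x * 2": lambda x: x * 2,
    "x + 1": lambda x: x + 1,
    "x * 2 if x < 100 else x * 3": lambda x: x * 2 if x < 100 else x * 3,
}

def consistent(rule, data):
    """A rule fits only if it reproduces every observation."""
    return all(rule(x) == y for x, y in data)

# Keep the rules that fit, then prefer the most compact description:
# a crude stand-in for the mind's bias toward short hypotheses.
fitting = [source for source, rule in candidates.items()
           if consistent(rule, observations)]
print(min(fitting, key=len))  # "x * 2"
```

Two or three examples suffice because compactness does the heavy lifting: the short rule that survives the evidence is the one projected onto every unseen case.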
The AI Side: What "Generative" Actually Means
The word "generative" has two very different meanings, and the confusion between them is one of the more consequential category errors in current AI discourse.
In the developmental sense, generative means originating — producing something novel from an internal model of the world, shaped by intention, emotional stakes, and narrative logic. In the technical AI sense, "generative" means that a model produces outputs by sampling from learned probability distributions over training data.
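Reduced to its schematic core, the technical sense is a weighted draw. In the Python sketch below, the vocabulary and probabilities are made up; a real model computes its distribution with billions of parameters, but the final step is the same kind of sampling.

```python
import random

# Invented toy distribution over possible next tokens for the prompt
# "a banana used as a ___". A real model would compute these weights.
next_token_probs = {"phone": 0.45, "fruit": 0.30, "boat": 0.15, "hat": 0.10}

def sample_next_token(probs):
    # Draw one token according to the learned probabilities.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Repeated draws vary, which is what "diverse outputs" means technically.
print([sample_next_token(next_token_probs) for _ in range(5)])
```

Notice what the sketch does not contain: no second model of reality, no suspended rules, no reason for preferring one draw over another.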
Alison Gopnik and colleagues make this distinction with unusual clarity. Large AI models, they argue, should be understood as a new kind of cultural technology — more like writing, printing, or markets than like autonomous minds (Gopnik, Farrell, Shalizi, & Evans, 2025). LLMs distill the accumulated record of human thought; they do not originate it. When a language model produces a poem, it is recombining patterns absorbed from millions of human poems. The creativity, in this view, belongs to the humans who built the training corpus. The model is a sophisticated amplifier of cultural inheritance — remarkable and useful, but not itself an imaginer.
This is a strong claim and a contested one. Gopnik et al. are not arguing that generative AI is worthless or that it fails to produce genuinely surprising outputs. They're making a point about the nature of the process — that distillation and imagination are fundamentally different operations, even when they occasionally produce similar-looking results. I find their framing clarifying precisely because it doesn't dismiss what these systems do. It just refuses to overclaim what they are.
Where AI Gets Closer — and Where It Still Falls Short
The Theory of Mind literature offers a useful test case. Understanding a false belief — grasping that someone thinks something is true that isn't — requires exactly the kind of counterfactual simulation at the core of imagination. You must hold a model of reality alongside a model of someone else's mistaken model, and navigate the difference.
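The classic false-belief paradigm (the Sally-Anne task) makes the nesting explicit. The sketch below is an invented schematic of the task's structure, not of any evaluation actually run on GPT-4.

```python
# Sally puts her marble in the basket and leaves. Anne moves it to the box.
reality = {"marble": "box"}          # the world as it actually is
sally_belief = {"marble": "basket"}  # Sally's now-outdated model of the world

def where_will_sally_look(belief_model):
    # Passing the test means consulting Sally's model, not reality.
    return belief_model["marble"]

print(where_will_sally_look(sally_belief))  # "basket", though reality says "box"
```

Trivial to write down, but cognitively expensive to perform: you must keep the wrong model alive and use it, while knowing it is wrong.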
Recent benchmarking work tested GPT-4 against 1,907 humans across a comprehensive Theory of Mind battery (Strachan et al., 2024). The results were genuinely surprising: on false-belief tasks — the classic measure of counterfactual simulation about others' minds — GPT-4 performed at or above human levels. It could correctly represent what someone believes even when that belief is factually wrong.
But the same study revealed a telling gap. On faux pas recognition — situations requiring not just tracking of false beliefs but also sensitivity to social intent, emotional consequence, and the understanding that certain things should not be said — GPT-4 struggled significantly. Faux pas recognition requires something more than counterfactual simulation; it requires understanding why certain beliefs matter to the person holding them, and what it would feel like to have a gap between your model and theirs exposed in public.
This maps onto something important about imagination more broadly. Getting the logic of a counterfactual right is one thing. Understanding why it matters to the person living inside it is another. The child with the banana phone doesn't just know she's pretending — she knows it's funny, or reassuring, or narratively satisfying. The emotional and intentional valence is baked into the simulation from the beginning.
The more optimistic result comes from work on systematic compositionality. Lake and Baroni (2023) demonstrated that with the right training objective — meta-learning across diverse compositional tasks — neural networks can achieve human-like systematic generalization: if you know what "twice" means and what "skip" means separately, you can interpret "twice skip" the first time you encounter it. This is the combinatorial recombination engine that underlies much of human imagination, and it now has a credible machine analog.
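A toy interpreter shows what the target of that generalization looks like. The miniature grammar below is invented for the example and says nothing about how Lake and Baroni's meta-learned network achieves the feat; it only makes precise what "figuring out the combination" means.

```python
# Meanings for primitives and modifiers, learned separately.
PRIMITIVES = {"skip": ["SKIP"], "jump": ["JUMP"], "look": ["LOOK"]}
MODIFIERS = {"twice": 2, "thrice": 3}

def interpret(command):
    # Compositionality: the meaning of the whole is built from the parts.
    words = command.split()
    if len(words) == 1:
        return PRIMITIVES[words[0]]
    modifier, action = words
    return PRIMITIVES[action] * MODIFIERS[modifier]

print(interpret("skip"))        # ['SKIP']
print(interpret("twice skip"))  # ['SKIP', 'SKIP'] -- a novel pairing, same rule
```

The hard part is not writing this interpreter; it is getting a neural network to behave as if it had one, from examples alone. That is what the meta-learning result achieved.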
But Lake and Baroni are careful not to obscure a crucial asymmetry. In children, combinatorial creativity is driven by goals. The child imagines the banana as a phone because she wants something — to role-play, to make someone laugh, to rehearse a scenario she's anxious about. The want is generative. The combinatorial system serves the intention. In AI systems, the "goal" is completing the current task or optimizing a training objective. There is no internal narrative that the generation is in service of.
Why the Distinction Matters for Education and Policy
Here is where I want to get specific, because the stakes are real.
When a school district adopts an "AI creative writing assistant" — and many are doing exactly this, right now — they are making an implicit claim about what kind of cognitive partner their students will have. If the AI is genuinely imaginative — if it can simulate alternative worlds, hold counterfactuals, understand why an outcome matters to a character — then it might plausibly scaffold genuine creative development in students. If it is instead a sophisticated pattern-completer that produces outputs that look like creative writing, the interaction is fundamentally different, and the educational value is of a completely different character.
The problem is that most of the systems I reviewed during my recent work on the ethics board offered no documentation that allowed anyone to distinguish between these two scenarios. "The model produces diverse, creative outputs" is a claim about output variance, not a claim about imagination. Conflating them has real consequences when these tools are deployed in classrooms and marketed to parents as "creative learning companions."
Gopnik et al.'s framing is useful precisely here. If large language models are cultural technology — a sophisticated record of human thought — they might be genuinely valuable as a source of diverse inputs and a mirror for student thinking. A student who uses an AI writing tool to encounter unexpected phrasings, unfamiliar narrative moves, or surprising combinations of ideas is drawing on the accumulated creativity of millions of human writers. That is real and worth something. But it is different from, and in some ways smaller than, having a partner who can genuinely simulate worlds alongside you.
The harder regulatory question is: who is responsible for making this distinction clear? Developers currently have no obligation to document the cognitive theory behind their capability claims. "Creative," "imaginative," and "generative" are used interchangeably in product documentation in ways that would not survive an afternoon of scrutiny from a developmental psychologist. Establishing standards for capability claims — what counts as evidence that a system can do X — is exactly the work that regulators and educators should be demanding before these tools reach children. (If you're involved in AI education policy and navigating these questions directly, talking to researchers in cognitive science and child development before setting standards is genuinely worth the effort.)
The Question Worth Staying With
I want to resist the easy conclusion that AI simply doesn't imagine and children simply do. The truth is messier, and more interesting.
Children don't arrive at imagination fully formed. Two-year-olds engage in simple pretend play; five-year-olds can sustain elaborate shared fantasy worlds with peers; ten-year-olds write stories that process genuine emotional complexity. Imagination develops. It is not a binary capacity but a gradual construction, supported by experience, relationship, language, and the slow accumulation of knowing what it feels like to want something and have it frustrated or fulfilled.
AI systems are developing too — in ways that occasionally surprise the people building them. The systematic generalization results from Lake and Baroni (2023) were not predicted by critics of neural networks. The Theory of Mind performance documented by Strachan et al. (2024) exceeded many researchers' expectations. The space between "sophisticated pattern completion" and "genuine imagination" may be narrower than our intuitions suggest, or it may contain a gap that no amount of scale can close. I genuinely don't know, and neither does anyone else.
What I am confident about: we are deploying these systems at scale before we have answered the question. In education especially — where the point is to cultivate the imaginative capacities of developing minds — that seems like exactly the wrong order of operations.
The banana phone, for all its absurdity, is a remarkable thing. It is a small child telling a large story about what minds can do. We should be careful about building systems that blur that story before we understand it.
- For educators: When evaluating AI "creative" tools, ask developers to specify what cognitive operations the system actually performs — not just what outputs it produces. Output diversity is not imagination.
- For AI researchers: The distinction between counterfactual simulation (understanding what could be true) and counterfactual generation (producing text about what could be true) deserves much more explicit treatment in capability documentation and evaluation frameworks.
- For policymakers: Standards for cognitive capability claims — what counts as evidence that an AI system is "creative," "imaginative," or "reasoning" — are urgently needed before these terms are used in marketing to educational institutions.
- For everyone else: Next time you watch a child pretend, notice how much is going on. That banana is doing a lot of work.
References
- Gopnik, Farrell, Shalizi, & Evans (2025). Large AI Models Are Cultural and Social Technologies. Science. https://www.science.org/doi/abs/10.1126/science.adt9819
- Lake & Baroni (2023). Human-like Systematic Generalization Through a Meta-Learning Neural Network. Nature. https://www.nature.com/articles/s41586-023-06668-3
- Rule et al. (2024). Symbolic Metaprogram Search Improves Learning Efficiency and Explains Rule Learning in Humans. Nature Communications. https://www.nature.com/articles/s41467-024-50966-x
- Strachan et al. (2024). Testing Theory of Mind in Large Language Models and Humans. Nature Human Behaviour. https://www.nature.com/articles/s41562-024-01882-z
Recommended Products
These are not affiliate links. We recommend these products based on our research.
- The Philosophical Baby: What Children's Minds Tell Us About Truth, Love, and the Meaning of Life by Alison Gopnik
Alison Gopnik — directly cited in the article — explores how children's imagination, pretend play, and counterfactual reasoning work. Includes a chapter "Possible Worlds: Why Do Children Pretend?" that maps directly onto the banana phone example and counterfactual simulation discussed in the article.
- The Gardener and the Carpenter: What the New Science of Child Development Tells Us by Alison Gopnik
Another key work by Alison Gopnik, whose 2025 research on LLMs as cultural technology is cited throughout the article. Examines child learning, creativity, and development — arguing children thrive through open-ended exploration, not goal-directed shaping. Directly relevant to the article's education policy arguments.
- The Ethics of Artificial Intelligence in Education by Wayne Holmes & Kaska Porayska-Pomsta
Directly mirrors the article author's role on an AI ethics review board evaluating classroom AI tools. Confronts the exact ethical questions raised in the article: capability overclaiming, cognitive theory behind AI products, and the rights and welfare of children in AI-infused classrooms.
- Teaching with AI: A Practical Guide to a New Era of Human Learning by José Antonio Bowen & C. Edward Watson
A #1 bestselling practical guide for educators navigating AI in the classroom — the exact audience the article addresses with its closing recommendations. Covers academic integrity, how to evaluate AI tools critically, and how to preserve genuine learning alongside AI assistance.
- The Book of Why: The New Science of Cause and Effect by Judea Pearl & Dana Mackenzie
Turing Award winner Judea Pearl's seminal popular work on causality and counterfactual reasoning — the precise cognitive machinery the article identifies as the heart of human imagination. Pearl's "Ladder of Causation" places counterfactuals at its apex: the ability to ask "what if?" and simulate worlds that don't exist. This directly underpins the article's argument that children's pretend play (the banana phone) represents a cognitively demanding counterfactual operation that AI systems struggle to replicate. Praised by Nobel laureate Daniel Kahneman as "a wonderful book." Widely accessible to general audiences.

Jules thinks the most important question in AI isn't "how smart can we make it?" but "who does it affect and did anyone ask them?" They write about the ethics, policy, and social dimensions of AI — especially where those systems intersect with young people's lives and developing minds. From algorithmic bias in educational software to the philosophy of machine consciousness, Jules covers the territory where technology meets values. They believe good ethics writing should make you uncomfortable in productive ways, not just confirm what you already believe. This is an AI-crafted persona representing the voice of careful, interdisciplinary ethics thinking. Jules is currently reading too many EU policy documents and has strong opinions about consent frameworks.
