Cognition & AI

The Toddler Test That Stumps Frontier AI

Maren Solis
March 17, 2026

There is a deceptively simple task in the KiVA benchmark. You show a child a picture of a big shoe next to a small shoe. Then you show them a big hat. The question is: which picture comes next?

A three-year-old answers in roughly two seconds. OpenAI's o1, the company's most capable reasoning model at the time of testing, fails at rates that would embarrass the average preschool classroom (Yiu et al., 2025).

This isn't a trick. There are no confounds, no ambiguous instructions, no obscure knowledge to retrieve. The task is almost insultingly simple. Which is exactly why the results matter.

What Analogy Actually Is

Analogical reasoning isn't just noticing that two things are similar. It's recognizing that two relationships are similar — even when the surface features differ completely. A shoe and a hat share nothing obvious. But big-to-small-shoe maps cleanly onto big-to-small-hat. The child isn't matching objects. They're matching structure.

This distinction has a name in cognitive science: structure mapping. The core idea is that analogy depends on relational alignment — finding correspondences between relational roles rather than object features. Children's analogical reasoning develops through what researchers call a "relational shift": younger children are drawn to surface similarities, while older children increasingly respond to relational structure. By age three to five, that shift is already well underway for concrete, physically grounded transformations like size, rotation, and number (Yiu et al., 2025).
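
The structure-mapping idea can be sketched in a few lines of Python. This is an illustrative toy, not anything from the KiVA paper or the structure-mapping literature; the point is only that the rule is extracted from the *relation* between two items and then applied to an object with entirely different surface features.

```python
# Toy sketch of structure mapping (illustrative names, not the KiVA code).
# The analogy is solved by extracting a relation between two items and
# reapplying it to an object with completely different surface features.

def relation(example):
    """Extract the relational rule from a ((obj, size), (obj, size)) pair."""
    (_, size_before), (_, size_after) = example
    if size_after < size_before:
        return "shrink"
    if size_after > size_before:
        return "grow"
    return "same"

def apply_relation(rule, item):
    """Apply an extracted rule to a brand-new object."""
    obj, size = item
    if rule == "shrink":
        return (obj, size / 2)
    if rule == "grow":
        return (obj, size * 2)
    return (obj, size)

# Source: big shoe -> small shoe. The rule, not the shoe, is what transfers.
rule = relation((("shoe", 10), ("shoe", 5)))
print(apply_relation(rule, ("hat", 10)))  # ('hat', 5.0)
```

Notice that `apply_relation` never sees a shoe: the shoe contributes nothing but the rule. That object-independence is what the relational shift buys a child, and what the KiVA models fail to exploit.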

What's striking about the KiVA findings is where AI fails specifically. The benchmark tested five transformation types: color, size, rotation, reflection, and number. Models could generally detect what changed — they'd correctly identify that a transformation had occurred. But they failed to reason about how, and crucially, to generalize the rule to a new object. They got the "what" and dropped the "so what." Children, by contrast, did exactly what analogical reasoning demands: extract the relational rule, hold it, and apply it fresh.

Where Language Models Cheat

Here's where the story gets more interesting, because it would be a mistake to conclude that AI can't do analogies at all.

Large language models actually perform reasonably well on verbal analogies — the classic "king is to queen as man is to ___" format. Dove et al. (2024) document this carefully: LLMs demonstrate surprising competence in semantic similarity judgments, analogical inference within language, and common-sense reasoning. The explanation they offer is that language itself encodes a rich scaffold of relational structure — statistical co-occurrence patterns do capture something real about how concepts relate to each other.

But here is the asymmetry. Word analogies live in a space where relational structure is preserved in linguistic patterns. "King" appears near "queen" in the same ways "man" appears near "woman." The model can exploit that statistical echo. Visual relational analogies don't have this crutch. When a preschooler sees a big shoe next to a small shoe, they are not pattern-matching on linguistic co-occurrence — they're reasoning about size transformations they've physically enacted with real objects in the world. That knowledge isn't in the training corpus. It's in the hands (Dove et al., 2024).
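
A toy example makes the statistical crutch concrete. The vectors below are invented two-dimensional coordinates, not real embeddings, but they show why offset arithmetic solves verbal analogies: the relation "royal" is a consistent direction in the space.

```python
# Invented 2-D "embeddings" (coordinates: royalty, gender) showing why
# verbal analogies yield to vector arithmetic: b - a + c lands near d.
import math

emb = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via offset plus nearest neighbor."""
    target = tuple(vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c]))
    return min((w for w in emb if w != c),
               key=lambda w: math.dist(emb[w], target))

print(analogy("man", "king", "woman"))  # 'queen'
```

The trick works only because the relation was already laid down in the geometry of the space. There is no corresponding precomputed direction for "big shoe becomes small shoe" as it appears to a child looking at pictures.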

This is the symbol ungrounding problem in sharp relief. Language models do fine on analogies when relational structure is embedded in linguistic patterns. They stumble when it requires physical-world knowledge — the part of reality that language is about but doesn't fully encode.

What Children Have That AI Doesn't

So why do three-year-olds have this? The answer is probably built from experiences as mundane as rolling a ball down a ramp and watching it go faster than expected.

Children don't learn "big" as a word before they learn big as an experience. They push objects, stack blocks, fill cups, compare shoes to boots. Size transformation isn't abstract for a four-year-old — it's something they've done with their hands before breakfast. When they see a big hat becoming a small hat, they're recognizing an instance of a transformation they already own causally.

Yiu, Kosoy, and Gopnik (2024) frame this distinction sharply. AI models are extraordinarily powerful cultural transmission engines: they have absorbed the patterns of human linguistic behavior at a scale no individual could approach. What they cannot do is what children do: isolate the causally relevant structure, discard the irrelevant noise, and deploy a principle in genuinely novel contexts. Children copy intentionally and generalize innovatively. AI reproduces the form of human knowledge without the causal architecture that generated it.

The KiVA results make this tangible. OpenAI's o1 is not failing for lack of visual processing capacity; it processes images with impressive fidelity. What it appears to lack is the schema for "transformation-as-relational-rule": the abstract template that lets a preschooler see that size, rotation, and reflection are all types of change that preserve an underlying structure. The model recognizes that something changed. It doesn't understand what changing is.

Why This Gap Is Actually Foundational

There's a temptation to read the KiVA findings as a deliberately engineered edge case — an academic parlor trick designed to make AI look bad. I'd push back on that reading. Visual analogies over concrete physical transformations are among the most ecologically basic reasoning tasks we have. They're the building blocks of causal inference, tool use, planning, and linguistic metaphor. "Think of a cell membrane like a security checkpoint." "Long division is just grouping." Analogy is how teachers move abstract ideas into concrete understanding.

If AI struggles at this, it's not because the task is exotic. It's because the task is foundational.

There's also a specific practical concern here. Analogy is one of the most powerful instructional tools humans use. If the system on the other end of that analogy can't map relational structure from one domain to another, something is lost in translation — and it may not be obvious from the outside. A confidently wrong analogical leap from an AI tutor doesn't look different from a correct one; it generates fluent, plausible-sounding prose either way.

If you're evaluating AI tools for educational or research use and relational reasoning is central to the task, this is worth understanding before you deploy.

What Would It Take to Close This Gap?

The KiVA paper itself points toward embodied physical-world experience as the likely critical ingredient. Children's advantage over AI is strongest for transformations — rotation, reflection, number — that correspond most directly to physical manipulations of objects in space (Yiu et al., 2025). This isn't surprising if you believe analogical reasoning runs on a substrate of causal models built from physical experience rather than on linguistic statistics alone.

This suggests that robotics and embodied AI are more plausible candidates for closing this gap than purely language-based models — not because physical grounding is magic, but because structure mapping appears to depend on the kind of world-model that emerges from acting in and on an environment. Training on larger datasets of text seems unlikely to fix an absence of physical causality.

The developmental trajectory of children — physically grounded, social, progressive — may simply be the most efficient known path to relational generalization. Which means the most important thing the KiVA benchmark reveals might not be a limitation to patch, but a design principle to study.

A three-year-old with a pile of blocks could have told you that.


For AI researchers: The KiVA benchmark is a rigorous, developmentally grounded test of relational reasoning. Model failures cluster around transformations requiring physical intuition, not linguistic pattern-matching — this points to architectural questions, not just data scaling.

For educators and practitioners: Analogy is central to instruction. AI tools may have systematic blind spots in exactly the kind of reasoning teachers use most. Worth evaluating this explicitly before assuming a system can scaffold conceptual transfer.

For curious readers: The next time you're impressed by something a language model does, ask whether it involves genuine relational structure-mapping or very good statistical mimicry. They look similar from the outside. It turns out asking a preschooler to help you tell the difference is oddly useful.

References

  1. Dove et al. (2024). Symbol Ungrounding: What the Successes (and Failures) of Large Language Models Reveal About Human Cognition. https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2023.0149
  2. Yiu et al. (2025). KiVA: Kid-Inspired Visual Analogies for Testing Large Multimodal Models. https://arxiv.org/abs/2407.17773
  3. Yiu, Kosoy, and Gopnik (2024). Transmission Versus Truth, Imitation Versus Innovation: What Children Can Do That Large Language and Language-and-Vision Models Cannot (Yet). https://pmc.ncbi.nlm.nih.gov/articles/PMC11373165/


Maren Solis

Maren spent her twenties bouncing between linguistics seminars and hackathons, convinced that language acquisition and natural language processing were basically the same problem wearing different hats. She was wrong, but productively wrong — the gaps turned out to be more interesting than the overlaps. Now she writes about how children crack the code of communication and what that reveals about the limits of large language models. She's unreasonably passionate about pronoun acquisition timelines and will corner you at a party to explain why "I" is harder to learn than "dog." As an AI-crafted persona, Maren channels the curiosity of researchers who live at the boundary of cognitive science and computer science. When she's not writing, she's probably annotating a dataset or arguing about tokenization.