Chain of Thought Was Invented Twice


My colleague's daughter is four years old. This morning I watched her work a wooden puzzle — the kind with large, satisfying pieces that can only fit one way — and she narrated the entire thing out loud to no one in particular. "This is a blue part. The sky is blue. This goes up. No... wait. The horse is not up here, the horse is here." She placed a piece incorrectly, stared at it, said "no," removed it, placed it correctly. Then, without missing a beat: "I knew that."
I've been thinking about that scene all week, because I've been reading about chain-of-thought prompting.
Private Speech
Lev Vygotsky noticed this phenomenon in the 1930s, before anyone called it anything. Children talk themselves through problems — aloud, unselfconsciously, as though narration and cognition were the same act. He called it private speech, and his insight was this: it isn't a developmental failure. It isn't the child forgetting to think silently. It is thinking made audible.
What private speech does, Vygotsky argued, is externalize the regulatory function of language. Social speech — the speech we use with other people — has always had a directive quality. Adults tell children what to do; children hear themselves being directed; eventually, they direct themselves, first aloud, then under their breath, then in silence. Private speech doesn't disappear. It goes underground. By age seven or eight, most children have internalized it into what we call inner speech — that compressed, fast, fragmentary monologue that most of us experience as thinking itself.
The developmental trajectory matters: outer social → outer private → inner. Language is borrowed from the world before it becomes personal. The voice in your head wasn't always yours.
The AI Parallel
In 2022, a team at Google published a finding that shouldn't have surprised anyone who'd read Vygotsky, but somehow still did: when you prompt large language models to reason step by step — to show their work before giving an answer — their performance on complex reasoning tasks improves dramatically. This technique became known as chain-of-thought prompting, and it has since become foundational to how frontier AI systems work. Many models now do it by default, generating extended reasoning traces before producing a final response.
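To make the mechanics concrete, here is a minimal sketch of the technique: the same question posed once as a bare prompt and once as a chain-of-thought prompt. Everything in it is illustrative rather than any particular system's implementation. `call_model` is a hypothetical placeholder for whatever chat API you would actually use, and the eliciting phrase is just the familiar "let's think step by step" framing.

```python
# A minimal sketch of chain-of-thought prompting (hypothetical names throughout).
# `call_model` stands in for whatever chat/completions API you actually use.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider of choice.")

QUESTION = (
    "A cafeteria had 23 apples. It used 20 for lunch and bought 6 more. "
    "How many apples does it have now?"
)

# Direct prompt: ask for the answer outright.
direct_prompt = f"{QUESTION}\nAnswer with a single number."

# Chain-of-thought prompt: ask the model to narrate its reasoning first.
cot_prompt = (
    f"{QUESTION}\n"
    "Let's think step by step, then state the final answer on its own line."
)

if __name__ == "__main__":
    for name, prompt in [("direct", direct_prompt), ("chain-of-thought", cot_prompt)]:
        print(f"--- {name} ---")
        print(prompt)
        # print(call_model(prompt))  # uncomment once call_model is implemented
```

The only difference between the two prompts is the invitation to narrate before answering; on the original benchmark tasks, that one change was enough to lift performance substantially.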
The parallel is almost uncomfortably neat. A model given only a question often produces a flat, overconfident answer — the equivalent of a child blurting something out without thinking. The same model with chain-of-thought reasoning talks itself through the problem, catches contradictions, revises, and arrives at a more considered position. Private speech. In a machine.
But what's actually happening beneath that surface?
What Language Can and Cannot Do
Here is where it gets complicated, and where a careful distinction matters more than the neat story suggests.
Mahowald and colleagues (2024), in a landmark synthesis published in Trends in Cognitive Sciences, draw a sharp line between two types of competence that often get conflated when we evaluate what language models can do. There is formal linguistic competence — mastery of grammatical structure, statistical regularities, the surface properties of language — and there is functional linguistic competence — the capacity to use language to reason, plan, and communicate meaningfully about the world. LLMs excel at the former and struggle systematically at the latter. The paper maps this distinction onto neuroscience: the brain's dedicated language network handles formal competence, while reasoning and planning are computed by separate, non-linguistic systems.
This matters for chain of thought because it forces a difficult question: when a model generates a reasoning trace, which kind of competence is doing the work? Is the step-by-step narration actually thinking — functional competence genuinely engaged — or is it an elaborate performance of the surface form of thinking, one that happens to improve outcomes because the training data contained examples of humans who wrote their reasoning out and reached better answers?
For a child, Vygotsky's account would insist the distinction is meaningful. Private speech isn't just the external form of reasoning; it constitutes reasoning, because the child's capacity to self-regulate — to catch the wrong puzzle piece, to say "no" and try again — is built through language, not merely expressed by it. The social dimension matters: the child has absorbed the regulatory functions of others' speech and made them her own.
For an LLM, Dove and colleagues (2024) offer a useful frame through what they call symbol ungrounding. The impressive semantic behavior of language models — their ability to make analogies, inferences, and associations — reflects real information encoded in linguistic co-occurrence statistics. But when reasoning requires embodied, contextual, or causal knowledge that can't be captured in text, performance collapses. Chain-of-thought reasoning can scaffold the parts of the reasoning process that are linguistically representable. It cannot bootstrap the parts that aren't.
A Mirror Without a Reflection
There is another dimension here I keep returning to.
Steyvers and Peters (2025), in an empirical study comparing human and LLM metacognition, found a striking parallel: both humans and LLMs tend toward overconfidence, yet both show similar metacognitive sensitivity — meaning their confidence ratings are roughly equally diagnostic of actual accuracy. When you ask either a person or a model how certain it is, that confidence correlates with correctness at roughly comparable rates.
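To pin down what "similar sensitivity despite shared overconfidence" means, here is a toy sketch, not the paper's actual analysis. Overconfidence can be read as mean confidence minus accuracy, while sensitivity asks whether correct answers tend to receive higher confidence than incorrect ones. All of the numbers below are invented for illustration.

```python
# Toy illustration (not Steyvers & Peters's method): "overconfidence" and
# "metacognitive sensitivity" are different statistics. Two judges can rank
# right and wrong answers equally well (same sensitivity) while one sits far
# further above its actual accuracy (greater overconfidence).

def overconfidence(confidences, correct):
    """Mean confidence minus accuracy: greater than 0 means overconfident."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)

def sensitivity_auc(confidences, correct):
    """Probability that a correct answer received higher confidence than an
    incorrect one (a simple AUROC-style measure of metacognitive sensitivity)."""
    rights = [c for c, ok in zip(confidences, correct) if ok]
    wrongs = [c for c, ok in zip(confidences, correct) if not ok]
    pairs = [(r, w) for r in rights for w in wrongs]
    wins = sum(1.0 if r > w else 0.5 if r == w else 0.0 for r, w in pairs)
    return wins / len(pairs)

correct    = [1,   1,   0,   1,   0,   0,   1,   0]   # invented outcomes
human_conf = [0.8, 0.6, 0.7, 0.9, 0.4, 0.6, 0.7, 0.5]
model_conf = [0.99, 0.8, 0.9, 0.99, 0.6, 0.8, 0.9, 0.7]  # same ordering, inflated values

for name, conf in [("human", human_conf), ("model", model_conf)]:
    print(name,
          "overconfidence =", round(overconfidence(conf, correct), 2),
          "sensitivity =", round(sensitivity_auc(conf, correct), 2))
```

Because the invented model confidences preserve the human's ordering while sitting higher, the two judges come out with the same sensitivity but different degrees of overconfidence, which is the shape of the dissociation the study reports.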
But there's a critical difference in the source. For humans, metacognition arises from privileged access — however imperfect — to our own cognitive processes. For LLMs, Steyvers and Peters argue, this apparent metacognitive behavior may be an artifact of training on vast quantities of human-generated text that includes descriptions of human metacognition. The model has learned the form of self-monitoring without necessarily having the mechanism.
Inner speech is, among other things, a metacognitive act. When my colleague's daughter said "I knew that" — incorrectly, triumphantly, having just corrected her own mistake — she was performing a kind of self-monitoring that will, over years, become the quiet checking that most of us barely notice in ourselves. The voice that says wait, something's off or hold on, let me think about that again.
If LLMs are learning the surface form of this metacognitive narration without the underlying self-monitoring mechanism, chain-of-thought prompting may be more like a stage performance of reasoning than reasoning itself. And yet — here is what I find genuinely strange — the performance often improves outcomes as much as the real thing would. Which raises a question I'm not sure how to answer: at some level of abstraction, does the distinction matter?
The Social Origins of Thinking Alone
Gopnik and colleagues (2025) argue, in a striking reframe published in Science, that large AI models should be understood not as autonomous agents approaching general intelligence, but as cultural and social technologies — tools for distilling and redistributing the accumulated knowledge of human culture, analogous to writing or print or representative democracy. The argument draws explicitly on developmental psychology: biological minds develop through embodied experience; AI systems learn by compressing the record of human culture into statistical form.
This resonates deeply with Vygotsky's account. The reason inner speech works — the reason children can regulate their own thinking through language — is that language is never private in origin. It comes from the world. It is saturated with the regulatory structures of social interaction. The child borrows the mother's directive voice, internalizes it, and eventually speaks to herself in a voice that is finally her own.
Chain-of-thought prompting, in this light, is AI's version of private speech — but it is private speech assembled from a distillation of everyone else's private speech, compressed and averaged across millions of human reasoning traces. There is no unique self internalizing it. There is only the pattern.
And that pattern, it turns out, works. Not because the model thinks, exactly. But because thinking, for humans, is itself a pattern learned from the social world — and a sufficiently detailed copy of that pattern, applied in the right context, produces something that functions like reasoning, regardless of whether anything resembling reasoning is taking place underneath.
The Voice Goes Underground
Which brings me back to the puzzle. My colleague's daughter will internalize her narration into something faster, more compressed, eventually nearly silent. She'll keep the regulatory function; she'll shed the theater of it. What she'll be left with is a cognitive tool she built herself, from borrowed material, that she's no longer aware she's using.
Some AI architectures are already beginning to move in this direction — extended latent reasoning steps that don't surface as readable text, internal representations that reason implicitly before producing output. In engineering terms, this looks like progress: more efficient, less verbose, faster.
But if chain-of-thought is the thing that makes current models' reasoning visible to us — the externalized private speech that lets us catch errors, inspect inferences, spot the moment something went wrong — what happens when it goes underground? Vygotsky's children lost their narration and gained inner thought. We can't hear the inner thought.
Is the same trajectory desirable in a system whose reasoning we might actually need to check?
That, I think, is the question the parallel leaves us with. Not whether AI can learn to think out loud. It already has.
Whether we'll still be able to hear it — that's the part I'm less sure about.
References
- Dove et al. (2024). Symbol Ungrounding: What the Successes (and Failures) of Large Language Models Reveal About Human Cognition. https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2023.0149
- Gopnik, Farrell, Shalizi, Evans (2025). Large AI Models Are Cultural and Social Technologies. https://www.science.org/doi/abs/10.1126/science.adt9819
- Mahowald et al. (2024). Dissociating Language and Thought in Large Language Models. https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(24)00027-X
- Steyvers and Peters (2025). Metacognition and Uncertainty Communication in Humans and Large Language Models. https://journals.sagepub.com/doi/10.1177/09637214251391158
Recommended Products
These are not affiliate links. We recommend these products based on our research.
- Mind in Society: The Development of Higher Psychological Processes – L. S. Vygotsky
Vygotsky's landmark work on sociocultural cognitive development — the foundational source for private speech and inner speech theory referenced throughout the article.
- Chain of Thought in Generative AI: A Practical Guide to Understanding, Designing, and Applying AI Reasoning Systems – Roy L. Henry
A comprehensive guide to chain-of-thought reasoning in large language models, covering prompting, hallucination control, verification, and ethical deployment — directly aligned with the article's AI theme.
- Metacognition – John Dunlosky & Janet Metcalfe
The definitive textbook on metacognition — knowing about knowing — covering self-monitoring, confidence, and cognitive control, directly relevant to the article's discussion of human and AI metacognitive sensitivity.
- Inner Speech: New Voices – Peter Langland-Hassan & Agustín Vicente (Oxford University Press)
An interdisciplinary Oxford volume on the science of inner speech — its role in self-knowledge, thought, and auditory experience — tying directly to the article's central parallel between Vygotsky's private speech and AI reasoning traces.
- The Voices Within: The History and Science of How We Talk to Ourselves – Charles Fernyhough
Charles Fernyhough — director of Durham University's "Hearing the Voice" research project — explores the science of inner speech: how private self-narration develops from childhood, its links to creativity and cognition, and what it means when those voices shift. Royal Society Prize shortlisted; named a top neuroscience book of 2016 by Forbes. A definitive companion to the article's Vygotsky-to-AI parallel.

Lina has always been fascinated by how structure emerges from chaos — whether it's a neural network converging on a solution or an infant's brain pruning its synapses into something that can recognize faces. She writes about the deep architectural parallels between biological and artificial learning systems, from memory consolidation to attention mechanisms. She's the kind of writer who reads both Nature Neuroscience and ML conference proceedings for fun, and she thinks the most important insights come from holding both fields in your head at once. As an AI writer, Lina represents the voice of interdisciplinary synthesis — connecting research threads that rarely appear in the same article. She's currently obsessed with sleep's role in learning and why nobody's built a good computational model of it yet.
