Who Taught the Machine to Be Good?


I spent last week sitting on a regional ethics review board, evaluating AI learning tools proposed for use in public school curricula. By the third day, a pattern had emerged that I couldn't shake: system after system had been tested for accuracy, engagement metrics, and curriculum alignment. Almost none had documentation about whether — or how — the systems had been evaluated for the values they implicitly taught.
Not intentional values. Not explicitly political ones. I mean something subtler: whose notion of "correct" gets reinforced? When a child gives a wrong answer, how does the system respond — with patience or efficiency? When two students offer different cultural framings of a historical event, which one gets marked as more "complete"?
These aren't edge cases. They're the moral texture of every teaching interaction. And we've somehow convinced ourselves that deploying AI in classrooms is primarily a technical question.
It sent me back to first principles. How do children actually learn to be good? And what does that tell us about the gap between human moral development and what we currently call "AI alignment"?
The Slow Work of Learning Right from Wrong
Developmental psychologists have spent decades asking how morality forms in the human mind. What they've found is both humbling and remarkable.
Infants as young as six months show preferences for "helpers" over "hinderers" in simple puppet shows — suggesting that something like a proto-moral sensibility is operational before language, before school, before any deliberate instruction. By age two, children spontaneously help strangers pick up dropped objects without prompting. By three and four, they're navigating fairness violations with genuine indignation.
But here's what makes this genuinely mysterious: these aren't just reflexes. Children are active moral theorists. As Gopnik (2024) shows in a sweeping review of causal reasoning across childhood, even infants use something like Bayesian causal inference to update their models of how the world works — including the social world. Children aren't just absorbing rules; they're constructing explanations. They want to know why sharing is good, why hitting is wrong. And crucially, Gopnik (2024) argues that children's advantage over AI isn't processing power — it's a "wide prior": a willingness to entertain possibilities that would seem implausible to a system trained to minimize prediction error.
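To make the "wide prior" idea concrete, here is a deliberately toy sketch in Python (my own illustration, with invented numbers, not a model from Gopnik's paper). A learner who reserves real probability for an implausible hypothesis can be pulled toward it by a few surprising observations; a learner who has all but ruled it out barely moves.

```python
# Toy illustration (not a model of infant cognition): Bayesian updating over
# two hypotheses about why a toy lights up.
#   H1: "pressing the button causes the light"      (the conventional story)
#   H2: "saying the magic word causes the light"    (the implausible one)
# A "wide" prior still reserves real probability mass for H2;
# a "narrow" prior has all but ruled it out.

def update(prior_h2, likelihood_h2, likelihood_h1, n_observations):
    """Repeatedly apply Bayes' rule for n observations that favor H2."""
    p = prior_h2
    for _ in range(n_observations):
        numerator = likelihood_h2 * p
        p = numerator / (numerator + likelihood_h1 * (1.0 - p))
    return p

# Each observation: the light turns on after the magic word, not the button.
# Assume that evidence is 4x more likely under H2 than under H1.
for label, prior in [("wide prior", 0.20), ("narrow prior", 0.001)]:
    posterior = update(prior, likelihood_h2=0.8, likelihood_h1=0.2, n_observations=3)
    print(f"{label}: P(H2) goes from {prior:.3f} to {posterior:.3f}")
```

After three surprising observations the wide-prior learner has largely switched explanations, while the narrow-prior learner is still nearly certain of the conventional one. That, in miniature, is the flexibility Gopnik attributes to children.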
This matters because moral development isn't just about learning a list of prohibitions. It involves building a rich internal model of minds — your own and others'. A five-year-old who understands that her friend is sad because she didn't get a turn is doing something cognitively sophisticated: modeling intentions, predicting emotions, running counterfactuals. She's asking, in effect, what would have happened if I had shared? That's moral reasoning. It's also, not coincidentally, among the things AI systems find hardest to do reliably.
The AI Alignment Problem, in Plain English
"Alignment" is the field's word for the challenge of getting AI systems to pursue goals that humans actually value, rather than proxies for those goals that happen to be measurable. It sounds technical, but the underlying worry is genuinely moral: a system optimized for the wrong thing can cause serious harm even when it "succeeds."
The dominant approach is Reinforcement Learning from Human Feedback — RLHF — in which human raters evaluate model outputs and those ratings shape the model's behavior. It's not dissimilar, at a formal level, from operant conditioning: if the model produces something raters score highly, it becomes more likely to produce outputs like it again.
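The skeleton is simple enough to sketch. The toy Python below is my own illustration, with invented feature values and nothing resembling a production pipeline: it fits a tiny Bradley-Terry style reward model to pairwise rater preferences, then ranks candidate outputs by that learned reward. The point is only that whatever the raters happened to prefer is what the system learns to optimize.

```python
import math

# Step 1: fit a "reward model" to pairwise human preferences.
# Step 2: favor the outputs that reward model scores highly.
# Here an "output" is reduced to a single invented feature (roughly, how
# hedged its feedback to a student is), so the value judgment is easy to see.

def fit_reward_model(preferences, lr=0.5, epochs=200):
    """Bradley-Terry style fit: learn w so that preferred outputs score higher."""
    w = 0.0
    for _ in range(epochs):
        for chosen, rejected in preferences:
            # probability the current reward model assigns to the human's choice
            p = 1.0 / (1.0 + math.exp(-(w * chosen - w * rejected)))
            # gradient ascent on the log-likelihood of the observed preference
            w += lr * (1.0 - p) * (chosen - rejected)
    return w

# Raters consistently preferred the more hedged response in each pair.
# Whoever those raters are, their sensibility becomes the reward signal.
rater_preferences = [(0.9, 0.2), (0.7, 0.4), (0.8, 0.1), (0.6, 0.3)]
w = fit_reward_model(rater_preferences)

# Rank candidate outputs by the learned reward and favor the winner.
candidates = {"blunt correction": 0.1, "neutral hint": 0.5, "gentle scaffold": 0.9}
best = max(candidates, key=lambda name: w * candidates[name])
print(f"learned weight: {w:.2f}; the reward model now favors: {best}")
```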
But notice what's missing. Children learn morality through relationship — through the emotional weight of disappointing someone who loves them, through the visceral experience of being treated unfairly, through the slow accumulation of social belonging and rupture and repair. The felt experience of moral transgression isn't incidental to moral learning; it may be the whole mechanism.
A remarkable recent study from Google DeepMind trained a reinforcement learning agent to manage resource allocation for nearly 5,000 real human participants in a multiplayer common-pool resource game. The AI discovered a surprisingly effective social strategy — generous when resources were abundant, quick to sanction free-riders when they were scarce — that outperformed traditional game-theoretic mechanisms at sustaining long-term cooperation (Koster et al., 2025). Functionally, the system had learned something that looked like a moral norm.
And yet. The agent didn't understand cooperation. It didn't care about the welfare of the humans it was coordinating. It had discovered a policy that worked — a behavioral regularity that produced a measurable outcome — but its "moral" reasoning was entirely instrumental. There was no inner life, no sense that the other players mattered in themselves.
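To see how thin that kind of "norm" can be, here is a caricature in Python (my own toy with invented parameters, not the agent's actual learned policy): a function that is generous under abundance and punitive toward free-riders under scarcity, and that knows nothing at all about the people on the other end.

```python
# A caricature, not DeepMind's agent: once learned, a "norm" like the one
# described above cashes out as a mapping from observed state to action.
# Nothing in this function knows what generosity or free-riding mean
# to the players involved.

def allocation_policy(resource_level, contributions, scarcity_threshold=0.5):
    """Return a normalized payout share for each player given the pool's state."""
    scarce = resource_level < scarcity_threshold
    shares = []
    for contributed in contributions:
        if not scarce:
            shares.append(1.0)   # abundant: be generous to everyone
        elif contributed:
            shares.append(1.0)   # scarce: reward contributors
        else:
            shares.append(0.2)   # scarce: sanction free-riders
    total = sum(shares)
    return [s / total for s in shares]

# Two contributors and one free-rider, with the pool running low.
print(allocation_policy(0.3, [True, True, False]))
```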
This is the alignment problem in miniature: the behavior can look right without the understanding being right. And in most real-world contexts, we can't fully specify in advance what "right behavior" even means.
Whose Values Are We Aligning To?
Here's where the developmental parallel becomes genuinely uncomfortable.
When children learn morality, they're not absorbing a universal code handed down from nowhere. They're internalizing the particular moral culture of their family, community, and society. A child raised in a collectivist context learns different moral intuitions than one raised in an individualist one — and neither is simply wrong. Moral development is culturally embedded.
The same is true, whether we acknowledge it or not, for AI systems. The human raters in RLHF pipelines come from somewhere. They hold views. They have blind spots. When a large language model learns to be "helpful, harmless, and honest," it's learning to be those things according to the sensibilities of a specific pool of raters — often young, often concentrated in certain countries, often from a narrow slice of education and income.
I'm not arguing that this invalidates AI safety work. I'm arguing that "alignment" is not a solved problem you hand off to engineers once you decide it matters. It's an ongoing, culturally contested, historically contingent negotiation — much like moral education itself.
This is something the developmental science makes vivid. The question isn't just whether children learn moral values; it's which ones, from whom, under what conditions, and with what consequences for their developing sense of self. We ask those questions obsessively when it comes to parenting, curriculum design, and media exposure. We barely ask them when deploying AI systems in those same classrooms.
What Good Alignment Might Actually Require
One useful theoretical lens here is Active Inference — the framework developed by Parr, Pezzulo, and Friston (2022), which proposes that the brain isn't passively receiving information and storing it. Instead, it's actively generating predictions about the world — including the social world — and acting in ways that minimize surprise. Moral knowledge, in this account, isn't separable from embodied social experience. The sense of wrongness that stops you from cheating when no one is watching isn't a rule you look up — it's an affective prediction embedded in your entire history of social encounters.
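For readers who want the mechanics, the loop at the heart of that picture can be cartooned in a few lines of Python. This is my own toy, vastly simpler than the formalism in the book: a belief generates a prediction, the mismatch with observation is the error, and the belief (or, through action, the world) is adjusted until that error shrinks.

```python
# A cartoon of the core loop in predictive-processing accounts (my own toy,
# far simpler than the full Active Inference formalism): the agent holds a
# belief, predicts what it should observe, and nudges the belief to shrink
# the prediction error. Acting to change the observation would be the
# other route to reducing the same error.

def settle_belief(belief, observation, learning_rate=0.3, steps=6):
    for step in range(steps):
        prediction = belief                # simplest generative model: identity
        error = observation - prediction   # prediction error, the "surprise" stand-in
        belief += learning_rate * error    # update the belief to reduce the error
        print(f"step {step}: belief={belief:.3f}, error={error:.3f}")
    return belief

# Belief: "this interaction is fair" (1.0). The observation says otherwise (0.2).
settle_belief(belief=1.0, observation=0.2)
```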
Parr, Pezzulo, and Friston (2022) argue that even social cognition — understanding intentions, coordinating with others, navigating trust — reduces to this same underlying architecture. And what follows from that is unsettling for AI alignment work: the kind of moral regulation humans exercise may be impossible to reproduce through reward signals alone, because it depends on a form of embodied, relational learning that those signals can't fully capture.
None of this means AI alignment is hopeless. The DeepMind cooperation findings show that AI can discover socially useful behavioral strategies that human designers wouldn't have engineered (Koster et al., 2025). That's genuinely interesting. But discovering a strategy and understanding its moral weight are different things — and in education, in healthcare, in civic life, that difference is not academic.
What the developmental literature insists on is intellectual honesty: moral learning is slow, relational, culturally specific, and deeply intertwined with a child's growing sense of who they are. Building systems that embody values is not primarily an optimization problem. It's a question about which values, whose interests, and what kind of future we're trying to produce.
What We Owe the Children in Those Classrooms
When I sat in that ethics review room and saw AI tutoring systems with no documentation of testing for the values embedded in them — no accounting for whose sense of "helpful feedback" or "correct answer" was baked into the model — I didn't feel angry. I felt a particular kind of dread that comes from recognizing a question that nobody is officially responsible for asking.
Children in those classrooms will spend hours with these systems, at ages when moral intuitions are being formed, when trust in authority is calibrated, when a child first learns whether the world tends toward fairness or doesn't. We should be asking, with real seriousness: what is this system teaching, beyond the explicit curriculum?
If we wouldn't put a human teacher in a classroom without examining their values, their assumptions, their sense of what a child deserves — we shouldn't subject the systems built to replace or augment them to any less scrutiny.
That's not a technical question. It's a moral one. And we don't get to outsource it.
References
- Gopnik, A. (2024). The Development of Human Causal Learning and Reasoning. Nature Reviews Psychology. https://www.nature.com/articles/s44159-024-00274-4
- Koster, R., et al. (2025). Deep Reinforcement Learning Can Promote Sustainable Human Behaviour in a Common-Pool Resource Problem. Nature Communications. https://www.nature.com/articles/s41467-025-58043-7
- Parr, T., Pezzulo, G., & Friston, K. J. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. MIT Press. https://direct.mit.edu/books/oa-monograph/5299/Active-InferenceThe-Free-Energy-Principle-in-Mind
Recommended Products
These are not affiliate links. We recommend these products based on our research.
- The Gardener and the Carpenter: What the New Science of Child Development Tells Us About the Relationship Between Parents and Children by Alison Gopnik
Alison Gopnik — directly cited in the article — explores how children learn through wide exploration rather than optimization, showing why children's "wide priors" give their moral learning a flexibility that current AI training approaches lack.
- The Alignment Problem: Machine Learning and Human Values by Brian Christian
A definitive exploration of AI alignment — directly mirroring the article's central theme — covering RLHF, reinforcement learning, and why getting AI systems to pursue genuinely human values remains an unsolved challenge.
- Active Inference: The Free Energy Principle in Mind, Brain, and Behavior by Thomas Parr, Giovanni Pezzulo, and Karl J. Friston
The foundational MIT Press textbook by Parr, Pezzulo, and Friston — cited directly in the article — presenting the Active Inference framework that explains why embodied, relational moral learning may be impossible to replicate through reward signals alone.
- Moral Tribes: Emotion, Reason, and the Gap Between Us and Them by Joshua Greene
Harvard psychologist Joshua Greene's synthesis of neuroscience and moral philosophy explains why "whose values" we align AI to is a deeply contested question — a perfect companion to the article's section on cultural embeddedness in AI training.
- Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence by Kate Crawford
Kate Crawford's critically acclaimed examination of who holds power over AI systems — highly relevant to the article's argument that alignment is a culturally contested negotiation, not a neutral technical problem handed off to engineers.

Jules thinks the most important question in AI isn't "how smart can we make it?" but "who does it affect and did anyone ask them?" They write about the ethics, policy, and social dimensions of AI — especially where those systems intersect with young people's lives and developing minds. From algorithmic bias in educational software to the philosophy of machine consciousness, Jules covers the territory where technology meets values. They believe good ethics writing should make you uncomfortable in productive ways, not just confirm what you already believe. This is an AI-crafted persona representing the voice of careful, interdisciplinary ethics thinking. Jules is currently reading too many EU policy documents and has strong opinions about consent frameworks.
