Can Language Models Learn to Doubt Themselves?
Meta’s researchers are rethinking how models reason—not by making them bigger, but by letting them keep multiple answers alive before choosing one.
Counting the R’s in Strawberry
Last year, a viral clip showed ChatGPT stumbling on a seemingly easy task: counting the R’s in the word ‘strawberry.’ It confidently declared there were two instead of three.
It’s a neat example of how brittle these large language model systems can be.
The reason for this big old whoops-a-daisy is mostly technical: LLMs don’t “see” text at the level of individual letters. They process it as tokens, chunks of text like “straw” and “berry.” And because of this, they can stumble on tasks that require strict letter-by-letter reasoning.
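To make that concrete, here’s a toy sketch in Python. The token split is an assumption for illustration only; real tokenizers differ by model and may not break “strawberry” exactly this way.

```python
# Toy illustration of why letter-level tasks trip up token-based models.
# The split below is hypothetical; real BPE-style tokenizers vary by model.

word = "strawberry"

# What we ask the model to reason about: individual letters.
letter_counts = {ch: word.count(ch) for ch in set(word)}
print(letter_counts["r"])  # 3 -- trivial when you can see each character

# Roughly what the model actually "sees": opaque chunks, not characters.
hypothetical_tokens = ["straw", "berry"]  # assumed split, not real tokenizer output
print(hypothetical_tokens)
# Nothing in these chunks exposes the count of 'r' directly; the model has to
# recall or reconstruct the spelling, and that's where it slips.
```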
If asked to spell the word out step by step, the model sometimes skipped or double-counted a letter, then carried that mistake through to the final answer, which meant every step thereafter was doomed.
The model kept plodding forward, confident but wrong. (How very human of it!)
Researchers at Meta are exploring an alternative. Rather than locking models into that rigid march, they experimented with the idea of keeping multiple options alive at once: different counts of R’s, parallel reasoning paths, even a few half-mistaken ones.
At a fundamental level, this fuzzier process doesn’t force the model to bet everything on a single sequence. It’s closer to sketching in pencil instead of ink.
The lines stay tentative at first, open to revision, before darkening into a final answer.

The Crowd, Not the Lone Singer
In their research paper, Soft Tokens, Hard Truths: Continuous Relaxations for Sampling, Training and Inference in Reasoning Models, the researchers introduce the concept of soft tokens: probability-weighted mixtures of token embeddings, with added noise, as a way to make chain-of-thought reasoning more flexible during training.
This allows reinforcement learning without curated human reasoning traces and, in theory and at least somewhat in practice, improves the robustness and diversity of model outputs.
In plain language: Don’t make the model lock in one hard choice at every step. Instead, let it hold several possibilities in play.
Keep it soft.
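Here is roughly what that looks like in code: a minimal sketch of my reading of the idea, not the paper’s implementation. The sizes, the helper names, and the noise scale are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 1_000, 64  # toy sizes, nothing like a real model
embedding_table = rng.standard_normal((vocab_size, embed_dim))  # stand-in for learned embeddings

def hard_step(logits: np.ndarray) -> np.ndarray:
    """Ordinary decoding: commit to one token and feed its embedding forward."""
    token_id = int(np.argmax(logits))
    return embedding_table[token_id]

def soft_step(logits: np.ndarray, noise_scale: float = 0.1) -> np.ndarray:
    """Soft-token step: a probability-weighted blend of every embedding, plus a little noise."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax over the whole vocabulary
    blended = probs @ embedding_table                      # mixture instead of one hard choice
    noise = noise_scale * rng.standard_normal(embed_dim)   # noise_scale is an assumption
    return blended + noise

logits = rng.standard_normal(vocab_size)
print(hard_step(logits).shape, soft_step(logits).shape)    # both (64,)
```

The blended vector keeps every candidate token alive in proportion to the model’s confidence, which is what lets several reasoning paths coexist instead of one being locked in at each step.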
Think of it less like a recital hall and more like a karaoke bar: messy voices overlapping, some off-key, some right on pitch, but together closer to the truth than any lone voice. Each voice carries a probability, sometimes faint, sometimes strong; heard together, perhaps a richer sound.
For the model, letting these voices overlap during training keeps more reasoning paths alive instead of collapsing too early into one brittle line. The researchers stir in a little randomness: noise that nudges the model off the obvious track. The effect looks less like rote memorization and more like practice: messy, full of detours, the way a student circles through wrong turns before landing on the right method.
Most crucially, this doesn’t require step-by-step “gold standard” reasoning traces from humans. The model just checks its work against the final answer—right, wrong, or somewhere in between—and adjusts.
No need to hand it an instruction manual for how to think. That sets it apart from the trend of bolting on fixes: external tools, massive databases, agent scaffolding. Those patches help, but they live outside the model’s core. The Meta team’s approach works from the inside out. If it holds, it could mean sturdier reasoning without needing so much duct tape around the edges.
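Stripped down to a sketch, that reward signal is almost embarrassingly simple. The loop below is a hypothetical shape, not the paper’s training recipe, and the helper names are placeholders.

```python
def outcome_reward(model_answer: str, correct_answer: str) -> float:
    """Score only the final answer; no gold reasoning trace is ever consulted."""
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0

# Hypothetical shape of the training loop (generate_answer and update_model
# are placeholders, not a real API):
# for question, correct in dataset:
#     answer = generate_answer(model, question)  # reasoning happens internally, soft steps and all
#     reward = outcome_reward(answer, correct)   # this toy version is binary; real rewards can be softer
#     update_model(model, reward)                # reinforce whatever got the answer right
```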
Lottery Tickets and Pencil Marks
In Meta’s research, the results weren’t dramatic at first glance. On the simplest test (solve a problem in one shot), the soft-trained models performed about the same as the hard-trained ones.
No fireworks, no obvious gap.
The difference showed up when the researchers sampled multiple answers from the same model. Here, the soft-trained versions pulled ahead. With more reasoning paths “living” inside, they had a better chance of hitting the right solution across a batch of tries. You can think of it less like a single bet and more like buying a handful of lottery tickets…each run another chance to land on the winning number.
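The lottery-ticket intuition is easy to feel with a toy calculation; the per-try success rate below is invented purely for illustration.

```python
def at_least_one_hit(p_single: float, k: int) -> float:
    """Probability that at least one of k independent tries lands on the right answer."""
    return 1.0 - (1.0 - p_single) ** k

# Made-up numbers: even a modest per-try rate climbs quickly with more draws.
print(round(at_least_one_hit(0.30, 1), 3))   # 0.3
print(round(at_least_one_hit(0.30, 4), 3))   # ~0.76
print(round(at_least_one_hit(0.30, 16), 3))  # ~0.997
```

The suggestion from the paper’s results is that soft-trained models, holding more distinct reasoning paths, get more genuinely different tickets out of each draw.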
Soft training also made the models less brittle. The hard-trained ones tended to overfit: too much swagger on familiar ground, too much stumbling on anything new. The softer approach kept more of the model’s balance, leaving it steadier when the problems shifted.
And an intriguing twist: the best recipe wasn’t pure softness. It was a hybrid. Let the model sketch in pencil during training, but make it switch back to ink for the final answer.
Messy drafts in practice, clean work on the exam.
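In code terms, and reusing the hypothetical soft_step and hard_step stand-ins from the earlier sketch, the hybrid recipe is little more than a mode switch.

```python
def reasoning_step(logits, training: bool):
    # Pencil while learning, ink when it counts.
    # soft_step and hard_step are the hypothetical helpers sketched earlier.
    return soft_step(logits) if training else hard_step(logits)
```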
A line often attributed to Ernest Hemingway says it another way: “Write drunk, edit sober.”
Reasoning in Drafts
If this method proves itself beyond math, the implications widen fast.
Picture a reasoning assistant that doesn’t just hand you one neat chain of logic but sketches several, then shows you the strongest. A model that, before giving you a financial forecast, a grant proposal, or an engineering calculation, has already worked through a handful of drafts internally.
You’d only see the polished version, but behind it would be a messier rehearsal. False starts, alternate routes, a quieter study group arguing until they converge.
Writers and artists work the same way. Back to Hemingway, for instance, who is said to have drafted forty-seven different endings to A Farewell to Arms. Lots of painters sketch in charcoal before laying down oil. The draft is where the hard thinking happens; the final version is just the surface.
Soft tokens echo that same rehearsal–performance split.
In theory, this research could also mean sturdier general-purpose models. Right now a small change in how you phrase a question (a flipped clause, a reordered fact) can knock systems off balance. But a model trained to keep more options alive may bend where others snap.
That would lessen the need for expensive interventions: no piles of annotated human reasoning, no exotic infrastructure. Soft tokening slides into existing training pipelines.
When the World Isn’t Algebra
The promise is real, but so are the caveats.
Most of the experiments so far live in the fairly clean world of math. (Note: I am no mathematician, so don’t trust any claims I make about math here!) The point, though, is that algebra makes a relatively tidy testbed: the answers are unambiguous and scoring is instant.
But of course the world outside of equations is rarely so clean. Does soft-token training help with writing code, drafting policy, or guiding a conversation? That’s still unknown.
There’s also the reward problem. In math, correctness is binary: three R’s in “strawberry,” not two. Pretty darn clear. But in fuzzier domains, what counts as a “good” answer is murkier.
How persuasive should an argument be? How useful is a business plan? And how to quantify?
Without reliable feedback, the benefits of soft reasoning may fade.
There’s the cost part, too. Even if the overhead is lighter than other reinforcement-learning setups, training still eats up days of high-end hardware. That probably keeps it out of reach for most smaller labs and startups.
So for now, it’s not plug-and-play. It’s more of a promising sketch than a ready-made blueprint.
Beyond the Math Trick
To be useful, soft tokening must also find its place alongside the field’s other obsessions: retrieval systems, tool use, agent frameworks. Most of the energy lately has gone into bolting on new scaffolding, but soft reasoning by its nature works deeper down, inside the model’s own logic. The two approaches aren’t rivals so much as complements. A sturdier core paired with sharper tools could change what these systems are capable of.
It also differs from the prevailing “human-in-the-loop” fixes. Anthropic’s Constitutional AI, for example, still leans on human judgment: people write the rules the model is graded against. Soft-token training would sidestep at least some of that dependence. Instead of teaching the model to mimic our step-by-step logic, it rewards the paths that land on the right result.
There’s a larger question hanging overhead, too. What does it even mean to say a model “reasons”?
When we use that word, we usually mean more than just getting the right answer, don’t we? Reasoning implies a kind of flexibility: weighing alternatives, noticing when a step doesn’t add up, circling back to repair a mistake. Human reasoning has detours, blind alleys, moments of hesitation. Machines don’t really have that. Not yet, at least. They stack tokens, one after another, and we project the shape of “thought” onto the output.
Soft-token training won’t resolve this philosophical knot. But it does suggest something practical: maybe reliability won’t come from forcing machines to mimic our cleanest, most linear explanations.
Maybe it will come from letting them linger in uncertainty a little longer, testing parallel paths the way people do unconsciously, before they settle on an answer.
In the Gray Space
“Soft tokens” are less about flash or scale and more about loosening the rigidity that makes today’s systems brittle. Training a model to keep several possibilities alive at once is closer to how people work through problems—drafting, revising, circling back—before arriving at something solid.
The broader point, I think, is simple: progress may come not from bolting on new features but from rethinking the rhythm of thought itself. A system that can pause, weigh alternatives, and revise has a better shot at reliability than one that charges straight ahead.
If that holds, the future of reasoning machines won’t be found in clean lines or airtight explanations. It will come from the uncertain middle ground, where paths overlap and disappear, mistakes surface, and the right answer emerges only after the wrong ones have had their say.
Core Concepts is sponsored by North Light AI. At North Light, we help organizations move past the hype and put AI to work in ways that are practical, defensible, and human-centered. Whether you’re building governance frameworks, training your team, or exploring new products, our focus is on clarity and impact—not complexity for its own sake.
If you’re ready to move from experimenting with AI to using it with confidence, we’d love to talk. Reach out to start a conversation about how North Light AI can support your goals.