If AI is so smart, why can’t it count the R’s in Strawberry or Blueberry?

An AI-generated image of the number of “R”s in Strawberry

Every time a new version of AI comes out, there’s a lot of chatter about how our AI assistants fail to properly count the Rs in words like “Strawberry” or “Blueberry.” Another viral #AIFail example. Is this a bug – or just a misunderstanding of how AI works?

My hypothesis was that this is a challenging problem for generative AI because of “AI Tokens” (AI doesn’t actually use “words” as we humans know them). When tokenization happens, a word can be broken into fragments, the letters counted per fragment, and the count can come out wrong when the parts are aggregated.

That’s what I thought might be happening, but to double-check, I asked ChatGPT 5.0 pro to help me better understand. Here’s what I learned.

Hey ChatGPT: If AI is so intelligent, why can’t it count the R’s in Strawberry or Blueberry?

Here’s what’s going on…

When a model fumbles questions like “How many r’s are in ‘blueberry’?” here’s what’s happening:

Tokenization of word-fragments hides character boundaries

Modern LLMs read and write TOKENS (common chunks of text), not single characters. A word like blueberry is often split into pieces like “blue” + “berry”. It could also be “blueb” + “erry” (it’s not always at the syllable break). To count letters, the model would need to “look inside” each token and do character-level bookkeeping. That’s doable in principle, but the model isn’t explicitly trained to do it, so it’s fragile.
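A toy sketch makes the problem concrete. The token split and ID numbers below are made up for illustration (real BPE tokenizers vary), but the point holds: the model works with opaque token IDs, and no “r” is visible in them without character-level bookkeeping.

```python
# Toy illustration: a model sees token IDs, not letters.
# This vocabulary and split are hypothetical; real tokenizers differ.
vocab = {"blue": 4172, "berry": 8297}
tokens = ["blue", "berry"]

token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [4172, 8297] -- no "r" visible anywhere in the IDs

# Counting r's requires looking back inside each token's text:
r_count = sum(t.count("r") for t in tokens)
print(r_count)  # 2
```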

AI’s training objective doesn’t reward exact counting

LLMs are optimized to predict THE NEXT TOKEN THAT SOUNDS RIGHT, not to execute deterministic algorithms. “How many r’s?” is an algorithmic task (scan → tally → report). Without explicit supervision or a tool to do the counting, the model often relies on pattern-based guesses instead of systematic counting.
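To see why this is an algorithmic task rather than a prediction task, here is the scan → tally → report procedure written out explicitly (a trivial sketch; the function name is my own):

```python
def tally_r(word: str) -> int:
    """The deterministic algorithm the question actually asks for."""
    count = 0
    for ch in word:          # scan each character
        if ch.lower() == "r":
            count += 1       # tally
    return count             # report

print(tally_r("strawberry"))  # 3
```

Nothing in next-token training forces the model to run a loop like this; it just has to emit a numeral that sounds plausible.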

Compression and heuristics blur repeated letters

Transformers compress information across layers. Repeated letters like the “rr” in berry can get represented as a general “double-letter” feature rather than two distinct, indexable characters. When asked for a number, the model may retrieve the concept “double r in berry” and answer “2,” overlooking that strawberry also has an additional “r” earlier—leading to the classic “2 instead of 3” mistake.

Decoding randomness and overconfident priors

Even with the right internal signals, sampling (temperature/top‑p) can push the model toward a plausible but wrong numeral. Lowering the temperature helps, but it doesn’t fix the underlying reasoning issue.

Aggregation isn’t procedural

My original hypothesis imagined a two-step “count per token, then add.” Models don’t actually run explicit loops like that; they emit a single answer in a single attempt. So even if a token like “berry” strongly associates with “has two r’s,” there’s no guaranteed “+1 from ‘straw’” step unless the prompt forces character-by-character reasoning.
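For contrast, here is what that explicit two-step procedure would look like if something actually executed it (the split of “strawberry” into “straw” + “berry” is a hypothetical example):

```python
# The explicit "count per token, then add" loop that LLMs do NOT run.
# The token split here is hypothetical.
tokens = ["straw", "berry"]

per_token_counts = {t: t.count("r") for t in tokens}
print(per_token_counts)  # {'straw': 1, 'berry': 2}

total = sum(per_token_counts.values())
print(total)  # 3 -- the "+1 from 'straw'" step made explicit
```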

Here’s how to get reliable answers anyway

  • Force character-level work: Ask the model to write the word with indexed characters (e.g., “1:B, 2:l, …”) and then total the r’s. Making it enumerate reduces the chance it will shortcut.

  • Use a tool for the counting: In production, route “count letters/digits/symbols” to a tiny deterministic function and have the model call it. That eliminates the probabilistic failure mode entirely.

  • Make the output structured: Request a JSON map like {"r": 2} so the model focuses on the specific operation.
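Putting the last two tips together, here is a minimal sketch of what the deterministic counting tool and its structured output might look like (the function name and JSON schema are my own illustration, not a standard API):

```python
import json
from collections import Counter

def count_letters(word: str) -> dict[str, int]:
    """Deterministic letter tally -- the tiny tool the model would call
    instead of guessing."""
    return dict(Counter(word.lower()))

counts = count_letters("strawberry")
print(json.dumps({"r": counts.get("r", 0)}))  # {"r": 3}
```

Because the tally is plain code, it has no probabilistic failure mode: the same input always yields the same count.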

Bottom line: tokenization is a big reason these slip-ups happen, but the deeper cause is that LLMs are PATTERN LEARNERS, not algorithm executors.

Unless you force a character-level procedure or hand the task to a simple tool, they’ll sometimes give a fluent, confident, and wrong number.
