The jagged shape of machine intelligence

There is a tempting move to make here, which is to call this a contradiction - to declare that because the system fails at things that look simple, it can't really be doing the complicated thing well either. That move is wrong, and it costs teams a lot of money. The system can absolutely draft passable strategy. It can also get 9.11 versus 9.9 wrong. Neither of those facts is inconsistent with the other, and where human intuition leads us astray is in assuming they ought to be.
Why the gap exists
Human competence is roughly monotonic in difficulty. Things we think of as hard - proving theorems, running companies, writing novels - are hard for most humans. Things we think of as simple - counting, basic arithmetic, reading off a scale - are easy for most humans. Our intuition about intelligence is built from a lifetime of observing that the people who can do the hard things rarely fail at the easy ones.
Pattern-matching engines do not have this property. Their capability surface is determined by training distribution, tokenisation, and the sampling strategy at inference time. The model is good at what shows up richly in its training data and what its tokeniser can represent cleanly. It is bad at what doesn't. 'Hard' and 'simple' are human categories projected onto a system that doesn't share them.
The decimal comparison example is a nice illustration. 9.11 versus 9.9 is not hard. The usual explanation is that the numbers are tokenised in a way that makes the comparison happen over an unhelpful subword boundary, and that the model's prior for the sequence '9.11' is coloured by calendar dates and version numbers, contexts in which 9.11 really does come after 9.9. There is nothing about the difficulty of the task that explains the failure. The failure is explained by the system's representation layer, which does not know that it is doing arithmetic.
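To make the two readings concrete, here is a minimal Python sketch. It illustrates the ambiguity in the text itself, not what happens inside any particular model: the same pair of strings orders one way when read as decimal numbers and the other way when read as dot-separated components, as in version numbers. The helper names are invented for this example.

```python
from decimal import Decimal


def larger_as_number(a: str, b: str) -> str:
    """Return the larger string when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return the larger string when each is read as dot-separated integers,
    the way software versions or section numbers are usually ordered."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_number("9.11", "9.9"))   # prints 9.9
print(larger_as_version("9.11", "9.9"))  # prints 9.11
```

Both answers are correct under their own reading. A system that has seen far more version-style text than arithmetic has a reason to lean towards the second.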
The grain metaphor
The most useful framing I've found for explaining this to non-technical audiences is the grain of the wood. A plane moving along the grain cuts cleanly. The same plane cutting across the grain tears and splinters. The wood is not inconsistent. It has a structure, and the structure rewards work in some directions more than others. A skilled joiner plans the cut to follow the grain. They don't conclude, from a failed cross-cut, that the wood is stupid.
Current AI systems have a grain. The grain follows the training distribution and the model's architectural biases. Working with the grain, they are remarkable. Working against it, they fail predictably, often at tasks that seem elementary. The craft of using these systems well is largely the craft of recognising which direction the grain runs for any given problem, and routing your work accordingly.
What to do about it
The concrete operational implication is that if you're building something on top of a model, you need to map the grain before you ship. That means: take your top ten use cases, run them through the model a hundred times each, and note which ones it fails at, how it fails, and whether the failure has a pattern. That's not optional work - it's the work. The product managers and engineers who skip this step are the ones who end up building features that demo beautifully and degrade in production.
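A minimal sketch of that mapping exercise, in Python, might look like the following. Everything named here is an assumption made for illustration: `call_model` stands in for whatever client you actually use, each use case is reduced to a prompt plus a checker function, and `make_fake_model` is a deterministic stub that exists only so the sketch runs without network access.

```python
from collections import Counter
from typing import Callable


def map_grain(
    cases: dict[str, tuple[str, Callable[[str], bool]]],
    call_model: Callable[[str], str],
    runs: int = 100,
) -> dict[str, dict]:
    """Run each use case `runs` times and summarise how often and how it fails.

    `cases` maps a use-case name to (prompt, checker). The checker returns
    True when an output is acceptable.
    """
    report = {}
    for name, (prompt, is_ok) in cases.items():
        failures = Counter()
        for _ in range(runs):
            output = call_model(prompt)
            if not is_ok(output):
                # Group identical wrong answers so patterns become visible.
                failures[output.strip()] += 1
        report[name] = {
            "failure_rate": sum(failures.values()) / runs,
            "top_failures": failures.most_common(3),
        }
    return report


def make_fake_model() -> Callable[[str], str]:
    """Hypothetical stub: answers wrongly on every fourth call."""
    calls = {"n": 0}

    def fake_model(prompt: str) -> str:
        calls["n"] += 1
        return "9.11" if calls["n"] % 4 == 0 else "9.9"

    return fake_model


cases = {
    "decimal comparison": (
        "Which is larger, 9.11 or 9.9? Answer with the number only.",
        lambda out: out.strip() == "9.9",
    ),
}
report = map_grain(cases, make_fake_model(), runs=100)
print(report)
```

The useful output is less the failure rate than the `top_failures` list: a task that fails the same way every time has a pattern you can design around, while one that fails differently each run points at sampling.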
For tasks the model is bad at, you have two options. You can route around them with tool-use - let the model delegate arithmetic to a calculator, counting to a script, factual lookup to a database. This works. It is also not what 'let the LLM do it' instincts will lead to, so it has to be an explicit design choice. The second option is to accept that the task is outside the model's grain and route it to a human or a rule-based system. Neither of these is an indictment of the model. Both are applications of the grain principle.
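As a rough illustration of the first option, the sketch below routes one narrow class of question, "which of these two numbers is larger", to deterministic code and sends everything else to the model. It is a deliberately simplified pre-router, not a tool-calling integration: in practice you might instead expose the calculator as a tool and let the model decide when to call it. The `answer` function, the regular expression, and the `call_model` parameter are all assumptions made for this example.

```python
import re
from decimal import Decimal
from typing import Callable

# Matches questions such as "Which is larger, 9.11 or 9.9?"
_COMPARE = re.compile(
    r"(?:larger|bigger|greater).*?"
    r"(-?\d+(?:\.\d+)?)\s*(?:or|vs\.?|versus)\s*(-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def answer(question: str, call_model: Callable[[str], str]) -> str:
    """Answer numeric comparisons with exact arithmetic; delegate the rest."""
    match = _COMPARE.search(question)
    if match:
        a, b = match.group(1), match.group(2)
        # Decimal avoids binary floating-point surprises on inputs like 0.1.
        return a if Decimal(a) > Decimal(b) else b
    return call_model(question)


def stub_model(question: str) -> str:
    """Hypothetical placeholder for a real model call."""
    return "[model output]"


print(answer("Which is larger, 9.11 or 9.9?", stub_model))
print(answer("Draft a one-page launch strategy.", stub_model))
```

The design point is that the routing decision lives in your code, where it can be tested, rather than in a prompt that asks the model to be careful.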
What changes over time
The grain shifts with each major model release. The specific failure modes that were characteristic of a model a year ago are not necessarily the ones of the model today. This has two practical consequences. The first is that any static mental model of what the system can and can't do will be wrong within twelve months, and the teaching material you write will be partly obsolete by the time you deliver it. The second is that the structure of the failures is more durable than the specific instances - teaching 'what failure modes look like' is more stable than teaching 'what fails'.
The meta-skill is calibration: learning to update, roughly monthly, your sense of where the model's grain runs. Teams that do this find the next release is an upgrade. Teams that don't do it discover, six months later, that half their product design is now unnecessary and the other half needs rewriting because the model can now do what they'd written around.
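One lightweight way to make that update routine, sketched here under assumptions, is to keep per-task failure rates from each evaluation run and compare them whenever the model changes. The function name, the 5-point threshold, and the numbers in the example are all invented for illustration.

```python
def grain_shift(
    old: dict[str, float],
    new: dict[str, float],
    threshold: float = 0.05,
) -> dict[str, list[str]]:
    """Classify tasks by how their failure rate moved between two runs.

    Only tasks present in both runs are compared. A change smaller than
    `threshold` in either direction counts as unchanged.
    """
    shift = {"improved": [], "regressed": [], "unchanged": []}
    for task in sorted(old.keys() & new.keys()):
        delta = new[task] - old[task]
        if delta <= -threshold:
            shift["improved"].append(task)
        elif delta >= threshold:
            shift["regressed"].append(task)
        else:
            shift["unchanged"].append(task)
    return shift


# Made-up failure rates for two hypothetical model releases.
previous = {"decimal comparison": 0.25, "letter counting": 0.40, "summarisation": 0.02}
current = {"decimal comparison": 0.01, "letter counting": 0.50, "summarisation": 0.02}

shift = grain_shift(previous, current)
print(shift)
```

Tasks in the improved list are candidates for removing workarounds. Tasks in the regressed list are the ones that would otherwise surface in production.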
A practical filter
If you want one filter to take away: any time someone says 'AI is stupid because it can't do X,' treat it as useful information about the representation, not about the capability. If X is a well-defined, human-simple task the model fails at, the failure is probably tokenisation, training distribution, or sampling. None of those are 'stupidity'. All of them are fixable through routing, tool-use, or a different model. The correct response is calibration and engineering, not disappointment.
The correct response to the opposite observation is the same. When the model does something surprising and impressive, try to understand why. Was it because the task is well-represented in the training data? Because the reasoning was straightforward enough to follow familiar patterns? Because the tokenisation was kind? Understanding the success is as important as understanding the failure, and the shape of both tells you where to invest your next quarter.


