The industry has spent two years racing for the largest possible context window and the most opaque possible reasoning chain. There is an argument, unfashionable but I think increasingly correct, that the next wave of useful work will come from the opposite direction: agents small enough to reason about, narrow enough to trust, and legible enough that their mistakes can be explained rather than merely apologized for.
I want to be precise about what I mean by legibility. A legible agent is one whose decisions you can reconstruct after the fact without needing a second, larger model to interpret the first. Its reasoning fits in your head. Not because the problem is simple — the problems are often remarkably hard — but because the agent's approach to the problem has been deliberately constrained until it becomes tractable to a human reviewer. This is a design choice, not a limitation. It is, I would argue, the most important design choice we are currently failing to make.
Why legibility builds trust
Trust in software has always been a function of predictability. We trust a compiler because we can read its error messages, trace its optimizations, and build a mental model of what it will do with our code before we run it. We trust version control because the diff is right there. The entire history of reliable software is a history of making the machine's decisions visible to the humans who depend on them.
Large, opaque agents break this contract. When a model with a 200,000-token context window synthesizes an answer from a hundred documents, no human can verify the reasoning path. You can check the output against your expectations, but that is testing, not trust. Trust means you understand why it gave the answer it gave, and you can predict what it would do in a case you have not yet seen. That requires a model whose reasoning you can actually follow.
Small context windows force better design
There is a paradox at work in the race for longer context. The more tokens a model can hold, the less pressure there is to structure the information it receives. When you can dump an entire codebase into a prompt, you stop thinking about which parts of the codebase are actually relevant. The context window becomes a substitute for design.
Small agents cannot afford this luxury. When your context budget is tight, you are forced to build proper retrieval, to index your knowledge, to decompose your problems into pieces that each fit in a single, reviewable prompt. These are not compromises. They are engineering disciplines, and they produce systems that are faster, cheaper, more testable, and — crucially — systems whose failure modes a human can diagnose without a research team.
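To make the discipline of a tight context budget concrete, here is a minimal sketch of what it might look like in practice. Everything in it is an illustrative assumption rather than a reference to any particular framework: the `Chunk` type, the relevance scores (imagined as coming from some upstream retriever), and the crude word-count stand-in for a real tokenizer. The point is only that a budget forces an explicit, reviewable decision about what goes into the prompt and what stays out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str   # where the text came from, kept so a reviewer can trace it
    text: str
    score: float  # relevance score from a hypothetical upstream retriever


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def select_within_budget(chunks, budget):
    """Greedily keep the highest-scoring chunks that fit; return (kept, dropped)."""
    kept, dropped, used = [], [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)
    return kept, dropped


chunks = [
    Chunk("auth.py", "def login(user): check password and issue token", 0.9),
    Chunk("README", "project overview " * 20, 0.4),
    Chunk("session.py", "def refresh(token): extend session expiry", 0.7),
]
kept, dropped = select_within_budget(chunks, budget=20)

print("kept:   ", [c.source for c in kept])
print("dropped:", [c.source for c in dropped])
```

The list of dropped chunks matters as much as the list of kept ones: it is a record of what the agent was never shown, which is often the first thing a reviewer needs when diagnosing a wrong answer.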
Mistakes worth making
Every agent makes mistakes. The question is not whether your agent will be wrong, but what happens when it is. A large opaque model fails mysteriously: the answer is wrong, the reasoning is inaccessible, and your only recourse is to re-run the query and hope for a different roll of the dice. A small legible model fails transparently: you can see which premise it relied on, which retrieval it used, and where the chain of reasoning broke. You can fix it. You can prevent the same mistake from happening again. You can write a test.
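One way to picture a transparent failure is as a recorded trace of premises and the evidence behind each. The sketch below is hypothetical throughout: the `Step` and `Trace` types, the `supported` flag, and the example question are invented for illustration, and a real system would need some actual mechanism for deciding whether evidence backs a premise. What it shows is the shape of the idea: once each step is written down, finding where the chain broke becomes a lookup, and the lookup becomes a test.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    premise: str     # what the agent assumed at this step
    evidence: str    # which retrieved source it cited ("" if none)
    supported: bool  # whether the evidence actually backs the premise


@dataclass
class Trace:
    question: str
    steps: list = field(default_factory=list)

    def record(self, premise, evidence, supported):
        self.steps.append(Step(premise, evidence, supported))

    def first_break(self):
        """Return (index, step) for the first unsupported step, or None."""
        for index, step in enumerate(self.steps):
            if not step.supported:
                return index, step
        return None


trace = Trace("Is the session token still valid?")
trace.record("Tokens expire after 30 minutes", "session.py", True)
trace.record("The token was issued 10 minutes ago", "", False)
trace.record("Therefore the token is valid", "", True)

broken = trace.first_break()
if broken is not None:
    index, step = broken
    print(f"chain broke at step {index}: {step.premise!r} has no supporting evidence")
```

Here the conclusion may or may not be right, but the trace shows that it rests on a premise nobody retrieved evidence for. That is a fixable defect, and the assertion that catches it can stay in the test suite.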
An explained mistake is an investment. An unexplained mistake is just a cost. If we want agents that improve over time — not through retraining, but through the accumulated engineering effort of the people who use them — we need agents whose failures teach us something. That means agents small enough to understand, narrow enough to test, and legible enough that when they break, the breakage tells a story we can act on.
The race to the ceiling is exciting. But the real work — the work that compounds — is happening closer to the floor, where the agents are small, the context is tight, and every decision has to earn its place in the window.