Documentation-first programming

Alasdair Allan

19 May 2026

I find it somewhat ironic that the part of the job that most developers hate is about to become the whole job. Because, at least as far as I can figure out, we are now in the era of documentation-first programming.

Act I: The specification is the program

A few months ago I started building a programming language. Designed for machines to write, not humans. Every function has a mandatory contract specifying what it requires, what it guarantees, what effects it performs. The compiler proves these contracts are satisfied using an SMT solver, or it tells the model exactly what’s wrong and how to fix it. There are no variable names. There are no style choices. There is one canonical way to write any given program.

The interesting thing about this isn’t the language itself. It’s what the design forces you to confront. If an AI is writing the code, what exactly is the human writing?

The answer, it turns out, is documentation. Specifications. Contracts. Requirements. The things that tell the machine what to build and allow the next model to come along, or the next human, to understand what was built and why. The contracts are the specification, and the specification drives both generation and verification. The code is a byproduct. The specification is the program.

Despite my background, I didn’t set out to make a point about documentation here. I set out to build a language where models make fewer mistakes, and the initial results suggest I might be on to something. But the design kept pushing me toward the same conclusion: the only durable human artefact in AI-assisted development is the specification. Everything else is generated, and everything generated is disposable.

This isn’t just a personal observation. Formal verification, the practice of mathematically proving that software does what its specification says, is having a moment. Perhaps the moment.

Historically, formal software verification hasn’t worked. The seL4 microkernel, at just 8,700 lines of C, required twenty person-years and 200,000 lines of proof code. There are 23 lines of proof for every line of implementation. Nobody was going to do that at scale.

But late last year Martin Kleppmann predicted that things were converging to make formal verification mainstream for the first time: AI-generated code desperately needs verification if it is to skip human review, and at the same time AI is making verification dramatically cheaper, because models can now write proof scripts. Kleppmann sees verification’s mathematical precision as the natural counterweight to LLMs’ probabilistic nature.

There’s a new term for it, “vericoding.” Coined in a paper by Bursuc, Tegmark et al. and referenced by Kleppmann, it describes using LLMs to generate formally verified code, and is being positioned explicitly as the opposite of vibe coding.

But here’s where it gets interesting. Ask anyone working in formal verification what the bottleneck is, and you get the same answer. It’s not code generation, it’s the specification writing. Humans still need to articulate what code should do. The proofs can be automated, the code can be generated, but someone has to write down what “correct” means.

The specification is the program. The rest is machinery. The obvious counterargument, of course, is that a sufficiently detailed specification is just code, written in English.

Gabriella Gonzalez made this point sharply back in March, dismantling the OpenAI Symphony project, whose “specification” turned out to be pseudocode in markdown form. It was made up of a collection of prose dumps, database schemas, algorithms, and sections explicitly added to babysit the model’s code generation, sitting alongside literal code. The specification was a sixth the size of the Elixir implementation.

Gonzalez argued that if you try to make a specification precise enough to reliably generate a working implementation, you must necessarily contort the document into code or something strongly resembling code.

She invoked Dijkstra, who warned decades ago that switching to communication in natural language between human and machine would not simplify the human’s life, alongside Borges and his one-paragraph story “On Exactitude in Science,” about an empire whose cartographers produced a map so detailed it was the same size as the territory it mapped.

A specification that is a one-to-one representation of the code is not a specification. It’s just code with worse tooling.

Fred Brooks made the same underlying argument in “No Silver Bullet,” all the way back in 1986. He argued that “the hardest single part of building a software system is deciding precisely what to build. No other part of the conceptual work is as difficult as establishing the detailed technical requirements.” Also: “the hardest part of the software task is arriving at a complete and consistent specification, and much of the essence of building a program is in fact the debugging of the specification.”

Brooks’s distinction between essential and accidental complexity is the intellectual foundation here; the essential difficulty of software is the specification, not the coding. Every tool that reduces accidental complexity – high-level languages, compilers, AI code generation – just makes the essential complexity more visible.

These are serious arguments from serious people. And they’re right, if you think the specification needs to be detailed enough to mechanically produce a unique implementation. But that’s not what I’m arguing.

The specification doesn’t need to be precise enough to generate code on the first pass. It needs to be precise enough to verify that generated code is correct, and to provide context for the next modification. That’s meaningfully different.

In Vera, the language I’ve been building, contracts don’t describe implementations; they describe constraints. A specification that says that a function must return a sorted list, and must not perform any IO, is not code. It’s a checkable claim about that code. The gap between those two things, between implementation and verification, is exactly where a human adds value.
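To make the difference between a constraint and an implementation concrete, here is a minimal sketch in Python rather than Vera. Everything in it is an assumption for illustration: the names `satisfies_sort_contract`, `check`, `generated_ok` and `generated_bad` are hypothetical, the contract is checked at runtime on sample inputs rather than proved by an SMT solver, and the "no IO" effect clause is not modelled at all.

```python
from collections import Counter
from typing import Callable, List


def satisfies_sort_contract(inp: List[int], out: List[int]) -> bool:
    """Postcondition: `out` is in order and is a permutation of `inp`."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_items = Counter(inp) == Counter(out)
    return ordered and same_items


def check(candidate: Callable[[List[int]], List[int]], cases: List[List[int]]) -> bool:
    """Run a candidate implementation against the contract on sample inputs."""
    return all(satisfies_sort_contract(case, candidate(list(case))) for case in cases)


def generated_ok(xs: List[int]) -> List[int]:
    # Stand-in for a model-generated implementation.
    return sorted(xs)


def generated_bad(xs: List[int]) -> List[int]:
    # Returns an ordered list, but silently drops duplicates.
    return sorted(set(xs))


CASES = [[], [3, 1, 2], [2, 2, 1]]
print(check(generated_ok, CASES))   # True
print(check(generated_bad, CASES))  # False
```

The contract says nothing about which sorting algorithm to use. It only states what must be true of the result, which is why it can reject the second candidate without knowing how either one was written.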

Documentation-first doesn’t mean creating specifications is easy. It just means that specification is the hard part. It always was. We just used to do it in our heads, and call it “experience.”

Unfortunately, experience only lives in one head at a time.

Act II: The day-to-day

Stripe surveyed two thousand professionals and found developers spend 17.3 hours per week on maintenance: that’s 42% of their working time. Software.com tracked a quarter of a million developers and found that on average they actively write code for just 52 minutes per day, with only 10% of developers coding for more than two hours a day. Sonar and Tidelift found less than a third of developer time goes to writing new code, with another third going to managing what already exists. Stack Overflow’s 2024 survey found technical debt was developers’ number one frustration, and that they spend more than 30 minutes daily just searching for answers.

The part of the job most developers hate is already most of the job. They just don’t call it documentation.

Instead they call it reading the code, and figuring out why something works a particular way; or, a lot of the time, tracking down who changed what and when. All of which is documentation work, or rather, work that would be unnecessary if the documentation existed and was reliably current.

Now add the AI which is increasingly writing the code. A quarter of Y Combinator’s Winter 2025 batch had codebases that were 95% AI-generated. The vibe-coding movement, Andrej Karpathy’s term for fully delegating to the model and “forgetting that the code even exists,” has gone from a novelty neologism to Collins Dictionary’s Word of the Year.

But despite that, it seems that engineers might still be in demand. The numbers are striking. There are over 67,000 open engineering roles at tech companies globally right now, the most in three years, up 78% from the 2023 trough. Product manager openings are also at a three-year high, up 75%. Interestingly perhaps, recruiter openings are approaching 2022 peak levels, a leading indicator that sustained hiring demand is building, not contracting.

Aaron Levie, the CEO of cloud storage company Box, pointed at the data and argued that we were seeing the effect of the Jevons Paradox, the economic theory that suggests that as technology advances and increases efficiencies, making us less reliant on a particular resource, our consumption of that resource often increases rather than decreases. (Jevons was talking about steam engines and coal: the observation still holds.) As AI makes programming more efficient, we’re doing more programming and may actually be employing more programmers.

It’s perhaps more interesting to look at what job category is not growing. While product manager and engineering recruitment has surged, recruitment for design roles has been flat since early 2023. The current theory says that AI is allowing engineers to move so fast that there’s less opportunity and less desire to involve the traditional design process. The ratio of product manager to designer demand has flipped: since mid-2023 there are more open PM roles than design roles, and that gap is widening.

This is the market telling you something.

The implementation layer (visual design, code generation, the mechanical act of producing artefacts) is being commoditised, while the specification layer (requirements, strategy, architecture, context) is becoming more valuable. Product managers, who deal in specifications, and engineers, who deal in system design, maintenance and integration, are in demand. Designers, who deal in visual implementation, are not. The job market has already priced in documentation-first. It just hasn’t named it yet.

Levie argues the surviving engineers will “understand what to prompt, how to review when an agent goes off the rails, how to guide back, how to maintain the system that was built, how to fix the ongoing bugs.” That’s not coding. That’s specification, context curation, and documentation. That’s the job. If AI writes the code, and developers were already spending most of their time not writing code but understanding and maintaining it, then the human’s job was never really coding. It was always specification and context. We just called it coding because we had to do that part too.

Unfortunately AI doesn’t generate good documentation when it generates code.

Back in December last year CodeRabbit analysed 470 GitHub pull requests: 320 were co-authored along with models, with the remaining 150 being entirely human-written. They found the AI-generated PRs had a 70% increase in major issues, and a 40% increase in critical issues. Logic and correctness issues rose 75%, and performance inefficiencies (excessive I/O, repeated file reads) appeared nearly eight times more often. The single biggest category gap was readability, with over three times more readability problems in the AI-generated code.

IEEE Spectrum reported something more troubling: newer models have developed a more insidious failure pattern. Instead of crashing, they generate code that fails silently, producing fake output that matches the expected format and quietly removing safety checks. This is harder to catch than a crash, and it produces no documentation trail.

The spiral works like this. AI writes code without meaningful documentation. The code works, or at least appears to work. Six months later, someone (or some other AI) needs to modify it. There’s no specification to work from, no record of design decisions, no explanation of constraints. The next AI reads the code and infers intent from structure, which is exactly what models are bad at doing at scale. It makes changes that satisfy local constraints but violate global invariants that were never written down. The system breaks in ways that are subtle, delayed, and expensive.

Teams across the industry are, as in this article from Harsh on dev.to, hitting this wall. Feature development halts, not because of a security breach or an outage, but because the codebase has become so tangled with AI-generated code that nobody who has “written” it can confidently modify it anymore. Six months of celebrated velocity, followed by weeks of full stop.

The diagnosis Harsh offers is precise: “Writing code was never the bottleneck. Understanding code is the bottleneck. Debugging code is the bottleneck. Modifying code you didn’t write — or that you wrote but don’t understand — is the bottleneck. AI made the fast part faster. It made the slow parts slower.”

There’s an emerging term for this: cognitive debt.

This is not the technical debt we’ve always had, where you cut a corner and understand that you cut it. Instead it’s a deeper problem where the code works, passes tests, looks clean, and sits in production until it breaks in a way nobody can diagnose. It’s the gap between what the codebase does, and what the team actually knows about it.

It’s a vicious feedback loop. Well-documented, well-specified systems get more value from AI. The model has clear constraints to work within, clear contracts to satisfy, clear context to reason about. Poorly documented systems get worse, faster. AI amplifies what’s already there. Strong foundations get amplified into faster shipping. Weak foundations get amplified into faster debt accumulation. All of this means that documentation-first isn’t just a way of working. It’s a precondition for AI-assisted development working at all.

A recent study ran a clean experiment. Researchers split a coding problem across two LLM agents, each independently implementing parts of the same class. They progressively stripped detail from the specification, from full docstrings at L0 down to bare function signatures at L3, and then measured what happened to integration accuracy.

A single agent working from a full specification hit 89% integration accuracy. Strip the spec down to bare signatures and that drops to 56%, a meaningful but graceful degradation. Two agents working from a full specification hit 58%. Strip the spec down and that collapses to 25%. The persistent coordination gap between single-agent and two-agent performance is 25 to 39 percentage points. The results hold true across models, tasks, and runs.

The gap decomposes into two effects. Coordination cost contributes 16 percentage points. Information asymmetry contributes 11. Even when both agents have the same partial specification, they can’t independently choose compatible internal structures, list versus dict, one invariant versus another, one naming convention versus another. They can’t prompt their way out of it by sharing more messages, because the missing artefact isn’t a message. It’s a shared decision.
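A toy version of the list-versus-dict problem makes the failure mode easier to see. This is a sketch I have made up for illustration, not code from the study: the `Inventory` class, its methods and the two "agents" are all hypothetical. Each half is reasonable on its own and respects the bare signatures, but the two halves disagree about what `self.items` is.

```python
class Inventory:
    # Written by hypothetical agent A, which chose a list of (name, qty) pairs.
    def __init__(self) -> None:
        self.items = []

    def add(self, name: str, qty: int) -> None:
        self.items.append((name, qty))

    # Written by hypothetical agent B, which assumed a dict of name -> qty.
    def total(self) -> int:
        return sum(self.items.values())


def integrates() -> bool:
    """Return True only if the two halves work together."""
    inv = Inventory()
    inv.add("widget", 2)
    try:
        return inv.total() == 2
    except AttributeError:
        # A list has no .values(); the halves were never compatible.
        return False


print(integrates())  # False
```

Nothing in the method signatures records the choice of container. A single sentence in the specification, saying that items are stored as a mapping from name to quantity, would have been enough to make both halves agree.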

The most damning finding is what the researchers did next. They built an AST-based conflict detector that achieves 97% precision at flagging incompatible designs without any extra LLM calls.
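To give a feel for what a static check of this kind can look like, here is a minimal sketch using Python's standard `ast` module. It is not the detector from the paper, and it makes strong simplifying assumptions: it only looks at `self.<name> = <literal container>` assignments, and the function names `attribute_kinds` and `conflicts`, along with the two source fragments, are hypothetical.

```python
import ast
from typing import Dict, List, Tuple

# Map AST literal nodes to the kind of container they create.
_KINDS = {ast.List: "list", ast.Dict: "dict", ast.Set: "set", ast.Tuple: "tuple"}


def attribute_kinds(source: str) -> Dict[str, str]:
    """Collect `self.<name> = <literal container>` assignments in `source`."""
    found: Dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        kind = _KINDS.get(type(node.value))
        if kind is None:
            continue
        for target in node.targets:
            if (isinstance(target, ast.Attribute)
                    and isinstance(target.value, ast.Name)
                    and target.value.id == "self"):
                found[target.attr] = kind
    return found


def conflicts(source_a: str, source_b: str) -> List[Tuple[str, str, str]]:
    """Return (attribute, kind_in_a, kind_in_b) for every disagreement."""
    a, b = attribute_kinds(source_a), attribute_kinds(source_b)
    return sorted((name, a[name], b[name])
                  for name in a.keys() & b.keys() if a[name] != b[name])


AGENT_A = """
class Inventory:
    def __init__(self):
        self.items = []
"""

AGENT_B = """
class Inventory:
    def reset(self):
        self.items = {}
"""

print(conflicts(AGENT_A, AGENT_B))  # [('items', 'list', 'dict')]
```

A check like this needs no model calls, because it only compares syntax trees. It can say that two fragments disagree, but it cannot say which choice was intended.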

Then they ran a recovery experiment: restoring the full specification alone recovered the single-agent ceiling of 89%. Adding conflict reports on top provided no measurable benefit. As the aeshift commentary on the paper put it, conflict detection is a smoke alarm, not a coordination strategy. If your multi-agent workflow expects the smoke alarm to prevent the fire, you’ll keep shipping incompatible pieces and calling it agent misalignment.

The bottleneck isn’t agent intelligence. It isn’t tooling. It isn’t message passing. It’s the specification. The richness of the specification is what separates 89% from 25%, and nothing else recovers the gap.

Act III: What happens when you get it wrong

Back in March, Amazon’s retail website went down. Four high-severity incidents in a single week. The worst, on March 5th, caused a 99% drop in US order volume. That’s roughly 6.3 million lost orders in six hours.

The Financial Times reported that AI tools were behind the outages and that Amazon had imposed new approval requirements on junior engineers. Amazon pushed back hard in a public statement where they said the FT reporting contained “inaccuracies.”

Amazon said that only one of the incidents involved AI tooling in any way. That incident was not caused by AI-written code. It was caused by an engineer following inaccurate advice that an AI tool inferred from an outdated internal wiki. No new approval requirements were introduced, and the other outages were separate, unrelated operational issues.

I don’t think it matters whether the outages were caused by AI tools or not. The interesting thing here is what Amazon confirmed rather than what they denied. An AI tool read an outdated internal wiki. It inferred advice from stale documentation. An engineer followed that advice. The site went down.

That’s not a story about AI writing bad code. It’s a story about context. The AI couldn’t distinguish a current wiki page from an obsolete one. It couldn’t assess whether the advice it was synthesising still applied to the system as it exists today. It treated everything in its context window with equal weight, and the engineer who followed the advice either didn’t have the experience to question it, or trusted the tool’s confidence more than their own uncertainty.

There’s a term for the issue the problems at Amazon demonstrated so handily: context engineering. It was coined by Tobi Lütke, the CEO of Shopify, in the middle of last year. Endorsed by Karpathy almost immediately, context engineering describes the core skill of working with models, as “the art of providing all the context for the task to be plausibly solvable by the LLM.”

Anthropic then formalised it in a blog post, defining context engineering as “the set of strategies for curating and maintaining the optimal set of tokens during LLM inference.” They also introduced a term I particularly like, context rot.

As context grows, the model’s ability to accurately recall information degrades. Not because the information is wrong, but because there’s too much of it, or because parts of it are in the wrong places, or because the stale parts are indistinguishable from the current parts. Most agent failures are not model failures. They are context failures.

Amazon’s outage is a context failure in its purest form. The wiki page wasn’t wrong when it was written: it became wrong over time. Nobody updated it, nobody marked it as obsolete, and when a tool ingested it, the tool couldn’t distinguish current operational guidance from a historical artefact.
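One way to picture the missing safeguard is as a freshness filter that runs before anything reaches the model. The sketch below is purely illustrative and says nothing about how Amazon's tooling works: the `Page` record, its `last_reviewed` and `obsolete` fields, the page titles and the 180-day threshold are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Page:
    title: str
    last_reviewed: date      # when a human last confirmed the page is accurate
    obsolete: bool = False   # explicit "do not use" marker


def usable_context(pages: List[Page], today: date, max_age_days: int = 180) -> List[Page]:
    """Keep only pages that are not marked obsolete and were reviewed recently."""
    cutoff = today - timedelta(days=max_age_days)
    return [p for p in pages if not p.obsolete and p.last_reviewed >= cutoff]


pages = [
    Page("Deploy runbook", date(2026, 4, 1)),
    Page("Legacy failover steps", date(2023, 6, 12)),
    Page("Old cache config", date(2026, 3, 15), obsolete=True),
]
fresh = usable_context(pages, today=date(2026, 5, 19))
print([p.title for p in fresh])  # ['Deploy runbook']
```

The filter is trivial. The hard part is the metadata it depends on: somebody has to record when a page was last reviewed and mark it obsolete when it stops being true, which is documentation work.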

Act IV: What happens when you get it right

The most surprising open source success story of the year was OpenClaw, the open source AI agent that went from zero to the most-starred project on GitHub in just four months. It’s also at the core a documentation-first system. It runs locally, connects to your messaging apps, and can execute tasks on your machine; and its architecture is, at least to a first approximation, just a bunch of markdown files.

This isn’t a new observation. Bret Taylor, co-founder of Sierra, chair of the OpenAI board and one of the creators of Google Maps, made exactly this observation in a recent interview where he said that OpenClaw was “just a bunch of markdown files, and the memory feels right, even though it’s a little bit kludgey.” He contrasted it with multi-agent architectures which, while they look elegant on a whiteboard, fail in practice because you “stuff all the context in the subagents, and the one on top has no ability to actually not sound robotic.”

The most successful AI-native open-source project in history isn’t successful because of sophisticated architecture. It’s successful because its specifications are readable, its interfaces are composable, and agents can understand what things do by reading what’s written about them.

Taylor sees two possible futures. In one, open source becomes less important because we all independently recreate the functionality with coding agents. Why use someone else’s library when you can generate your own in minutes? In the other, open-source platforms survive by being agent-hackable. They end up shipping with what he calls an “exceptional agent harness” that makes them easy to extend and improve with AI.

Taylor is also, it’s worth noting, growing more sceptical of MCP as a meaningful part of the future. The formal protocol matters less than the documentation that describes what things do. OpenClaw’s skills system (markdown files with descriptions that agents can read) offers a more practical interface than a protocol specification. This is becoming a commonly held view amongst developers using agents to develop code, but here at least I disagree with him. Instead I think MCP servers can serve as important guardrails, giving deterministic outcomes to tool calls and moderating the inherent non-determinism of working with agents relying on skills alone. MCP has its place, and its place is user- rather than developer-facing.

Act V: Documentation-first programming

This isn’t a prediction. It’s a description of what’s already happened. The human’s job in AI-assisted development is, increasingly, to write specifications, maintain context, curate documentation, and define contracts. The code is generated. The tests can be generated from the specifications. The proofs can be automated. What can’t be generated is the specification itself. The specification is the articulation of what “correct” means, what constraints apply, and what the system should and shouldn’t be doing.

Manny Silva, who leads documentation at Skyflow, has been writing about documentation-first development through his “Docs as Tests” methodology. He is treating documentation as testable assertions about product behaviour. He’s built tooling that parses documentation, executes the described procedures, and validates that the product matches what the docs say. It’s good work, but I think he’s coming at it from the wrong end.
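The general idea of documentation as executable assertions can be sketched with Python's standard `doctest` module, which runs the examples embedded in a docstring and compares the output with what the docs claim. This is not Silva's tooling, just a small stand-in for the same principle. The functions `apply_discount` and `drifted` and the helper `docs_match` are hypothetical names.

```python
import doctest


def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount.

    >>> apply_discount(200.0, 25)
    150.0
    >>> apply_discount(80.0, 0)
    80.0
    """
    return price * (1 - percent / 100)


def drifted(price: float, percent: float) -> float:
    """Documented as a percentage discount, but the code has drifted.

    >>> drifted(200.0, 25)
    150.0
    """
    return price - percent  # behaviour no longer matches the docs


def docs_match(func) -> bool:
    """Execute the examples in `func`'s docstring; True if all of them pass."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failed = 0
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        # Discard the failure report; only the count matters here.
        failed += runner.run(test, out=lambda text: None).failed
    return failed == 0


print(docs_match(apply_discount))  # True
print(docs_match(drifted))         # False
```

Here the docstring is checked against the code after the fact, which is the mirror model: the documentation verifies the product rather than producing it.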

Silva treats documentation as a mirror of the product, a quality-assurance mechanism that verifies the product matches what was written about it. The documentation-first thesis is something more radical: here, the documentation is the source of the product. Not a reflection of what was built, but the specification from which it is built. The contracts come first, the code follows. The documentation isn’t a test of the product; it is the product.

SaaS was built on the premise that software was scarce. That it was hard to build and expensive to maintain. AI is collapsing the scarcity of code.

The Jevons paradox matters: collapsing the cost of code doesn’t mean less code. It means more. Software expands into every domain that couldn’t previously afford it, and every new domain that software enters needs specifications, needs documentation, needs someone to articulate what “correct” means in a context that’s never been formalised before. As the scarcity of code collapses, the scarcity of specification increases.

Taylor’s two futures for open source map onto this directly. In one future, where everyone regenerates functionality with coding agents, we get a world of disposable code with no institutional memory: the Amazon outage at scale. In the other future, where open-source projects ship with agent harnesses and composable specifications, we get a world where the documentation layer is the durable value and the code is the disposable part.

The SaaS companies that will survive the current repricing are the ones whose value lies in trust, compliance, large sets of proprietary data and domain expertise, not in the code itself. The open-source projects that will matter are the ones with specifications rich enough to be agent-hackable, not the ones with the most stars or the cleanest code. And the developers who will thrive are the ones who can write specifications, maintain context, and curate documentation: not the ones who can write code fastest, because that race is already lost.

The part of the job most developers hate has become the whole job. The irony is that it was always the most important part. We just couldn’t see it, because we were too busy writing the code.
