Three camps alike in dignity

Alasdair Allan

20 May 2026

🪧 A full catalogue of the field can now be found at agentlanguages.dev.

In the last six months, programming languages designed for models rather than humans have started to appear. Work on most began entirely independently, initiated by people who had never heard of each other, but all roughly during the same window of time. I know this, because I built one of them.

Vera is a statically typed purely functional language with mandatory contracts on every function, typed effects, Z3 formal verification, and no variable names. I started thinking about it over Christmas, sat down and asked myself whether it was even the right question to ask in January, then released the language anyway in February. By April I has published the first benchmark results showing flagship models writing Vera at 93% correctness despite zero training data, matching their performance with Python.

I thought I was working on a niche experiment. But within days of publishing those results, a pull request landed on the VeraBench repository from another language designer asking to add support for his language. VeraBench could measure whether Vera works, but not why it works, and not which design decisions are doing the heavy lifting. With a second language built on the same bet in the benchmark, that changes.

So I looked around, and so far I’ve found sixteen other people building languages for models.

The disagreement Vera has been having with itself since February is now a disagreement Vera has with the rest of the field. But not everyone in that field agrees on what the problem is, let alone how to fix it.

It turns out we divide pretty cleanly into three camps.

Three camps alike in dignity

The syntactic camp believes the problem is representational, that LLMs fail because existing programming languages are ambiguous at the token level, and strips that ambiguity out. The verification camp believes the problem is semantic coherence, that the model can produce plausible-looking code which is subtly wrong, and makes contracts mechanically checkable. The orchestration camp reframes the problem as agent coordination rather than language design at all.

The verification camp is the most crowded, the most internally fractious, and the one where I think the interesting fight is currently happening. We agree on the diagnosis. We disagree sharply on the prescription, and the disagreements are concrete enough to be testable.

Vera is in the verification camp but bridges into the orchestration camp as well, as LLM inference is a first-class typed effect in Vera, so a function that calls another model can and must declare <Inference> in its effect signature. Agent orchestration becomes something the compiler can verify rather than something you hope that a random framework will handle correctly. No other language in the space treats model-calling-model as a typed, constrained operation, and as far as I know Vera is the only one that spans the camps.

The syntactic camp

Probably the most extreme example of the syntactic camp is X07. It eliminates text syntax entirely in favour of JSON ASTs that agents edit through structured RFC 6902 patches. Although there are others, NERD replaces all operators with English keywords on the theory that LLM tokenisers handle words better than symbols, while Magpie surfaces SSA form as the user-facing syntax.

Laze optimises in the other direction entirely, stripping the language down to a minimal indentation-based syntax with no punctuation and compiling to C, on the theory that the bottleneck is generation speed rather than correctness.

The syntactic camp’s bet is that if you make the representation unambiguous enough, the model’s existing capabilities are sufficient. It is the simplest diagnosis and the easiest to test. I think it is also the least likely to be right, because the failure modes I see in LLM-generated code are not primarily tokenisation errors; they are errors of intent and coherence at scale.

The verification camp

This is where Vera sits along with its closest siblings and, I think, the sharpest most testable disagreements.

Szymon Teżewski‘s Aver is the closest sibling to my own language. He arrived at the verification-camp diagnosis independently and built Aver in Rust with co-located verify blocks, Lean 4 proof export, effect lists declared on function signatures, and a distinctive decision block construct that embeds architectural decision records directly in source. He announced Aver about a week before I released Vera. The two languages have the same diagnosis and almost entirely different prescriptions. Aver keeps named variables. Vera dropped them. Aver verifies through co-located example tests with optional Lean export. Vera verifies with Z3 directly through three tiers of static, guided, and runtime fallback. Aver has match-only branching with no if/else. Vera has both. Aver transpiles to Rust for deployment. Vera compiles to WebAssembly. Different choices at almost every level, but the same underlying bet.

Raffael Schneider‘s BHC and hx, two related projects under his arcanist.sh, are the second body of work whose timeline overlapped mine almost exactly. He had also been working through the implications since December, also during a late-night Claude Code session, and arrived at a verification-camp diagnosis from a Haskell starting point.

His bet is that the right move is to fix Haskell’s tooling and runtime rather than build a new language from scratch, on the theory that declarative types-as-proofs already plays to LLM strengths, and that the reason Haskell has not already won this argument is toolchain friction rather than language fit. He is building hx, a Rust-based wrapper for cabal, stack, and ghcup, alongside a clean-slate Haskell compiler called BHC targeting the Haskell 2026 Platform spec with six runtime profiles for different deployment scenarios. He published a long, thoughtful post about the field and named Vera explicitly as the canonical example of his first approach, which he treated generously, saying, “The person who built Vera saw the same thing I saw, probably around the same time. They chose the first approach. I chose the second. Somebody, eventually, will build the third.“”

Magnus Knutas built Prove, which takes the verification thesis and inverts it. Where Vera and Aver and arcanist.sh are all trying to make code easier for LLMs to write correctly, Prove is deliberately AI-resistant. Its licence explicitly prohibits use as AI training data, and its design philosophy is to make the language hard for pattern-matching approaches to generate. This is fascinating precisely because it inverts the diagnosis: Knutas makes the same observation that LLM code generation has structural failure modes, but comes to the opposite conclusion about what to do about it. Prove is technically capable, with binary installers and a tree-sitter grammar, and it is exactly the kind of project that makes the field interesting.

Viktor Kikot recently announced Pact, which sits close to Aver in design: intent annotations on every function, explicit effect declarations, pipeline syntax, Rust compiler, MCP server for agent integration. Same instincts, different vocabulary.

Then there is MoonBit. The MoonBit team have been quietly making the AI-friendly type-system bet since 2023, and have published a peer-reviewed ICSE paper in 2024 on a real-time semantics-aware token sampler that constrains LLM generation to produce syntactically and semantically valid code as it is generated. This is the most technically novel contribution in the entire space and the only piece of peer-reviewed published academic work I have found.

MoonBit is also by far the most mature project in the landscape, with a full toolchain, four compilation backends, and three years of accumulated training data. They are positioning the language as general-purpose with AI-friendliness as a first-class constraint, which is a different framing from Vera, but a closely related bet.

The orchestration camp

Pel and Marsha reframe the problem as agent coordination rather than language design at all. Adjacent to these, Plumbing by William Waites provides the graph-level wiring that connects agents into typed, streaming pipelines with static well-formedness checking, defining the coordination infrastructure that orchestration-camp languages operate within. On the academic side, Quasar from UPenn takes a different angle entirely, transpiling from a Python subset and adding automated parallelisation and uncertainty quantification for agent code actions, with published results showing 42% execution time reduction and 52% fewer user approval interactions on a real visual question-answering agent.

There are at least half a dozen others in various states of readiness, from serious solo engineering efforts, to thought experiments, to landing pages with no implementation behind them. The cluster is real even if not every member of it is.

The disagreement that actually matters

Aver kept variable names but added Lean proofs, intent annotations, and decision blocks. MoonBit bet on a strong type system alone plus its semantics-aware sampler. arcanist.sh argues you should take Haskell, which already has the right shape, and fix the surrounding machinery. Prove deliberately makes the language harder to generate. Pact kept names and added intent blocks and pipeline syntax. Vera dropped variable names entirely.

Vera is the only language in the entire landscape that did this. Every other project in the verification camp, including Aver and arcanist.sh and MoonBit and Prove and Pact, kept named variables.

The De Bruijn slot reference notation (@Int.0, @Int.1) that anyone reading Vera code notices first is the most contrarian position in the field on what I think is the most important question: what to do about the fact that LLMs are bad at maintaining naming coherence at scale?

Either I am right that naming coherence is the single largest source of LLM error and the De Bruijn decision is doing real work, or my colleagues in the verification camp are right that the verification machinery is doing all the heavy lifting and the binding strategy is incidental.

The current VeraBench results are are silent on this. They establish that Vera produces results competitive with Python on flagship models, but they cannot isolate the contribution of any specific Vera design decision from any other.

Szymon Teżewski‘s pull request changes that. Aver is the cleanest comparison point Vera has in the entire space because the verification machinery is otherwise so similar. Both languages have effect systems. Both have contracts. Both target agent workflows. Both are pure by default. Both are statically typed. The thing that meaningfully differs between them is the binding strategy. A Vera-versus-Aver run on the same problems is therefore unusually well-suited to isolating the effect of the De Bruijn decision from the effect of verification more broadly. For the first time, the disagreement inside the verification camp should have a number attached to it.

A polite disagreement with Raffael

Raffael Schneider sees Vera and arcanist.sh as two different points on the same timeline. Vera as the near-term “explicit and unambiguous” bet, and arcanist.sh as the medium-term “declarative types-as-proofs” bet. The implication is that Vera is making the procedural-but-unambiguous move, and arcanist.sh is making the higher-abstraction declarative move, and that the field is going to evolve through Vera-shaped languages first and then arcanist.sh-shaped languages second.

I do not think this is quite right. Vera’s requires, ensures, and effects clauses are declarative in exactly the sense Raffael means. They describe what a function is rather than how it computes its result. Z3 checks the proof. The De Bruijn slot references exist precisely so that the model does not waste effort tracking variable names that contribute nothing to correctness. The thing that makes Vera look procedural in the function body Raffael quoted is the body itself, not the contract layer above it, and the contract layer is where the actual work of specification lives.

Vera and arcanist.sh are not two points on a timeline. They are two prongs of the same near-term bet, taking slightly different paths to the same place. We agree that declarative types-as-proofs is the right abstraction for AI-authored code. We disagree about whether to start from Haskell and rebuild the toolchain or start from scratch and design every decision around LLM cognition from the ground up. Both bets are defensible. Both are testable. Neither is the sequential successor of the other.

This is a small disagreement and a friendly one, but it matters because the framing changes how the field understands itself. If Vera and arcanist.sh are sequential, then the convergence story is two people making different bets on what comes next. If they are parallel, then the convergence story is two people making different versions of the same bet on what is happening now. The second framing is the one I think is accurate, and it is the one that makes the field feel more like an emerging consensus and less like a sequence of phases.

What happens now?

MoonBit support is the planned next addition to VeraBench. MoonBit is the strongest available counterargument to Vera precisely because it has similar goals and three years of training data accumulation that Vera lacks. The honest version of that comparison will need to flag the training-data asymmetry prominently, because a head-to-head Vera-versus-MoonBit score conflates two variables: design quality and corpus volume. But the question of how much of Vera’s design advantage survives against a mature alternative is one of the most interesting open questions, and VeraBench is going to be where the first real data point lands.

Beyond benchmarks, the field is starting to attract production interest. I have had conversations in the last few weeks with teams exploring how Vera’s contract verification could interact with zero-knowledge proof systems for agent-to-agent trust, and with engineering organisations that already generate significant amounts of code no human will ever read, who want to evaluate Vera for that workflow. These are early signals, not adoption. But they suggest the thesis is landing with people who have real problems, not just with people who find language design interesting.

The broader work continues in the directions I’ve talked about before. Pass@k evaluation for stable rates, and larger problem sets that stress the contract system harder. More models. Longitudinal tracking as both the language and the models evolve.

But to be brutally honest, the field is young enough that almost any rigorous measurement is novel. Aver is now integrated into VeraBench, making it the first external language to join the benchmark suite, but head-to-head results have not yet been published. MoonBit published compilation rates from its sampler. NERD published token counts. Quasar published agent-system performance improvements. arcanist.sh has not yet published benchmarks because the toolchain is still being built. Vera’s willingness to put numbers on the table is itself a contribution to the field, and the next year is going to produce a lot more of them.

We are not the only ones

Back in January I wondered whether building a programming language for AI was solving the right problem. The answer three months after Vera’s release is that I still do not know for certain, but I now know I am not the only person betting that it is. Sixteen other people have seemingly independently arrived at variations on the same thesis, in roughly the same window of time. Some of them made implementation choices I think are wrong. Some made choices I am genuinely uncertain about. Almost all of them disagree with Vera about something important. But that is exactly what a healthy young field looks like, and I didn’t start writing Vera to be right, I started writing it to try and figure out what the question was.

Update: Then there were eighteen. Vercel just dropped Zero. It shares the “one canonical form” instinct and the structured-diagnostics-for-agents approach, but no mandatory contracts, no formal verification, no typed effects. Zero is betting that regularity and actionable error messages are sufficient. Vera is betting they’re not, and that you need the compiler to prove the code satisfies its spec. Both bets are testable.

Update: Then there were nineteen, AILANG is another verification-camp entry. Purely functional, capability-based effects (including an AI effect that echoes Vera’s typed Inference), Hindley-Milner types, deterministic execution, and written in Go. The gap from Vera’s perspective, no mandatory contracts, no SMT verification. Strong types but no proofs.

Update: Then there were twenty, Boruna straddles the orchestration and verification camps. It’s a Rust-based workflow execution engine with its own language (.ax files), a bytecode VM with capability-gated side effects, and hash-chained tamper-evident audit logs. The pitch: when a regulator asks what exactly ran, what the model saw, and what it returned, you can prove it. Boruna orchestrates workflows and then verifies the orchestration was done correctly. Vera verifies individual functions and then orchestrates them. Same impulse, opposite entry point.

Update: The list is now a site. I’ve been tracking this field first as research, then as a blog post, and now as a public, community-edited catalogue at agentlanguages.dev. Twenty-one projects so far and counting spread across the three camps. The catalogue is editorially curated but open to pull requests on GitHub. If you’re building a language for AI agents, add yourself.

View all posts Back to top