NURL generation accuracy & token report

Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
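The compile/correct definitions above can be sketched as follows. This is a minimal illustration under assumptions, not the code in bench/genacc/score.py: the `Attempt` record, `classify`, and `rates` are hypothetical names, and the demo attempts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One first-pass generation (hypothetical record shape, not score.py's)."""
    built: bool      # True only if the emitted program built with zero changes
    stdout: bytes    # bytes the built program printed; b"" if it never built


def classify(attempt: Attempt, expected: bytes) -> tuple[bool, bool]:
    """Return (compile, correct) for a single attempt."""
    compiled = attempt.built
    # Correct requires a byte-for-byte match; a build that prints the
    # wrong answer counts as compile=True, correct=False.
    correct = compiled and attempt.stdout == expected
    return compiled, correct


def rates(attempts: list[Attempt], expected: bytes) -> tuple[float, float]:
    """Percentage of attempts that compile and that are correct."""
    results = [classify(a, expected) for a in attempts]
    n = len(results)
    compile_pct = 100.0 * sum(c for c, _ in results) / n
    correct_pct = 100.0 * sum(ok for _, ok in results) / n
    return compile_pct, correct_pct


# Made-up attempts for illustration only (not benchmark data).
demo = [
    Attempt(built=True, stdout=b"3\n"),   # compiles and matches
    Attempt(built=True, stdout=b"4\n"),   # compiles, wrong bytes: a miss
    Attempt(built=False, stdout=b""),     # does not build
    Attempt(built=True, stdout=b"3\n"),   # compiles and matches
]
print(rates(demo, expected=b"3\n"))  # (75.0, 50.0)
```

Because correct implies compile, the correct rate can never exceed the compile rate in any row.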

Reproducible with any model — see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing noprimer is the no-reference condition.

Read every NURL number in this light: public LLMs have zero lines of NURL in their training data, against (order of magnitude) tens of millions of lines of Python and millions of lines of Rust. Each NURL row measures a language the model has never seen.

Headline — primer v3: the targeted import cue works, then the constraint shifts

Primer v3 keeps the focused ~1.3k-token reference and adds one prominent, general rule: stdlib functions (above all nurl_str_*) need their $ import — stated at the very top and shown in the canonical worked example. It is 1,604 tokens (a ~270-token addition, not the 14k grammar dump that failed — see RESULTS_old.md and primers/nurl_old.md). This run adds a fourth model: mercury-2, Inception Labs' diffusion LLM (via generate_inception.py), at two reasoning efforts.

The targeted cue worked. Import-omission — the dominant held-out failure in every prior run — collapsed across all models: sonnet 20→0, opus →0, haiku 6→1, mercury 0.

Held-out (4 string tasks, 5 samples, primer v3):

model	NURL compile	NURL correct	Python	Rust
`claude-sonnet-4-6`	80%	80%	100%	100%
`claude-opus-4-8`	70%	70%	100%	100%
`claude-haiku-4-5`	55%	45%	100%	100%
`mercury-2` (diffusion, reasoning=medium)	15%	15%	100%	100%
`mercury-2` (diffusion, reasoning=high)	30%	25%	100%	100%
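To read the percentages as counts, here is a small sketch. It assumes each model produced 4 tasks x 5 samples = 20 NURL programs, which is inferred from the table caption rather than stated by score.py; `failures` is a hypothetical helper.

```python
# Assumption: 4 held-out tasks x 5 samples = 20 NURL programs per model.
SAMPLES_PER_MODEL = 4 * 5

# NURL compile percentages copied from the held-out table.
nurl_compile_pct = {
    "claude-sonnet-4-6": 80,
    "claude-opus-4-8": 70,
    "claude-haiku-4-5": 55,
    "mercury-2 (reasoning=medium)": 15,
    "mercury-2 (reasoning=high)": 30,
}


def failures(pct: int, n: int = SAMPLES_PER_MODEL) -> int:
    """Number of programs out of n that did not compile at the given rate."""
    passed = round(pct * n / 100)
    return n - passed


for model, pct in nurl_compile_pct.items():
    print(f"{model}: {failures(pct)} of {SAMPLES_PER_MODEL} did not compile")
```

Under that assumption, 70% for opus corresponds to 14 compiling programs and 6 failures, and each single program moves a rate by 5 points.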

Per-model trajectory across primer revisions (NURL compile %, same 4 held-out tasks):

model	v2 (1,330 tok)	grammar (14,098 tok)	v3 (1,604 tok)
sonnet-4.6	60%	0%	80%
opus-4.8	85%	75%	70%
haiku-4.5	50%	60%	55%

But fixing imports did not raise the ceiling — the binding constraint moved. With imports handled, the next ingrained habit becomes the dominant failure, and it differs by model:

sonnet had only the import problem, so it jumps to its best result (80%).
opus now fails by parenthesising operator expressions for grouping, e.g. `( . counts d )` and `( >= c 65 )`. This accounts for all 6 of its failures and is a C/Python habit NURL rejects: `(` is a call, never grouping.
haiku regresses onto mutability (3 failures: `=` applied to an immutable binding).
mercury is dominated by the same grouping-paren habit, wrapping `|`, `.`, `&`, `>=`, `==` and `*` expressions in parentheses.
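The grouping-paren habit can be spotted mechanically with a rough heuristic like the one below. This is only a sketch: the operator list is limited to the operators named in this report (not a full NURL operator set), `grouping_paren_ops` is a hypothetical helper, and a regex like this ignores strings and comments, so it is no substitute for the NURL compiler's own diagnostics.

```python
import re

# Operators taken from the failure examples in this report only;
# this is an assumption, not the complete NURL operator list.
OPERATORS = (">=", "==", "|", ".", "&", "*")

# An opening paren whose first token is an operator, e.g. "( >= c 65 )".
_PATTERN = re.compile(
    r"\(\s*(" + "|".join(re.escape(op) for op in OPERATORS) + r")\s"
)


def grouping_paren_ops(source: str) -> list[str]:
    """Return the operator found right after each suspicious '('."""
    return [m.group(1) for m in _PATTERN.finditer(source)]


print(grouping_paren_ops("( . counts d )"))  # ['.']
print(grouping_paren_ops("( >= c 65 )"))     # ['>=']
```

Counting such hits per failed program is one way to attribute a compile failure to this habit rather than to imports or mutability.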

Reasoning effort helps the diffusion model, up to a point. Re-running mercury-2 with reasoning_effort: high (vs medium) doubled its NURL compile rate (15→30%) and raised its correct rate from 15% to 25%, with no change to Python/Rust (both 100% at either effort). But on the longest task (brackets) the high-effort model spent its entire 8,192-token budget on reasoning and returned empty output on all five samples (0/5, median 1 token), which is a budget interaction, not pure inability. Even at high effort it trails every Claude model.

The diffusion model, sharply. mercury-2 writes Python and Rust at 100%, so it is a capable code model, yet it reaches at most 30% on NURL, almost entirely because it cannot stop wrapping operator expressions in parentheses. This is the cleanest demonstration that the gap is out-of-distribution habit, not raw capability.

Honest bottom line. A focused, targeted primer cue reliably fixes the specific failure it targets (v3 all but eliminated import-omission in every model, the diffusion model included), and more reasoning effort buys a real but partial lift. But first-try accuracy on a language with zero training presence is gated by a sequence of in-distribution habits (imports, then grouping parentheses, then mutability), each surfacing as the last is fixed. No model reaches Python/Rust parity. The claim these results support is that a regular grammar is learnable from a focused page and that individual failure modes are fixable with targeted cues, not that NURL matches mainstream languages for first-try generation.

This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.

Primer — prompt overhead (charged once per NURL request)

The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1604 tokens (cl100k) / 1605 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
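A token count like the one above could be reproduced along these lines. This sketch assumes the counts come from the `tiktoken` package's `cl100k_base` and `o200k_base` encodings, which is a guess based on the encoding names and not confirmed against score.py; `count_tokens` and `request_tokens` are hypothetical helpers.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base", encoder=None) -> int:
    """Count BPE tokens in text.

    encoder may be any object with an encode(str) -> list method; when it is
    omitted, the third-party tiktoken package is loaded (assumed installed).
    """
    if encoder is None:
        import tiktoken  # assumption: score.py's tokenizer may differ
        encoder = tiktoken.get_encoding(encoding_name)
    return len(encoder.encode(text))


def request_tokens(program_tokens: int, primer_tokens: int) -> int:
    """Primer overhead plus the emitted program, for one primed NURL request."""
    return primer_tokens + program_tokens


# Usage (needs tiktoken and the primer file on disk):
#   from pathlib import Path
#   primer = Path("primers/nurl.md").read_text(encoding="utf-8")
#   count_tokens(primer, "cl100k_base")
#   count_tokens(primer, "o200k_base")
```

Keeping the primer and program counts as separate terms mirrors the report's choice to break the overhead out rather than fold it into the per-program token cost.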