NURL generation accuracy & token report

Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
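The compile/correct distinction can be sketched as below. This is a minimal illustration of the definitions as stated, not the logic in score.py: the `Sample` record and the `score` function are hypothetical names, and the sketch assumes "exact expected bytes" means a plain byte-for-byte comparison with no trimming.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One first-pass attempt (hypothetical record, not score.py's own type)."""
    built: bool    # True if the emitted program built with zero changes
    stdout: bytes  # bytes the program printed; empty if it never built


def score(sample: Sample, expected: bytes) -> tuple[bool, bool]:
    """Return (compile, correct) for one sample."""
    compiled = sample.built
    # Assumed: exact byte comparison, no whitespace or newline normalisation.
    correct = compiled and sample.stdout == expected
    return compiled, correct


print(score(Sample(True, b"42\n"), b"42\n"))   # builds and matches
print(score(Sample(True, b"42"), b"42\n"))     # builds, wrong bytes: a miss
print(score(Sample(False, b""), b"42\n"))      # does not build
```

Under this reading a sample can never be correct without also compiling, which is why the correct column is always at most the compile column.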

Reproducible with any model — see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing noprimer is the no-reference condition.
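The naming convention for conditions could be checked along these lines. This is only a sketch: the function name and the example directory names are hypothetical, and matching on the final path component alone is an assumption, so score.py may apply the rule differently.

```python
from pathlib import Path


def is_no_reference(solution_dir: str) -> bool:
    """True if the condition name marks the no-reference (no primer) run."""
    # Assumed: only the directory's own name counts, not its parent folders.
    return "noprimer" in Path(solution_dir).name


# Hypothetical directory names, for illustration only.
for d in ["solutions/nurl_noprimer", "solutions/nurl_v2"]:
    print(d, is_no_reference(d))
```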

Read every NURL number in this light: the training data of public LLMs contains zero lines of NURL, against millions of lines of Python and Rust. The NURL rows measure a language the model has never seen, written only from the one-page reference below.

Headline — held-out accuracy across three models

NURL first-pass writability scales with model capability, but no model reaches Python/Rust parity on the held-out string tasks. Every model got the same 4 held-out tasks (words, brackets, csv_sum, histogram; none were used to tune the primer), primer v2, and 5 samples per task, for 20 programs per model and language. haiku and sonnet ran at temp 0.7; opus-4.8 rejects the temperature parameter, so it ran at its default sampling (its samples were still diverse, so the rate is meaningful):

| model | NURL compile | NURL correct | Python | Rust |
|---|---|---|---|---|
| `claude-opus-4-8` | 17/20 (85%) | 15/20 (75%) | 20/20 (100%) | 20/20 (100%) |
| `claude-sonnet-4-6` | 12/20 (60%) | 12/20 (60%) | 20/20 (100%) | 20/20 (100%) |
| `claude-haiku-4-5` | 10/20 (50%) | 10/20 (50%) | 20/20 (100%) | 20/20 (100%) |

The compile rate rises monotonically with capability (50 → 60 → 85%), which suggests that stronger models internalise the one-page reference better. Even so, opus-4.8 is 25 points short on correctness (75% against 100%) compared with Python and Rust, which are perfect for every model.
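The trend and the remaining gap can be recomputed directly from the counts in the table. The sketch below hard-codes those counts; the variable and function names are illustrative and not taken from score.py.

```python
SAMPLES = 20  # 4 held-out tasks x 5 samples

# model: (compiled, correct), ordered from least to most capable
nurl = {
    "claude-haiku-4-5": (10, 10),
    "claude-sonnet-4-6": (12, 12),
    "claude-opus-4-8": (17, 15),
}


def pct(n: int, total: int = SAMPLES) -> int:
    """Whole-number percentage of n out of total."""
    return 100 * n // total


compile_rates = [pct(c) for c, _ in nurl.values()]
# Python and Rust are 100% correct for every model, so the gap is 100 - rate.
correct_gap = {m: 100 - pct(k) for m, (_, k) in nurl.items()}

strictly_rising = all(a < b for a, b in zip(compile_rates, compile_rates[1:]))
print(compile_rates, strictly_rising)  # [50, 60, 85] True
print(correct_gap["claude-opus-4-8"])  # 25
```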

One failure, shared by all three models, accounts for the gap: forgetting the $ import for stdlib string functions (nurl_str_len / nurl_str_get). The only other miss was a single opus sample that invented a nurl_str_byte function, which does not exist. The arithmetic in-sample tasks never used stdlib imports, which is exactly why their tuning (below) overstated NURL's accuracy.

The in-sample tuning arc (haiku), for context

| condition (haiku-4.5, 5 samples, temp 0.7) | compile | correct |
|---|---|---|
| v1 primer · 7 arithmetic tasks | 29/35 (82%) | 28/35 (80%) |
| v2 primer · same 7 tasks | 34/35 (97%) | 30/35 (85%) |
| v2 primer · 4 held-out string tasks | 10/20 (50%) | 10/20 (50%) |

The v2 immutability rule lifted the in-sample compile rate to 97% (85% correct), but it did not generalise: the held-out set sits at 50% on both measures, because string tasks expose a failure the arithmetic tasks never did. Held-out numbers are the honest ones; in-sample tuning overstates.
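The table's percentages and the in-sample versus held-out gap can be reproduced from the raw counts. One detail worth knowing when comparing rows: the printed figures match truncation rather than rounding to nearest (29/35 is 82.86%, shown as 82%). Whether score.py truncates on purpose is an assumption here, and the names below are illustrative.

```python
# condition: (compiled, correct, total), copied from the table
arc = {
    "v1 in-sample": (29, 28, 35),
    "v2 in-sample": (34, 30, 35),
    "v2 held-out": (10, 10, 20),
}


def pct_floor(n: int, total: int) -> int:
    """Percentage truncated toward zero, which matches the printed figures."""
    return 100 * n // total


rates = {name: (pct_floor(c, t), pct_floor(k, t)) for name, (c, k, t) in arc.items()}
print(rates)

# Points lost when moving from the tuned tasks to the held-out tasks.
compile_drop = rates["v2 in-sample"][0] - rates["v2 held-out"][0]
correct_drop = rates["v2 in-sample"][1] - rates["v2 held-out"][1]
print(compile_drop, correct_drop)  # 47 35
```

Rounding to nearest would instead give 83% and 86% for the two figures shown as 82% and 85%; the conclusion does not change either way.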

Bottom line. From one page, a model that has seen zero lines of NURL writes correct code at a rate that climbs from 50% (haiku) to 75% (opus) on held-out string tasks — genuinely usable and clearly capability-bound, but not yet at Python/Rust parity. The defensible public claim is "a regular grammar a model can learn from one page," with capability and stdlib-import caveats — not "as good as mainstream languages."

This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.

Primer — prompt overhead (charged once per NURL request)

The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1330 tokens (cl100k) / 1334 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
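To see how the overhead enters a per-request total, a minimal sketch follows. The primer sizes come from the paragraph above; the function name and the program size in the usage line are hypothetical and are not measurements from this report.

```python
# Primer size in tokens per tokenizer, as reported above.
PRIMER_TOKENS = {"cl100k": 1330, "o200k": 1334}


def nurl_request_tokens(program_tokens: int, tokenizer: str = "cl100k") -> int:
    """Primer overhead plus the emitted program, for one primed NURL request."""
    return PRIMER_TOKENS[tokenizer] + program_tokens


# Hypothetical 200-token program, for illustration only.
print(nurl_request_tokens(200))           # 1530
print(nurl_request_tokens(200, "o200k"))  # 1534
```

Python and Rust requests carry no primer, so their comparable figure is the program's token count alone.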