Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
Reproducible with any model; see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing `noprimer` is the no-reference condition.
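As a rough illustration of the definitions above, the sketch below scores a handful of attempts for one task. It is not the code in score.py: the `Attempt` record, its field names, and the example directory name are assumptions made for this example, and token counting is left out because it depends on the tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One first-pass attempt (hypothetical record, not score.py's schema)."""
    built: bool    # the program compiled with zero changes
    stdout: bytes  # raw bytes the program printed (empty if it never ran)


def score(attempts: list[Attempt], expected: bytes) -> tuple[int, int, int]:
    """Return (compiled, correct, total) for one task."""
    compiled = sum(a.built for a in attempts)
    # Correct means exact bytes: no whitespace stripping, no fuzzy matching.
    # A program that does not build can never count as correct.
    correct = sum(a.built and a.stdout == expected for a in attempts)
    return compiled, correct, len(attempts)


def is_no_reference(condition_name: str) -> bool:
    """A solution directory whose name contains 'noprimer' is the no-reference condition."""
    return "noprimer" in condition_name


if __name__ == "__main__":
    expected = b"3\n"
    attempts = [
        Attempt(built=True, stdout=b"3\n"),  # compiles and matches
        Attempt(built=True, stdout=b"3"),    # compiles, missing newline: a miss
        Attempt(built=False, stdout=b""),    # does not compile
    ]
    compiled, correct, total = score(attempts, expected)
    print(f"compile {compiled}/{total}, correct {correct}/{total}")
    # Hypothetical directory name, used only to show the naming rule.
    print(is_no_reference("claude-sonnet-4-6__noprimer"))
```

The second attempt shows why compile and correct can differ: it builds, but one missing byte makes it a miss.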
Read every NURL number in this light: public LLMs contain zero lines of NURL in their training data, against (order-of-magnitude) tens of millions of lines of Python and millions of Rust. The NURL rows measure a language the model has never seen.
Primer v3 keeps the focused ~1.3k-token reference and adds one prominent, general rule: stdlib functions (above all nurl_str_*) need their $ import — stated at the very top and shown in the canonical worked example. It is 1,604 tokens (a ~270-token addition, not the 14k grammar dump that failed — see RESULTS_old.md and primers/nurl_old.md). This run adds a fourth model: mercury-2, Inception Labs' diffusion LLM (via generate_inception.py), at two reasoning efforts.
The targeted cue worked. Import-omission — the dominant held-out failure in every prior run — collapsed across all models: sonnet 20→0, opus →0, haiku 6→1, mercury 0.
Held-out (4 string tasks, 5 samples, primer v3):
| model | NURL compile | NURL correct | Python | Rust |
|---|---|---|---|---|
| claude-sonnet-4-6 | 80% | 80% | 100% | 100% |
| claude-opus-4-8 | 70% | 70% | 100% | 100% |
| claude-haiku-4-5 | 55% | 45% | 100% | 100% |
| mercury-2 (diffusion, reasoning=medium) | 15% | 15% | 100% | 100% |
| mercury-2 (diffusion, reasoning=high) | 30% | 25% | 100% | 100% |
Per-model trajectory across primer revisions (NURL compile %, same 4 held-out tasks):
| model | v2 (1,330 tok) | grammar (14,098) | v3 (1,604) |
|---|---|---|---|
| sonnet-4.6 | 60% | 0% | 80% |
| opus-4.8 | 85% | 75% | 70% |
| haiku-4.5 | 50% | 60% | 55% |
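To make the trajectory easier to read, the sketch below computes the change in compile rate between primer revisions, in percentage points. The numbers are copied from the table above; the dictionary layout and the `delta` helper are assumptions for illustration, not part of score.py.

```python
# NURL first-pass compile % per primer revision, copied from the table above.
TRAJECTORY = {
    "sonnet-4.6": {"v2": 60, "grammar": 0, "v3": 80},
    "opus-4.8": {"v2": 85, "grammar": 75, "v3": 70},
    "haiku-4.5": {"v2": 50, "grammar": 60, "v3": 55},
}


def delta(model: str, old: str, new: str) -> int:
    """Change in compile rate, in percentage points, between two primer revisions."""
    row = TRAJECTORY[model]
    return row[new] - row[old]


if __name__ == "__main__":
    for model in TRAJECTORY:
        print(
            f"{model}: v2->v3 {delta(model, 'v2', 'v3'):+d} pts, "
            f"grammar->v3 {delta(model, 'grammar', 'v3'):+d} pts"
        )
```

In the v3 run each model has 20 NURL samples (4 tasks, 5 samples each), so a single sample is worth 5 points; a 5-point move is one program.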
But fixing imports did not raise the ceiling — the binding constraint moved. With imports handled, the next ingrained habit becomes the dominant failure, and it differs by model:
- … `( . counts d )`, `( >= c 65 )`): all 6 failures, a C/Python habit NURL rejects (`(` is a call, never grouping).
- … `=` on an immutable binding).
- … `| . & >= == *`).
Reasoning effort helps the diffusion model, to a point. Re-running mercury-2 with reasoning_effort: high (vs medium) doubled its NURL compile rate (15→30%) and lifted correct output from 15% to 25%, with no change to Python/Rust (both 100% at either effort). But on the longest task (brackets) the high-effort model spent its entire 8,192-token budget reasoning and returned empty output every time (0/5, median 1 token): a budget interaction, not pure inability. It still trails every Claude model.
The diffusion model, sharply. mercury-2 writes Python and Rust at 100%, so it is a capable code model, yet it reaches at most 30% on NURL, almost entirely because it cannot stop wrapping operator expressions in parentheses. This is the cleanest demonstration that the gap is out-of-distribution habit, not raw capability.
Honest bottom line. A focused, targeted primer cue reliably fixes the specific failure it targets (v3 killed import-omission everywhere, the diffusion model included), and more reasoning effort buys a real but partial lift. But first-try accuracy on a language with zero training presence is gated by a sequence of in-distribution habits (imports, then grouping parentheses, then mutability), each surfacing as the last is fixed. No model reaches Python/Rust parity. What this run supports is that a regular grammar is learnable from a focused page and that individual failure modes are fixable with targeted cues, not that NURL matches mainstream languages for first-try generation.
This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.
The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1604 tokens (cl100k) / 1605 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
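A minimal sketch of what that overhead means per request, using the cl100k medians reported for claude-sonnet-4-6__v3 in the summary table. The `request_cost` helper is hypothetical, and the sum is a simplification: it ignores the task prompt itself and treats primer (input) tokens and program (output) tokens as interchangeable, which pricing usually does not.

```python
PRIMER_TOKENS = 1604  # primers/nurl.md under cl100k, as stated above

# Median program tokens (cl100k) for claude-sonnet-4-6__v3, from the summary table.
MEDIAN_PROGRAM_TOKENS = {"nurl": 179, "python": 35, "rust": 73}


def request_cost(language: str) -> int:
    """Program tokens plus the primer, which only NURL requests carry."""
    overhead = PRIMER_TOKENS if language == "nurl" else 0
    return MEDIAN_PROGRAM_TOKENS[language] + overhead


if __name__ == "__main__":
    for language in MEDIAN_PROGRAM_TOKENS:
        print(f"{language}: {request_cost(language)} tokens")
    share = PRIMER_TOKENS / request_cost("nurl")
    print(f"primer share of a NURL request: {share:.0%}")
```

Under these assumptions a median NURL request costs 1783 tokens, of which about 90% is the primer rather than the program.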
claude-sonnet-4-6__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 16/20 (80%) | 16/20 (80%) | 179 + 1604 primer | 178 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 36 |
| rust | 20/20 (100%) | 20/20 (100%) | 73 | 73 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| words | python | 5/5 (100%) | 5/5 (100%) | 36 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 58 |
| brackets | nurl | 5/5 (100%) | 5/5 (100%) | 483 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 124 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 235 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 35 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 71 |
| histogram | nurl | 1/5 (20%) | 1/5 (20%) | 234 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 75 |
</details>
claude-opus-4-8__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 14/20 (70%) | 14/20 (70%) | 189 + 1604 primer | 188 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 35 |
| rust | 20/20 (100%) | 20/20 (100%) | 88 | 89 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 5/5 (100%) | 5/5 (100%) | 176 |
| words | python | 5/5 (100%) | 5/5 (100%) | 32 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 96 |
| brackets | nurl | 4/5 (80%) | 4/5 (80%) | 325 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 118 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 198 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 74 |
| histogram | nurl | 0/5 (0%) | 0/5 (0%) | 213 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |
</details>
claude-haiku-4-5-20251001__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 11/20 (55%) | 9/20 (45%) | 227 + 1604 primer | 226 |
| python | 20/20 (100%) | 20/20 (100%) | 41 | 41 |
| rust | 20/20 (100%) | 20/20 (100%) | 85 | 86 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 0/5 (0%) | 0/5 (0%) | 208 |
| words | python | 5/5 (100%) | 5/5 (100%) | 38 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 105 |
| brackets | nurl | 4/5 (80%) | 4/5 (80%) | 327 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 143 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 213 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 2/5 (40%) | 0/5 (0%) | 499 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 85 |
</details>
mercury-2__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 3/20 (15%) | 3/20 (15%) | 227 + 1604 primer | 226 |
| python | 20/20 (100%) | 20/20 (100%) | 42 | 43 |
| rust | 20/20 (100%) | 20/20 (100%) | 94 | 95 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 0/5 (0%) | 0/5 (0%) | 184 |
| words | python | 5/5 (100%) | 5/5 (100%) | 35 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 98 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 480 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 135 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 224 |
| csv_sum | nurl | 2/5 (40%) | 2/5 (40%) | 168 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 1/5 (20%) | 1/5 (20%) | 258 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 42 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 82 |
</details>
mercury-2__v3high (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 6/20 (30%) | 5/20 (25%) | 91 + 1604 primer | 91 |
| python | 20/20 (100%) | 20/20 (100%) | 52 | 52 |
| rust | 20/20 (100%) | 20/20 (100%) | 93 | 94 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 0/5 (0%) | 0/5 (0%) | 193 |
| words | python | 5/5 (100%) | 5/5 (100%) | 35 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 99 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 1 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 135 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 234 |
| csv_sum | nurl | 4/5 (80%) | 3/5 (60%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 85 |
| histogram | nurl | 2/5 (40%) | 2/5 (40%) | 20 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 60 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |
</details>
… bench/TOKEN_EFFICIENCY.md. NURL's case is correctness and learnability, not token thrift; we are not hiding that it costs more tokens.

… tasks.json (verify each new expected output with bench/verify.sh first) and add models to firm it up.