Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
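The compile and correct definitions above can be read as a small scoring rule. The sketch below is an illustration only, not the code in score.py: `SampleResult`, `score_sample` and `summarise` are hypothetical names, and the build step is assumed to have already produced a pass/fail flag and the program's captured stdout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleResult:
    compiled: bool  # built with zero changes
    correct: bool   # printed the exact expected bytes


def score_sample(build_ok: bool, stdout: bytes, expected: bytes) -> SampleResult:
    """Score one first-pass sample (hypothetical helper, not from score.py)."""
    # A program that fails to build cannot be correct; one that builds but
    # prints different bytes (even a missing newline) counts as a miss.
    correct = build_ok and stdout == expected
    return SampleResult(compiled=build_ok, correct=correct)


def summarise(results: list[SampleResult]) -> tuple[str, str]:
    """Return the compile and correct counts in the 'k/n' form used by the tables."""
    n = len(results)
    compiled = sum(r.compiled for r in results)
    correct = sum(r.correct for r in results)
    return f"{compiled}/{n}", f"{correct}/{n}"
```

Comparing bytes rather than decoded, whitespace-normalised text is the strict reading of "exact expected bytes"; a more lenient comparison would raise the correct column without changing compile.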
Reproducible with any model; see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing `noprimer` is the no-reference condition.
Read every NURL number in this light: public LLMs contain zero lines of NURL in their training data, against millions of lines of Python and Rust. The NURL rows are a language the model has never seen, working only from the one-page reference below.
NURL first-pass writability scales with model capability, but no model reaches Python/Rust parity on the held-out string tasks. Same 4 held-out tasks (words, brackets, csv_sum, histogram — not used to tune the primer), primer v2, 5 samples each. haiku/sonnet ran at temp 0.7; opus-4.8 rejects the temperature parameter so it ran at its default sampling (its samples were still diverse, so the rate is meaningful):
| model | NURL compile | NURL correct | Python | Rust |
|---|---|---|---|---|
| claude-opus-4-8 | 17/20 (85%) | 15/20 (75%) | 20/20 (100%) | 20/20 (100%) |
| claude-sonnet-4-6 | 12/20 (60%) | 12/20 (60%) | 20/20 (100%) | 20/20 (100%) |
| claude-haiku-4-5 | 10/20 (50%) | 10/20 (50%) | 20/20 (100%) | 20/20 (100%) |
The compile trend is monotonic in capability (50 → 60 → 85%), which suggests stronger models internalise the one-page reference better. Even so, opus-4.8 is 25 points short on correct output (75% versus 100%) compared with Python and Rust, which are perfect for every model on these tasks.
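The percentages and the 25-point gap can be recomputed from the raw counts in the table. This is a minimal sketch with the counts copied by hand; `pct` is a hypothetical helper, and the integer floor is an assumption inferred from the printed figures (29/35 appears as 82% elsewhere in this report, which a rounded value would not give).

```python
# (compiled, correct, total) for NURL on the four held-out tasks,
# copied from the table above.
HELDOUT_NURL = {
    "claude-haiku-4-5": (10, 10, 20),
    "claude-sonnet-4-6": (12, 12, 20),
    "claude-opus-4-8": (17, 15, 20),
}


def pct(passed: int, total: int) -> int:
    """Whole-number percentage, floored (assumed to match the report's formatting)."""
    return 100 * passed // total


for model, (compiled, correct, total) in HELDOUT_NURL.items():
    # Python and Rust are 20/20 for every model, so parity is 100%.
    gap = 100 - pct(correct, total)
    print(
        f"{model}: compile {pct(compiled, total)}%, "
        f"correct {pct(correct, total)}%, {gap} points short of parity"
    )
```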
One failure, shared by all three models, explains the gap: forgetting the $ import for stdlib string functions (nurl_str_len / nurl_str_get), plus one case where opus invented a nurl_str_byte function that does not exist. The arithmetic in-sample tasks never used stdlib imports, which is exactly why their tuning (below) overstated NURL's accuracy.
| condition (haiku-4.5, 5 samples, temp 0.7) | compile | correct |
|---|---|---|
| v1 primer · 7 arithmetic tasks | 29/35 (82%) | 28/35 (80%) |
| v2 primer · same 7 tasks | 34/35 (97%) | 30/35 (85%) |
| v2 primer · 4 held-out string tasks | 10/20 (50%) | 10/20 (50%) |
The v2 immutability rule lifted in-sample compile from 82% to 97% (correct: 80% to 85%), but the gain did not generalise (held-out 50%) because string tasks expose a failure the arithmetic tasks never did. Held-out numbers are the honest ones; in-sample tuning overstates.
Bottom line. From one page, a model that has seen zero lines of NURL writes correct code at a rate that climbs from 50% (haiku) to 75% (opus) on held-out string tasks — genuinely usable and clearly capability-bound, but not yet at Python/Rust parity. The defensible public claim is "a regular grammar a model can learn from one page," with capability and stdlib-import caveats — not "as good as mainstream languages."
This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.
The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1330 tokens (cl100k) / 1334 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
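To make the overhead concrete, the per-request cost can be added up from figures in this report. The sketch below uses the cl100k primer size stated above and the median program sizes from the claude-opus-4-8 held-out table; `request_tokens` is a hypothetical helper, and summing prompt and output tokens into one number is a simplification, since the two are usually billed differently.

```python
PRIMER_TOKENS_CL100K = 1330  # primers/nurl.md, as stated above

# Median program size in cl100k tokens, claude-opus-4-8 held-out condition.
MEDIAN_PROGRAM_TOKENS = {"nurl": 195, "python": 38, "rust": 88}


def request_tokens(language: str) -> int:
    """Primer (NURL only) plus the median emitted program, in cl100k tokens."""
    primer = PRIMER_TOKENS_CL100K if language == "nurl" else 0
    return primer + MEDIAN_PROGRAM_TOKENS[language]


for language in MEDIAN_PROGRAM_TOKENS:
    print(f"{language}: {request_tokens(language)} tokens per request")
```

Under these assumptions a NURL request costs 1525 tokens against 38 for Python and 88 for Rust, and the primer accounts for most of the difference.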
claude-opus-4-8__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 17/20 (85%) | 15/20 (75%) | 195 + 1330 primer | 194 |
| python | 20/20 (100%) | 20/20 (100%) | 38 | 38 |
| rust | 20/20 (100%) | 20/20 (100%) | 88 | 89 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 3/5 (60%) | 3/5 (60%) | 195 |
| words | python | 5/5 (100%) | 5/5 (100%) | 32 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 96 |
| brackets | nurl | 5/5 (100%) | 3/5 (60%) | 416 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 118 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 201 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 44 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 74 |
| histogram | nurl | 4/5 (80%) | 4/5 (80%) | 238 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |
</details>
claude-sonnet-4-6__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 12/20 (60%) | 12/20 (60%) | 197 + 1330 primer | 196 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 36 |
| rust | 20/20 (100%) | 20/20 (100%) | 73 | 73 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| words | python | 5/5 (100%) | 5/5 (100%) | 36 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 58 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 481 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 124 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 235 |
| csv_sum | nurl | 2/5 (40%) | 2/5 (40%) | 155 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 35 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 71 |
| histogram | nurl | 5/5 (100%) | 5/5 (100%) | 223 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 75 |
</details>
claude-haiku-4-5-20251001__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 10/20 (50%) | 10/20 (50%) | 215 + 1330 primer | 215 |
| python | 20/20 (100%) | 20/20 (100%) | 40 | 41 |
| rust | 20/20 (100%) | 20/20 (100%) | 95 | 96 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 4/5 (80%) | 4/5 (80%) | 179 |
| words | python | 5/5 (100%) | 5/5 (100%) | 38 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 112 |
| brackets | nurl | 1/5 (20%) | 1/5 (20%) | 385 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 151 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 213 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 184 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 0/5 (0%) | 0/5 (0%) | 255 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 85 |
</details>
claude-haiku-4-5-20251001__primer2 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 34/35 (97%) | 30/35 (85%) | 195 + 1330 primer | 193 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 5/5 (100%) | 5/5 (100%) | 59 |
| collatz | nurl | 4/5 (80%) | 4/5 (80%) | 161 |
| rot13 | nurl | 5/5 (100%) | 4/5 (80%) | 195 |
| matmul | nurl | 5/5 (100%) | 4/5 (80%) | 623 |
| quicksort | nurl | 5/5 (100%) | 3/5 (60%) | 388 |
| sieve | nurl | 5/5 (100%) | 5/5 (100%) | 292 |
| lcg | nurl | 5/5 (100%) | 5/5 (100%) | 77 |
</details>
claude-haiku-4-5-20251001__t07 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 29/35 (82%) | 28/35 (80%) | 183 + 1330 primer | 181 |
| python | 35/35 (100%) | 30/35 (85%) | 110 | 110 |
| rust | 33/35 (94%) | 28/35 (80%) | 131 | 131 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 5/5 (100%) | 5/5 (100%) | 59 |
| fib | python | 5/5 (100%) | 5/5 (100%) | 52 |
| fib | rust | 5/5 (100%) | 5/5 (100%) | 68 |
| collatz | nurl | 2/5 (40%) | 2/5 (40%) | 154 |
| collatz | python | 5/5 (100%) | 5/5 (100%) | 100 |
| collatz | rust | 5/5 (100%) | 5/5 (100%) | 131 |
| rot13 | nurl | 4/5 (80%) | 4/5 (80%) | 183 |
| rot13 | python | 5/5 (100%) | 0/5 (0%) | 142 |
| rot13 | rust | 5/5 (100%) | 0/5 (0%) | 146 |
| matmul | nurl | 3/5 (60%) | 3/5 (60%) | 602 |
| matmul | python | 5/5 (100%) | 5/5 (100%) | 122 |
| matmul | rust | 5/5 (100%) | 5/5 (100%) | 313 |
| quicksort | nurl | 5/5 (100%) | 4/5 (80%) | 381 |
| quicksort | python | 5/5 (100%) | 5/5 (100%) | 242 |
| quicksort | rust | 3/5 (60%) | 3/5 (60%) | 320 |
| sieve | nurl | 5/5 (100%) | 5/5 (100%) | 247 |
| sieve | python | 5/5 (100%) | 5/5 (100%) | 122 |
| sieve | rust | 5/5 (100%) | 5/5 (100%) | 130 |
| lcg | nurl | 5/5 (100%) | 5/5 (100%) | 77 |
| lcg | python | 5/5 (100%) | 5/5 (100%) | 71 |
| lcg | rust | 5/5 (100%) | 5/5 (100%) | 64 |
</details>
claude-haiku-4-5-20251001__noprimer (5 sample(s)/task, NURL primer withheld (raw recall))

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 0/35 (0%) | 0/35 (0%) | 163 | 163 |
<details><summary>per-task detail</summary>
| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 0/5 (0%) | 0/5 (0%) | 67 |
| collatz | nurl | 0/5 (0%) | 0/5 (0%) | 140 |
| rot13 | nurl | 0/5 (0%) | 0/5 (0%) | 165 |
| matmul | nurl | 0/5 (0%) | 0/5 (0%) | 317 |
| quicksort | nurl | 0/5 (0%) | 0/5 (0%) | 348 |
| sieve | nurl | 0/5 (0%) | 0/5 (0%) | 207 |
| lcg | nurl | 0/5 (0%) | 0/5 (0%) | 79 |
</details>
bench/TOKEN_EFFICIENCY.md. NURL's case is correctness and learnability, not token thrift; we are not hiding that it costs more tokens.

tasks.json (verify each new expected output with bench/verify.sh first) and add models to firm it up.
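Each held-out rate in this report rests on 20 programs (4 tasks × 5 samples), which is one reason more tasks and models would firm the numbers up. The sketch below is not part of score.py: it computes a standard 95% Wilson score interval for a pass rate, with `wilson_interval` as a hypothetical helper. It treats every sample as independent, which is optimistic because the 5 samples of one task share a prompt.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# NURL correct-output counts on the held-out tasks, from the tables in this report.
for model, passed in [("claude-haiku-4-5", 10), ("claude-opus-4-8", 15)]:
    lo, hi = wilson_interval(passed, 20)
    print(f"{model}: {passed}/20, interval {lo:.2f} to {hi:.2f}")
```

With 20 samples the intervals are wide, roughly 30% to 70% for 10/20 and 53% to 89% for 15/20, so they overlap; the capability trend is suggestive at this sample size rather than firmly established.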