NURL generation accuracy & token report

Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
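The three definitions above can be sketched in a few lines. This is a minimal illustration, not the code in score.py: `Sample` and `score` are hypothetical names, and the sketch assumes each sample has already been built and run, so only a build flag and the captured stdout bytes remain.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Sample:
    """One first-pass attempt (hypothetical record, not score.py's type)."""
    compiled: bool  # True if the program built with zero changes
    stdout: bytes   # bytes the program printed; empty if it did not build


def score(samples: Iterable[Sample], expected: bytes) -> Tuple[int, int, int]:
    """Return (compile_hits, correct_hits, total) for one task."""
    samples = list(samples)
    compile_hits = sum(s.compiled for s in samples)
    # A sample is correct only if it built AND printed the exact expected bytes.
    correct_hits = sum(s.compiled and s.stdout == expected for s in samples)
    return compile_hits, correct_hits, len(samples)
```

Under this definition a program that compiles but prints the wrong answer counts toward compile and not toward correct, which is why the correct column can never exceed the compile column.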

Reproducible with any model — see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing noprimer is the no-reference condition.

Read every NURL number in this light: public LLMs contain zero lines of NURL in their training data, against millions of lines of Python and Rust. The NURL rows are a language the model has never seen, working only from the one-page reference below.

Headline — held-out accuracy across three models

NURL first-pass writability scales with model capability, but no model reaches Python/Rust parity on the held-out string tasks. Same 4 held-out tasks (words, brackets, csv_sum, histogram — not used to tune the primer), primer v2, 5 samples each. haiku/sonnet ran at temp 0.7; opus-4.8 rejects the temperature parameter so it ran at its default sampling (its samples were still diverse, so the rate is meaningful):

| model | NURL compile | NURL correct | Python | Rust |
|---|---|---|---|---|
| claude-opus-4-8 | 17/20 (85%) | 15/20 (75%) | 20/20 (100%) | 20/20 (100%) |
| claude-sonnet-4-6 | 12/20 (60%) | 12/20 (60%) | 20/20 (100%) | 20/20 (100%) |
| claude-haiku-4-5 | 10/20 (50%) | 10/20 (50%) | 20/20 (100%) | 20/20 (100%) |

The compile trend is monotonic in capability (50 → 60 → 85%), which suggests that stronger models internalise the one-page reference better. Even so, opus-4.8 still trails Python and Rust, which are perfect for every model, by 15 points on compile and 25 points on correct output.

One failure mode, shared by all three models, explains the gap: forgetting the $ import for stdlib string functions (nurl_str_len / nurl_str_get). In one additional case, opus invented a function, nurl_str_byte, that does not exist. The arithmetic in-sample tasks never used stdlib imports, which is exactly why their tuning (below) overstated NURL's accuracy.

The in-sample tuning arc (haiku), for context

| condition (haiku-4.5, 5 samples, temp 0.7) | compile | correct |
|---|---|---|
| v1 primer · 7 arithmetic tasks | 29/35 (82%) | 28/35 (80%) |
| v2 primer · same 7 tasks | 34/35 (97%) | 30/35 (85%) |
| v2 primer · 4 held-out string tasks | 10/20 (50%) | 10/20 (50%) |

The v2 immutability rule lifted in-sample compile to 97% and correct output to 85%, but it did not generalise (held-out 50%) because string tasks expose a failure the arithmetic tasks never did. Held-out numbers are the honest ones; in-sample tuning overstates accuracy.
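A note on reading the percentages: they appear to be truncated rather than rounded. 29/35 is 82.86% but is printed as 82%, and 30/35 is 85.71% but is printed as 85%. A minimal sketch of a formatter that reproduces the printed figures, assuming integer floor division (`fmt_rate` is a hypothetical name, not necessarily what score.py uses):

```python
def fmt_rate(hits: int, total: int) -> str:
    """Format a hit count as 'hits/total (pct%)', truncating the percentage.

    Assumes total > 0. Floor division matches the figures in this report;
    round() would give 83% and 86% for the two cases checked below.
    """
    pct = hits * 100 // total
    return f"{hits}/{total} ({pct}%)"


print(fmt_rate(29, 35))  # 29/35 (82%)
print(fmt_rate(30, 35))  # 30/35 (85%)
```

When comparing conditions that differ by a point or two, the raw counts are therefore the safer basis than the percentages.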

Bottom line. From one page, a model that has seen zero lines of NURL writes correct code at a rate that climbs from 50% (haiku) to 75% (opus) on held-out string tasks — genuinely usable and clearly capability-bound, but not yet at Python/Rust parity. The defensible public claim is "a regular grammar a model can learn from one page," with capability and stdlib-import caveats — not "as good as mainstream languages."

This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.

Primer — prompt overhead (charged once per NURL request)

The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1330 tokens (cl100k) / 1334 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
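To make the overhead concrete, the sketch below adds the primer to a program's token count. It is an illustration only: `request_cost` is a hypothetical helper, the primer sizes are the two figures quoted above, and the simple sum ignores that the primer is prompt-side while the program is generated output.

```python
# Primer sizes as reported above, keyed by tokenizer name.
PRIMER_TOKENS = {"cl100k": 1330, "o200k": 1334}


def request_cost(program_tokens: int, encoding: str = "cl100k",
                 primed: bool = True) -> int:
    """Program tokens plus the primer overhead, if the primer was sent."""
    overhead = PRIMER_TOKENS[encoding] if primed else 0
    return program_tokens + overhead


# A 195-token NURL program in the primed condition:
print(request_cost(195))                # 1525
# A 38-token Python program, which gets no reference:
print(request_cost(38, primed=False))   # 38
```

Seen this way, the primer dominates the per-request cost of a typical NURL program, which is the reason the tables list it separately as "+ 1330 primer" instead of folding it into the median.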

claude-opus-4-8__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 17/20 (85%) | 15/20 (75%) | 195 + 1330 primer | 194 |
| python | 20/20 (100%) | 20/20 (100%) | 38 | 38 |
| rust | 20/20 (100%) | 20/20 (100%) | 88 | 89 |

<details><summary>per-task detail</summary>

| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 3/5 (60%) | 3/5 (60%) | 195 |
| words | python | 5/5 (100%) | 5/5 (100%) | 32 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 96 |
| brackets | nurl | 5/5 (100%) | 3/5 (60%) | 416 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 118 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 201 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 44 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 74 |
| histogram | nurl | 4/5 (80%) | 4/5 (80%) | 238 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |

</details>

claude-sonnet-4-6__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 12/20 (60%) | 12/20 (60%) | 197 + 1330 primer | 196 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 36 |
| rust | 20/20 (100%) | 20/20 (100%) | 73 | 73 |

<details><summary>per-task detail</summary>

| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| words | python | 5/5 (100%) | 5/5 (100%) | 36 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 58 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 481 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 124 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 235 |
| csv_sum | nurl | 2/5 (40%) | 2/5 (40%) | 155 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 35 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 71 |
| histogram | nurl | 5/5 (100%) | 5/5 (100%) | 223 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 75 |

</details>

claude-haiku-4-5-20251001__heldout (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 10/20 (50%) | 10/20 (50%) | 215 + 1330 primer | 215 |
| python | 20/20 (100%) | 20/20 (100%) | 40 | 41 |
| rust | 20/20 (100%) | 20/20 (100%) | 95 | 96 |

<details><summary>per-task detail</summary>

| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| words | nurl | 4/5 (80%) | 4/5 (80%) | 179 |
| words | python | 5/5 (100%) | 5/5 (100%) | 38 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 112 |
| brackets | nurl | 1/5 (20%) | 1/5 (20%) | 385 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 151 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 213 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 184 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 0/5 (0%) | 0/5 (0%) | 255 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 85 |

</details>

claude-haiku-4-5-20251001__primer2 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 34/35 (97%) | 30/35 (85%) | 195 + 1330 primer | 193 |

<details><summary>per-task detail</summary>

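| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 5/5 (100%) | 5/5 (100%) | 59 |
| collatz | nurl | 4/5 (80%) | 4/5 (80%) | 161 |
| rot13 | nurl | 5/5 (100%) | 4/5 (80%) | 195 |
| matmul | nurl | 5/5 (100%) | 4/5 (80%) | 623 |
| quicksort | nurl | 5/5 (100%) | 3/5 (60%) | 388 |
| sieve | nurl | 5/5 (100%) | 5/5 (100%) | 292 |
| lcg | nurl | 5/5 (100%) | 5/5 (100%) | 77 |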

</details>

claude-haiku-4-5-20251001__t07 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 29/35 (82%) | 28/35 (80%) | 183 + 1330 primer | 181 |
| python | 35/35 (100%) | 30/35 (85%) | 110 | 110 |
| rust | 33/35 (94%) | 28/35 (80%) | 131 | 131 |

<details><summary>per-task detail</summary>

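| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 5/5 (100%) | 5/5 (100%) | 59 |
| fib | python | 5/5 (100%) | 5/5 (100%) | 52 |
| fib | rust | 5/5 (100%) | 5/5 (100%) | 68 |
| collatz | nurl | 2/5 (40%) | 2/5 (40%) | 154 |
| collatz | python | 5/5 (100%) | 5/5 (100%) | 100 |
| collatz | rust | 5/5 (100%) | 5/5 (100%) | 131 |
| rot13 | nurl | 4/5 (80%) | 4/5 (80%) | 183 |
| rot13 | python | 5/5 (100%) | 0/5 (0%) | 142 |
| rot13 | rust | 5/5 (100%) | 0/5 (0%) | 146 |
| matmul | nurl | 3/5 (60%) | 3/5 (60%) | 602 |
| matmul | python | 5/5 (100%) | 5/5 (100%) | 122 |
| matmul | rust | 5/5 (100%) | 5/5 (100%) | 313 |
| quicksort | nurl | 5/5 (100%) | 4/5 (80%) | 381 |
| quicksort | python | 5/5 (100%) | 5/5 (100%) | 242 |
| quicksort | rust | 3/5 (60%) | 3/5 (60%) | 320 |
| sieve | nurl | 5/5 (100%) | 5/5 (100%) | 247 |
| sieve | python | 5/5 (100%) | 5/5 (100%) | 122 |
| sieve | rust | 5/5 (100%) | 5/5 (100%) | 130 |
| lcg | nurl | 5/5 (100%) | 5/5 (100%) | 77 |
| lcg | python | 5/5 (100%) | 5/5 (100%) | 71 |
| lcg | rust | 5/5 (100%) | 5/5 (100%) | 64 |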

</details>

claude-haiku-4-5-20251001__noprimer (5 sample(s)/task, NURL primer withheld (raw recall))

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
|---|---|---|---|---|
| nurl | 0/35 (0%) | 0/35 (0%) | 163 | 163 |

<details><summary>per-task detail</summary>

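| task | language | compile | correct | median tokens cl100k |
|---|---|---|---|---|
| fib | nurl | 0/5 (0%) | 0/5 (0%) | 67 |
| collatz | nurl | 0/5 (0%) | 0/5 (0%) | 140 |
| rot13 | nurl | 0/5 (0%) | 0/5 (0%) | 165 |
| matmul | nurl | 0/5 (0%) | 0/5 (0%) | 317 |
| quicksort | nurl | 0/5 (0%) | 0/5 (0%) | 348 |
| sieve | nurl | 0/5 (0%) | 0/5 (0%) | 207 |
| lcg | nurl | 0/5 (0%) | 0/5 (0%) | 79 |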

</details>

How to read this (interpretation)