
NURL generation accuracy & token report

Generated by bench/genacc/score.py. Every figure is first-pass: the model's first and only attempt, with no human edits and no self-repair. compile = built with zero changes; correct = printed the exact expected bytes (a program that compiles but prints the wrong answer is a miss); tokens = the BPE token cost of the program the model emitted.
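The three definitions above can be sketched as a per-sample scoring function. This is a minimal illustration, not the actual logic of score.py: the names `Score`, `score_sample` and `whitespace_tokens` are hypothetical, and the whitespace tokenizer is only a stand-in for the BPE encoders the report uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Score:
    compiled: bool  # built with zero changes
    correct: bool   # stdout matched the expected bytes exactly
    tokens: int     # token cost of the emitted program


def score_sample(
    source: str,
    build_ok: bool,
    stdout: bytes,
    expected: bytes,
    count_tokens: Callable[[str], int],
) -> Score:
    """Score one first-pass sample (hypothetical helper, not score.py itself)."""
    # A program that failed to build cannot be correct, and a program that
    # built but printed different bytes is counted as a miss.
    correct = build_ok and stdout == expected
    return Score(compiled=build_ok, correct=correct, tokens=count_tokens(source))


def whitespace_tokens(text: str) -> int:
    """Stand-in tokenizer for illustration only; the report uses BPE counts."""
    return len(text.split())


hit = score_sample("print 42", True, b"42\n", b"42\n", whitespace_tokens)
miss = score_sample("print 42", True, b"42", b"42\n", whitespace_tokens)
```

Here `miss` compiles but omits the trailing newline, so it counts toward compile and not toward correct, which is the distinction the definitions draw.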

Reproducible with any model — see bench/genacc/README.md. The conditions below are whichever solution directories were passed in; a name containing noprimer is the no-reference condition.

Read every NURL number in this light: public LLMs contain zero lines of NURL in their training data, against (order-of-magnitude) tens of millions of lines of Python and millions of lines of Rust. The NURL rows measure a language the model has never seen.

Headline — primer v3: the targeted import cue works, then the constraint shifts

Primer v3 keeps the focused ~1.3k-token reference and adds one prominent, general rule: stdlib functions (above all nurl_str_*) need their $ import — stated at the very top and shown in the canonical worked example. It is 1,604 tokens (a ~270-token addition, not the 14k grammar dump that failed — see RESULTS_old.md and primers/nurl_old.md). This run adds a fourth model: mercury-2, Inception Labs' diffusion LLM (via generate_inception.py), at two reasoning efforts.

The targeted cue worked. Import-omission — the dominant held-out failure in every prior run — collapsed across all models: sonnet 20→0, opus →0, haiku 6→1, mercury 0.

Held-out (4 string tasks, 5 samples, primer v3):

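| model | NURL compile | NURL correct | Python | Rust |
| --- | --- | --- | --- | --- |
| claude-sonnet-4-6 | 80% | 80% | 100% | 100% |
| claude-opus-4-8 | 70% | 70% | 100% | 100% |
| claude-haiku-4-5 | 55% | 45% | 100% | 100% |
| mercury-2 (diffusion, reasoning=medium) | 15% | 15% | 100% | 100% |
| mercury-2 (diffusion, reasoning=high) | 30% | 25% | 100% | 100% |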

Per-model trajectory across primer revisions (NURL compile %, same 4 held-out tasks):

| model | v2 (1,330 tok) | grammar (14,098) | v3 (1,604) |
| --- | --- | --- | --- |
| sonnet-4.6 | 60% | 0% | 80% |
| opus-4.8 | 85% | 75% | 70% |
| haiku-4.5 | 50% | 60% | 55% |

But fixing imports did not raise the ceiling — the binding constraint moved. With imports handled, the next ingrained habit becomes the dominant failure, and it differs by model:

Reasoning effort helps the diffusion model — to a point. Re-running mercury-2 with reasoning_effort: high (vs medium) roughly doubled its NURL accuracy (15→30% compile, 15→25% correct), with no change to Python/Rust (both 100% at either effort). But on the longest task (brackets) the high-effort model spent its entire 8,192-token budget reasoning and returned empty output every single time (0/5, median 1 token) — a budget interaction, not pure inability. It still trails every Claude model.

The diffusion model, sharply. mercury-2 writes Python and Rust at 100%, so it is a capable code model, yet it reaches at most 30% on NURL, almost entirely because it cannot stop wrapping operator expressions in parentheses. This is the cleanest demonstration that the gap is out-of-distribution habit, not raw capability.

Honest bottom line. A focused, targeted primer cue reliably fixes the specific failure it targets (v3 killed import-omission everywhere, the diffusion model included), and more reasoning effort buys a real but partial lift. But first-try accuracy on a language with zero training presence is gated by a sequence of in-distribution habits (imports, then grouping parentheses, then mutability), each surfacing as the last is fixed. No model reaches Python/Rust parity. What these results support is that a regular grammar is learnable from a focused page and that individual failure modes are fixable with targeted cues, not that NURL matches mainstream languages for first-try generation.

This headline is curated. Everything from "Primer — prompt overhead" onward is regenerated by score.py --md; re-add this headline after regenerating.

Primer — prompt overhead (charged once per NURL request)

The NURL reference (primers/nurl.md) handed to the model in the primed condition is 1604 tokens (cl100k) / 1605 tokens (o200k). Python and Rust get no such reference. This overhead is spent on every NURL request, separately from the program the model writes — it is broken out here so it is not hidden.
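As a rough illustration of what this overhead means per request, the sketch below adds the primer to a median program size. The helper `request_cost` is hypothetical; the numbers are the cl100k figures reported for claude-sonnet-4-6__v3 in this report.

```python
PRIMER_TOKENS = 1604  # cl100k size of primers/nurl.md, as stated above


def request_cost(program_tokens: int, primed: bool) -> int:
    """Tokens attributable to one request: the program, plus the primer if sent."""
    return program_tokens + (PRIMER_TOKENS if primed else 0)


# Median program sizes (cl100k) for claude-sonnet-4-6__v3.
nurl_total = request_cost(179, primed=True)     # 179 + 1604 = 1783
python_total = request_cost(35, primed=False)   # 35, no reference is sent
```

On these medians the primer, not the emitted program, accounts for most of the tokens in a NURL request, which is why the tables report it as a separate `+ 1604 primer` term.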

claude-sonnet-4-6__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
| --- | --- | --- | --- | --- |
| nurl | 16/20 (80%) | 16/20 (80%) | 179 + 1604 primer | 178 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 36 |
| rust | 20/20 (100%) | 20/20 (100%) | 73 | 73 |

<details><summary>per-task detail</summary>

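| task | language | compile | correct | median tokens cl100k |
| --- | --- | --- | --- | --- |
| words | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| words | python | 5/5 (100%) | 5/5 (100%) | 36 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 58 |
| brackets | nurl | 5/5 (100%) | 5/5 (100%) | 483 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 124 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 235 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 35 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 71 |
| histogram | nurl | 1/5 (20%) | 1/5 (20%) | 234 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 75 |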

</details>
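The per-task rows roll up into the summary row by summing passes and samples across tasks. A minimal sketch of that aggregation, using the claude-sonnet-4-6__v3 NURL compile counts; `roll_up` is a hypothetical helper and score.py may compute this differently.

```python
from typing import Dict, Tuple

# (passes, samples) per task, copied from the claude-sonnet-4-6__v3 NURL rows.
nurl_compile: Dict[str, Tuple[int, int]] = {
    "words": (5, 5),
    "brackets": (5, 5),
    "csv_sum": (5, 5),
    "histogram": (1, 5),
}


def roll_up(per_task: Dict[str, Tuple[int, int]]) -> Tuple[int, int, int]:
    """Sum passes and samples over all tasks and return a whole-number percent."""
    passed = sum(p for p, _ in per_task.values())
    total = sum(n for _, n in per_task.values())
    return passed, total, round(100 * passed / total)


passed, total, pct = roll_up(nurl_compile)
summary = f"{passed}/{total} ({pct}%)"
print(summary)  # 16/20 (80%)
```

Because every task contributes the same five samples, a single weak task (histogram here) moves the headline figure by up to 25 points.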

claude-opus-4-8__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
| --- | --- | --- | --- | --- |
| nurl | 14/20 (70%) | 14/20 (70%) | 189 + 1604 primer | 188 |
| python | 20/20 (100%) | 20/20 (100%) | 35 | 35 |
| rust | 20/20 (100%) | 20/20 (100%) | 88 | 89 |

<details><summary>per-task detail</summary>

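| task | language | compile | correct | median tokens cl100k |
| --- | --- | --- | --- | --- |
| words | nurl | 5/5 (100%) | 5/5 (100%) | 176 |
| words | python | 5/5 (100%) | 5/5 (100%) | 32 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 96 |
| brackets | nurl | 4/5 (80%) | 4/5 (80%) | 325 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 118 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 198 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 74 |
| histogram | nurl | 0/5 (0%) | 0/5 (0%) | 213 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |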

</details>

claude-haiku-4-5-20251001__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
| --- | --- | --- | --- | --- |
| nurl | 11/20 (55%) | 9/20 (45%) | 227 + 1604 primer | 226 |
| python | 20/20 (100%) | 20/20 (100%) | 41 | 41 |
| rust | 20/20 (100%) | 20/20 (100%) | 85 | 86 |

<details><summary>per-task detail</summary>

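| task | language | compile | correct | median tokens cl100k |
| --- | --- | --- | --- | --- |
| words | nurl | 0/5 (0%) | 0/5 (0%) | 208 |
| words | python | 5/5 (100%) | 5/5 (100%) | 38 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 105 |
| brackets | nurl | 4/5 (80%) | 4/5 (80%) | 327 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 143 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 213 |
| csv_sum | nurl | 5/5 (100%) | 5/5 (100%) | 178 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 2/5 (40%) | 0/5 (0%) | 499 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 29 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 85 |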

</details>

mercury-2__v3 (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
| --- | --- | --- | --- | --- |
| nurl | 3/20 (15%) | 3/20 (15%) | 227 + 1604 primer | 226 |
| python | 20/20 (100%) | 20/20 (100%) | 42 | 43 |
| rust | 20/20 (100%) | 20/20 (100%) | 94 | 95 |

<details><summary>per-task detail</summary>

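| task | language | compile | correct | median tokens cl100k |
| --- | --- | --- | --- | --- |
| words | nurl | 0/5 (0%) | 0/5 (0%) | 184 |
| words | python | 5/5 (100%) | 5/5 (100%) | 35 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 98 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 480 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 135 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 224 |
| csv_sum | nurl | 2/5 (40%) | 2/5 (40%) | 168 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 42 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 75 |
| histogram | nurl | 1/5 (20%) | 1/5 (20%) | 258 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 42 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 82 |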

</details>

mercury-2__v3high (5 sample(s)/task, as generated)

| language | first-pass compile | correct output | median tokens cl100k | median tokens o200k |
| --- | --- | --- | --- | --- |
| nurl | 6/20 (30%) | 5/20 (25%) | 91 + 1604 primer | 91 |
| python | 20/20 (100%) | 20/20 (100%) | 52 | 52 |
| rust | 20/20 (100%) | 20/20 (100%) | 93 | 94 |

<details><summary>per-task detail</summary>

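| task | language | compile | correct | median tokens cl100k |
| --- | --- | --- | --- | --- |
| words | nurl | 0/5 (0%) | 0/5 (0%) | 193 |
| words | python | 5/5 (100%) | 5/5 (100%) | 35 |
| words | rust | 5/5 (100%) | 5/5 (100%) | 99 |
| brackets | nurl | 0/5 (0%) | 0/5 (0%) | 1 |
| brackets | python | 5/5 (100%) | 5/5 (100%) | 135 |
| brackets | rust | 5/5 (100%) | 5/5 (100%) | 234 |
| csv_sum | nurl | 4/5 (80%) | 3/5 (60%) | 163 |
| csv_sum | python | 5/5 (100%) | 5/5 (100%) | 38 |
| csv_sum | rust | 5/5 (100%) | 5/5 (100%) | 85 |
| histogram | nurl | 2/5 (40%) | 2/5 (40%) | 20 |
| histogram | python | 5/5 (100%) | 5/5 (100%) | 60 |
| histogram | rust | 5/5 (100%) | 5/5 (100%) | 81 |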

</details>

How to read this (interpretation)