Tokeniser-aware token efficiency

Generated by bench/token_efficiency.py. Lower is better. Every benchmark computes a byte-identical result in all four languages (gated by bench/verify.sh), so these are genuinely matched programs, not loose translations.

Headline (the honest result)

On today's production tokenisers, NURL is not more token-efficient than Python — it is substantially less so. Across 8 matched algorithmic benchmarks NURL costs a median ~1.7× the tokens of Python (cl100k/o200k/gpt2 all agree) and wins on only 1 of 8 (lcg, a four-line arithmetic loop). It is usually the most token-heavy of the four languages — heavier than Rust on most rows.

The cause is out-of-distribution fragmentation, not verbosity alone. Every public BPE was trained on Python/Rust/JS corpora, so def , for i in range, let mut, +=, => are pre-merged into one token each, while NURL's glyph-and-prefix surface (@, →, ^, ~, : ~ i, ??, *i) has never been seen and decomposes to near-character granularity. Note the bytes-on-disk table below: on several benches NURL is fewer characters than Python yet more tokens — character-count was the wrong proxy all along.

Implication for the project copy: the raw token-count claim does not survive measurement and should be dropped. The defensible LLM-native arguments are (a) grammar regularity — a one-page LL(k≤4) grammar with local semantics — and (b) first-pass compile success, which this study does not measure. See bench/RESULTS.md and the ROADMAP's "generation-accuracy comparison" item.

Encoders used:

cl100k — cl100k_base. The tokeniser GPT-3.5 / GPT-4 / Claude legacy use.
o200k — o200k_base. GPT-4o and the o-series.
gpt2 — the smaller BPE that LLaMA derivatives share their byte-pair shape with at this granularity. Llama 3's own tokeniser is HF-gated; gpt2 is used as a proxy.

All counts use encode_ordinary — special-token sequences inside string literals stay literal data.

`cl100k` tokens per benchmark

bench	NURL	Python	Rust	Node	NURL/Python
`lcg`	122	148	85	155	0.82×
`sieve`	294	154	194	149	1.91×
`json_parse`	153	94	1143	93	1.63×
`fib`	107	58	69	54	1.84×
`collatz`	180	113	135	123	1.59×
`matmul`	414	189	250	256	2.19×
`quicksort`	367	244	287	272	1.50×
`rot13`	178	99	123	115	1.80×
summary					median 1.71× · NURL fewer tokens than Python in 1/8

`o200k` tokens per benchmark

bench	NURL	Python	Rust	Node	NURL/Python
`lcg`	122	149	86	158	0.82×
`sieve`	291	155	194	149	1.88×
`json_parse`	150	94	1140	95	1.60×
`fib`	106	58	69	54	1.83×
`collatz`	180	113	135	123	1.59×
`matmul`	414	189	250	256	2.19×
`quicksort`	367	244	287	272	1.50×
`rot13`	176	98	122	116	1.80×
summary					median 1.70× · NURL fewer tokens than Python in 1/8

`gpt2` tokens per benchmark

bench	NURL	Python	Rust	Node	NURL/Python
`lcg`	151	159	117	167	0.95×
`sieve`	428	208	350	175	2.06×
`json_parse`	239	135	2065	120	1.77×
`fib`	126	80	106	69	1.57×
`collatz`	234	144	205	155	1.62×
`matmul`	709	275	426	358	2.58×
`quicksort`	515	322	427	356	1.60×
`rot13`	238	117	184	141	2.03×
summary					median 1.70× · NURL fewer tokens than Python in 1/8

Source size (bytes on disk, reference)

bench	NURL	Python	Rust	Node
`lcg`	368	430	251	413
`sieve`	809	387	592	365
`json_parse`	483	314	3940	318
`fib`	327	170	199	157
`collatz`	480	311	398	332
`matmul`	1133	495	708	661
`quicksort`	975	618	806	679
`rot13`	570	319	424	378

Notes on interpretation

The out-of-distribution penalty is the whole story. These tokenisers were trained on Python / Rust / JS corpora, so their BPE merges common multi-character sequences in those languages into single tokens (def , function , for i in range, let mut, =>). NURL source is in none of these corpora, so its keyword- and glyph-shaped tokens decompose to near-character granularity. That is a property of the training corpus, not of the grammar's regularity — a tokeniser that had seen NURL would compress it far better (testing that is a separate, "what-could-be" experiment and is deliberately out of scope here).

This does not mean NURL has no LLM-native advantage — it means the advantage is not raw token count. Two claims are tangled in the project copy and must be separated: (1) "fewer tokens on today's tokenisers" — refuted here; (2) "a regular, locally-decodable grammar is easier for a model to generate correctly" — untested by this study and the more honest banner to fly. The next step the ROADMAP already lists is a controlled first-pass-compile-success comparison.

Caveat on the corpus. These are eight small, arithmetic- and systems-leaning micro-benchmarks, a regime that rewards terse infix math and penalises NURL's explicit prefix/type surface. They are not a representative sample of all agent-written code (no parsers, web handlers, glue, or data plumbing). They are, however, enough to retire the unqualified "more token-efficient" claim.

The Rust column for json_parse (1140–2065 tokens) reflects Serde derive macros and explicit type signatures that other languages do not require for the same task — a reminder that token count is task- and style-dependent in every language.