
Tokeniser-aware token efficiency

Generated by bench/token_efficiency.py. Lower is better. Every benchmark computes a byte-identical result in all four languages (gated by bench/verify.sh), so these are genuinely matched programs, not loose translations.
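A byte-identical gate of the kind described above can be sketched as follows. This is not the contents of bench/verify.sh, which is not shown here; the function names and the demo commands are hypothetical, and two Python one-liners stand in for the real language runners.

```python
import hashlib
import subprocess
import sys


def output_digest(cmd: list[str]) -> str:
    """Run one benchmark command and return the SHA-256 of its raw stdout bytes."""
    result = subprocess.run(cmd, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def outputs_identical(commands: dict[str, list[str]]) -> bool:
    """True when every command produced byte-identical stdout."""
    digests = {name: output_digest(cmd) for name, cmd in commands.items()}
    return len(set(digests.values())) == 1


# Hypothetical stand-ins for two implementations of the same benchmark.
demo = {
    "impl_a": [sys.executable, "-c", "print(sum(range(10)))"],
    "impl_b": [sys.executable, "-c", "print(45)"],
}
print(outputs_identical(demo))
```

Comparing digests of raw bytes, rather than parsed values, means that even a trailing-whitespace difference between implementations fails the check.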

Headline (the honest result)

On today's production tokenisers, NURL is not more token-efficient than Python — it is substantially less so. Across 8 matched algorithmic benchmarks NURL costs a median ~1.7× the tokens of Python (cl100k/o200k/gpt2 all agree) and wins on only 1 of 8 (lcg, a four-line arithmetic loop). It is usually the most token-heavy of the four languages — heavier than Rust on most rows.

The cause is out-of-distribution fragmentation, not verbosity alone. The public BPE vocabularies measured here were trained on corpora that include Python, Rust, and JS, so def , for i in range, let mut, +=, and => are pre-merged into one or a few tokens, while NURL's glyph-and-prefix surface (@, ^, ~, : ~ i, ??, *i) was never seen in training and decomposes to near-character granularity. Note the bytes-on-disk table below, which is given for reference: in it NURL is smaller than Python only on lcg, and character count should not be treated as a proxy for token count.

Implication for the project copy: the raw token-count claim does not survive measurement and should be dropped. The defensible LLM-native arguments are (a) grammar regularity — a one-page LL(k≤4) grammar with local semantics — and (b) first-pass compile success, which this study does not measure. See bench/RESULTS.md and the ROADMAP's "generation-accuracy comparison" item.

Encoders used: cl100k, o200k, gpt2.

All counts use encode_ordinary — special-token sequences inside string literals stay literal data.
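The counting step can be sketched as below. It assumes the tiktoken library, whose `get_encoding` function and `Encoding.encode_ordinary` method match the names used in this document. The helper names `load_encoder` and `count_tokens`, and the mapping of "cl100k" to the tiktoken encoding name `cl100k_base`, are assumptions for illustration; bench/token_efficiency.py may be organised differently.

```python
from typing import Protocol


class OrdinaryEncoder(Protocol):
    """Anything exposing tiktoken's encode_ordinary(text) -> list of token ids."""

    def encode_ordinary(self, text: str) -> list[int]: ...


def load_encoder(name: str) -> OrdinaryEncoder:
    """Load a tiktoken encoding by name, e.g. "cl100k_base" (needs tiktoken installed)."""
    import tiktoken  # imported lazily so count_tokens works without the package

    return tiktoken.get_encoding(name)


def count_tokens(source: str, encoder: OrdinaryEncoder) -> int:
    """Count tokens, treating special-token text in the source as ordinary data."""
    return len(encoder.encode_ordinary(source))
```

With tiktoken installed, `count_tokens(source, load_encoder("cl100k_base"))` gives the count for one file. By default tiktoken's `encode` raises an error on text containing a special-token sequence such as `<|endoftext|>`, whereas `encode_ordinary` tokenises it as plain text, which is the behaviour wanted for benchmark sources that may contain such strings in literals.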

cl100k tokens per benchmark

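| bench | NURL | Python | Rust | Node | NURL/Python |
| --- | ---: | ---: | ---: | ---: | ---: |
| lcg | 122 | 148 | 85 | 155 | 0.82× |
| sieve | 294 | 154 | 194 | 149 | 1.91× |
| json_parse | 153 | 94 | 1143 | 93 | 1.63× |
| fib | 107 | 58 | 69 | 54 | 1.84× |
| collatz | 180 | 113 | 135 | 123 | 1.59× |
| matmul | 414 | 189 | 250 | 256 | 2.19× |
| quicksort | 367 | 244 | 287 | 272 | 1.50× |
| rot13 | 178 | 99 | 123 | 115 | 1.80× |

Summary: median 1.71× · NURL fewer tokens than Python in 1/8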
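The summary figures can be re-derived from the NURL and Python columns alone. The sketch below uses only the cl100k counts listed above; the variable names are illustrative. With eight benchmarks the median is the mean of the two middle ratios (json_parse and rot13).

```python
from statistics import median

# (NURL, Python) cl100k token counts copied from the table above.
CL100K = {
    "lcg": (122, 148),
    "sieve": (294, 154),
    "json_parse": (153, 94),
    "fib": (107, 58),
    "collatz": (180, 113),
    "matmul": (414, 189),
    "quicksort": (367, 244),
    "rot13": (178, 99),
}

# NURL/Python ratio per benchmark; below 1.0 means NURL used fewer tokens.
ratios = {bench: nurl / py for bench, (nurl, py) in CL100K.items()}
median_ratio = median(ratios.values())
wins = [bench for bench, ratio in ratios.items() if ratio < 1.0]

print(f"median {median_ratio:.2f}x, NURL wins: {wins}")
# prints: median 1.71x, NURL wins: ['lcg']
```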

o200k tokens per benchmark

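| bench | NURL | Python | Rust | Node | NURL/Python |
| --- | ---: | ---: | ---: | ---: | ---: |
| lcg | 122 | 149 | 86 | 158 | 0.82× |
| sieve | 291 | 155 | 194 | 149 | 1.88× |
| json_parse | 150 | 94 | 1140 | 95 | 1.60× |
| fib | 106 | 58 | 69 | 54 | 1.83× |
| collatz | 180 | 113 | 135 | 123 | 1.59× |
| matmul | 414 | 189 | 250 | 256 | 2.19× |
| quicksort | 367 | 244 | 287 | 272 | 1.50× |
| rot13 | 176 | 98 | 122 | 116 | 1.80× |

Summary: median 1.70× · NURL fewer tokens than Python in 1/8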

gpt2 tokens per benchmark

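| bench | NURL | Python | Rust | Node | NURL/Python |
| --- | ---: | ---: | ---: | ---: | ---: |
| lcg | 151 | 159 | 117 | 167 | 0.95× |
| sieve | 428 | 208 | 350 | 175 | 2.06× |
| json_parse | 239 | 135 | 2065 | 120 | 1.77× |
| fib | 126 | 80 | 106 | 69 | 1.57× |
| collatz | 234 | 144 | 205 | 155 | 1.62× |
| matmul | 709 | 275 | 426 | 358 | 2.58× |
| quicksort | 515 | 322 | 427 | 356 | 1.60× |
| rot13 | 238 | 117 | 184 | 141 | 2.03× |

Summary: median 1.70× · NURL fewer tokens than Python in 1/8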

Source size (bytes on disk, reference)

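| bench | NURL | Python | Rust | Node |
| --- | ---: | ---: | ---: | ---: |
| lcg | 368 | 430 | 251 | 413 |
| sieve | 809 | 387 | 592 | 365 |
| json_parse | 483 | 314 | 3940 | 318 |
| fib | 327 | 170 | 199 | 157 |
| collatz | 480 | 311 | 398 | 332 |
| matmul | 1133 | 495 | 708 | 661 |
| quicksort | 975 | 618 | 806 | 679 |
| rot13 | 570 | 319 | 424 | 378 |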
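One way to read the byte figures against the token figures is to put the two NURL/Python ratios side by side per benchmark. The sketch below uses only numbers from this document (byte sizes from this table, token counts from the cl100k table); the variable names are illustrative and it draws no conclusion of its own.

```python
# (NURL, Python) source sizes in bytes, from the bytes-on-disk table.
BYTES = {
    "lcg": (368, 430), "sieve": (809, 387), "json_parse": (483, 314),
    "fib": (327, 170), "collatz": (480, 311), "matmul": (1133, 495),
    "quicksort": (975, 618), "rot13": (570, 319),
}
# (NURL, Python) cl100k token counts, from the cl100k table.
TOKENS = {
    "lcg": (122, 148), "sieve": (294, 154), "json_parse": (153, 94),
    "fib": (107, 58), "collatz": (180, 113), "matmul": (414, 189),
    "quicksort": (367, 244), "rot13": (178, 99),
}

for bench, (nurl_bytes, py_bytes) in BYTES.items():
    nurl_tokens, py_tokens = TOKENS[bench]
    byte_ratio = nurl_bytes / py_bytes
    token_ratio = nurl_tokens / py_tokens
    print(f"{bench:<11} bytes {byte_ratio:.2f}x  tokens {token_ratio:.2f}x")

# Benchmarks where the NURL source is smaller on disk than the Python source.
smaller_on_disk = [b for b, (nurl, py) in BYTES.items() if nurl < py]
print(smaller_on_disk)
```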

Notes on interpretation

The out-of-distribution penalty is the main driver. These tokenisers were trained on corpora that include Python, Rust, and JS, so their BPE merges fold common multi-character sequences in those languages into one or a few tokens (def , function , for i in range, let mut, =>). NURL source is in none of these corpora, so its keyword- and glyph-shaped sequences decompose to near-character granularity. That is a property of the training corpus, not of the grammar's regularity: a tokeniser that had seen NURL would be expected to compress it better. Testing that is a separate, "what-could-be" experiment and is deliberately out of scope here.
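The mechanism can be illustrated with a toy character-level BPE. This is a deliberately simplified sketch, not how the production tokenisers are implemented (those operate on bytes, with pre-splitting rules and vocabularies of roughly 50k to 200k entries); the training corpus and function names are invented for the example. Sequences seen during training fuse into one token, while unseen glyphs stay at one token per character.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace each adjacent (a, b) with the fused token a + b, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        tokens = merge_pair(tokens, *best)
    return merges


def encode(text, merges):
    """Tokenise text by applying the learned merges in training order."""
    tokens = list(text)
    for a, b in merges:
        tokens = merge_pair(tokens, a, b)
    return tokens


# Toy training corpus: the tokeniser only ever sees "def ".
merges = train_bpe("def " * 20, num_merges=3)

print(encode("def ", merges))  # ['def ']
print(encode("@^~", merges))   # ['@', '^', '~']
```

The second input is shorter in characters than the first yet costs three times as many tokens, purely because none of its pairs appeared in the training corpus.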

This does not mean NURL has no LLM-native advantage — it means the advantage is not raw token count. Two claims are tangled in the project copy and must be separated: (1) "fewer tokens on today's tokenisers" — refuted here; (2) "a regular, locally-decodable grammar is easier for a model to generate correctly" — untested by this study and the more honest banner to fly. The next step the ROADMAP already lists is a controlled first-pass-compile-success comparison.

Caveat on the corpus. These are eight small, arithmetic- and systems-leaning micro-benchmarks, a regime that rewards terse infix math and penalises NURL's explicit prefix/type surface. They are not a representative sample of all agent-written code (no parsers, web handlers, glue, or data plumbing). They are, however, enough to retire the unqualified "more token-efficient" claim.

The Rust column for json_parse (1140–2065 tokens) reflects Serde derive macros and explicit type signatures that other languages do not require for the same task — a reminder that token count is task- and style-dependent in every language.