Generated bybench/token_efficiency.py. Lower is better. Every benchmark computes a byte-identical result in all four languages (gated bybench/verify.sh), so these are genuinely matched programs, not loose translations.
On today's production tokenisers, NURL is not more token-efficient than Python — it is substantially less so. Across 8 matched algorithmic benchmarks NURL costs a median ~1.7× the tokens of Python (cl100k/o200k/gpt2 all agree) and wins on only 1 of 8 (lcg, a four-line arithmetic loop). It is usually the most token-heavy of the four languages — heavier than Rust on most rows.
The cause is out-of-distribution fragmentation, not verbosity alone. Every public BPE was trained on Python/Rust/JS corpora, so def , for i in range, let mut, +=, => are pre-merged into one token each, while NURL's glyph-and-prefix surface (@, →, ^, ~, : ~ i, ??, *i) has never been seen and decomposes to near-character granularity. Note the bytes-on-disk table below: on several benches NURL is fewer characters than Python yet more tokens — character-count was the wrong proxy all along.
Implication for the project copy: the raw token-count claim does not survive measurement and should be dropped. The defensible LLM-native arguments are (a) grammar regularity — a one-page LL(k≤4) grammar with local semantics — and (b) first-pass compile success, which this study does not measure. See bench/RESULTS.md and the ROADMAP's "generation-accuracy comparison" item.
Encoders used:
cl100k — cl100k_base. The tokeniser GPT-3.5 / GPT-4 / Claude legacy use.o200k — o200k_base. GPT-4o and the o-series.gpt2 — the smaller BPE that LLaMA derivatives share their byte-pair shape with at this granularity. Llama 3's own tokeniser is HF-gated; gpt2 is used as a proxy.
All counts use encode_ordinary — special-token sequences inside string literals stay literal data.
cl100k tokens per benchmark| bench | NURL | Python | Rust | Node | NURL/Python |
|---|---|---|---|---|---|
lcg | 122 | 148 | 85 | 155 | 0.82× |
sieve | 294 | 154 | 194 | 149 | 1.91× |
json_parse | 153 | 94 | 1143 | 93 | 1.63× |
fib | 107 | 58 | 69 | 54 | 1.84× |
collatz | 180 | 113 | 135 | 123 | 1.59× |
matmul | 414 | 189 | 250 | 256 | 2.19× |
quicksort | 367 | 244 | 287 | 272 | 1.50× |
rot13 | 178 | 99 | 123 | 115 | 1.80× |
| summary | median 1.71× · NURL fewer tokens than Python in 1/8 |
o200k tokens per benchmark| bench | NURL | Python | Rust | Node | NURL/Python |
|---|---|---|---|---|---|
lcg | 122 | 149 | 86 | 158 | 0.82× |
sieve | 291 | 155 | 194 | 149 | 1.88× |
json_parse | 150 | 94 | 1140 | 95 | 1.60× |
fib | 106 | 58 | 69 | 54 | 1.83× |
collatz | 180 | 113 | 135 | 123 | 1.59× |
matmul | 414 | 189 | 250 | 256 | 2.19× |
quicksort | 367 | 244 | 287 | 272 | 1.50× |
rot13 | 176 | 98 | 122 | 116 | 1.80× |
| summary | median 1.70× · NURL fewer tokens than Python in 1/8 |
gpt2 tokens per benchmark| bench | NURL | Python | Rust | Node | NURL/Python |
|---|---|---|---|---|---|
lcg | 151 | 159 | 117 | 167 | 0.95× |
sieve | 428 | 208 | 350 | 175 | 2.06× |
json_parse | 239 | 135 | 2065 | 120 | 1.77× |
fib | 126 | 80 | 106 | 69 | 1.57× |
collatz | 234 | 144 | 205 | 155 | 1.62× |
matmul | 709 | 275 | 426 | 358 | 2.58× |
quicksort | 515 | 322 | 427 | 356 | 1.60× |
rot13 | 238 | 117 | 184 | 141 | 2.03× |
| summary | median 1.70× · NURL fewer tokens than Python in 1/8 |
| bench | NURL | Python | Rust | Node |
|---|---|---|---|---|
lcg | 368 | 430 | 251 | 413 |
sieve | 809 | 387 | 592 | 365 |
json_parse | 483 | 314 | 3940 | 318 |
fib | 327 | 170 | 199 | 157 |
collatz | 480 | 311 | 398 | 332 |
matmul | 1133 | 495 | 708 | 661 |
quicksort | 975 | 618 | 806 | 679 |
rot13 | 570 | 319 | 424 | 378 |
The out-of-distribution penalty is the whole story. These tokenisers were trained on Python / Rust / JS corpora, so their BPE merges common multi-character sequences in those languages into single tokens (def , function , for i in range, let mut, =>). NURL source is in none of these corpora, so its keyword- and glyph-shaped tokens decompose to near-character granularity. That is a property of the training corpus, not of the grammar's regularity — a tokeniser that had seen NURL would compress it far better (testing that is a separate, "what-could-be" experiment and is deliberately out of scope here).
This does not mean NURL has no LLM-native advantage — it means the advantage is not raw token count. Two claims are tangled in the project copy and must be separated: (1) "fewer tokens on today's tokenisers" — refuted here; (2) "a regular, locally-decodable grammar is easier for a model to generate correctly" — untested by this study and the more honest banner to fly. The next step the ROADMAP already lists is a controlled first-pass-compile-success comparison.
Caveat on the corpus. These are eight small, arithmetic- and systems-leaning micro-benchmarks, a regime that rewards terse infix math and penalises NURL's explicit prefix/type surface. They are not a representative sample of all agent-written code (no parsers, web handlers, glue, or data plumbing). They are, however, enough to retire the unqualified "more token-efficient" claim.
The Rust column for json_parse (1140–2065 tokens) reflects Serde derive macros and explicit type signatures that other languages do not require for the same task — a reminder that token count is task- and style-dependent in every language.