{"name":"dict_coder.nu","source":"// ============================================================\n// dict_coder.nu — Mini text compressor & decompressor\n//\n// Builds a dictionary as it scans the input: each new word gets a\n// fresh integer ID, recurring words reuse it. The output is a flat\n// list of IDs that can be expanded back into the original text by\n// indexing into the dictionary vector. The savings depend on how\n// repetitive the input is — natural language usually compresses\n// well because a small core vocabulary covers most occurrences.\n//\n// Showcases the NURL stdlib generics:\n//   - HashMap[s i]  (word → ID lookup, fast O(1) average)\n//   - Vec[s]        (ID → word reverse lookup, fast O(1) random access)\n//   - Vec[i]        (compressed stream)\n//   - Option[?T]    (map_get returns Some/None — pattern-matched with ??)\n//   - Closures      (hash + eq passed per-call, since NURL has no\n//                    type-class dispatch)\n//\n// Ownership note: the words below are raw string-literal pointers,\n// so the HashMap and Vecs only borrow them — bare `map_free` /\n// `vec_free` is correct. For an owned-String dictionary you would\n// reach for `vec_free_with` + `map_free_with` instead.\n// ============================================================\n\n$ `stdlib/core/vec.nu`\n$ `stdlib/core/option.nu`\n$ `stdlib/std/hashmap.nu`\n\n// ── Closure wrappers ────────────────────────────────────────────────\n// NURL doesn't auto-promote @-functions to closure values, so the stock\n// hash_string / eq_string from hashmap.nu need a `\\`-wrapper at the\n// call site. Defined once at top level and reused everywhere.\n\n@ make_hash → ( @ i s ) {\n    ^ \\ s w → i { ^ ( hash_string w ) }\n}\n\n@ make_eq → ( @ b s s ) {\n    ^ \\ s a s b → b { ^ ( eq_string a b ) }\n}\n\n// ── Pretty-printing helpers ─────────────────────────────────────────\n\n@ banner s title → v {\n    ( nurl_print `\\n── ` )\n    ( nurl_print title )\n    ( nurl_print ` ` )\n    : ~ i pad 0\n    ~ < pad 50 {\n        ( nurl_print `─` )\n        = pad + pad 1\n    }\n    ( nurl_print `\\n` )\n}\n\n@ kv_line s key i val → v {\n    ( nurl_print `  ` )\n    ( nurl_print key )\n    ( nurl_print `: ` )\n    ( nurl_print ( nurl_str_int val ) )\n    ( nurl_print `\\n` )\n}\n\n// ── Encoder ─────────────────────────────────────────────────────────\n// For each word in the input slice:\n//   * if it is already in the dictionary, emit its existing ID\n//   * otherwise assign the next free ID, push to both lookup tables,\n//     and emit that\n// The dictionary is grown in place; the compressed stream is returned\n// implicitly via the (Vec i) handle (mutations propagate through the\n// shared heap control block — same model as String).\n\n@ encode [s words ( HashMap s i ) word_to_id ( Vec s ) id_to_word ( Vec i ) compressed → i {\n    : ( @ i s ) hs ( make_hash )\n    : ( @ b s s ) es ( make_eq )\n\n    : ~ i new_words 0\n\n    ~ word words {\n        : ?i looked_up ( map_get [s i] word_to_id word hs es )\n\n        : ~ i id 0\n        ?? looked_up {\n            T existing → { = id existing }\n            F → {\n                // First sighting: ID = current dictionary size.\n                = id ( vec_len [s] id_to_word )\n                ( map_set [s i] word_to_id word id hs es )\n                ( vec_push [s] id_to_word word )\n                = new_words + new_words 1\n            }\n        }\n\n        ( vec_push [i] compressed id )\n    }\n\n    ^ new_words\n}\n\n// ── Decoder ─────────────────────────────────────────────────────────\n// Walks the compressed Vec[i] and prints the matching dictionary entry\n// for every ID. A malformed ID (out of dictionary range) is rendered as\n// `??` rather than aborted — the kind of resilience a real codec wants.\n\n@ decode ( Vec s ) id_to_word ( Vec i ) compressed → v {\n    : i n ( vec_len [i] compressed )\n    : ~ i j 0\n    ~ < j n {\n        : ?i lookup_id ( vec_get [i] compressed j )\n        ?? lookup_id {\n            T id → {\n                : ?s lookup_word ( vec_get [s] id_to_word id )\n                ?? lookup_word {\n                    T w → ( nurl_print w )\n                    F → ( nurl_print `??` )\n                }\n            }\n            F → ( nurl_print `??` )\n        }\n        ( nurl_print ` ` )\n        = j + j 1\n    }\n    ( nurl_print `\\n` )\n}\n\n// ── Frequency report ────────────────────────────────────────────────\n// One pass over the compressed stream tallies how often each ID appears.\n// Implemented with map_fold to demonstrate higher-order stdlib calls:\n// the closure receives (acc, key, val) for every dictionary entry and\n// returns the running maximum frequency. We do the actual counting by\n// sweeping the compressed Vec into a fresh HashMap[i i].\n\n@ tally_freqs ( Vec i ) compressed → ( HashMap i i ) {\n    : ( @ i i ) hi \\ i x → i { ^ ( hash_int x ) }\n    : ( @ b i i ) ei \\ i a i b → b { ^ ( eq_int a b ) }\n\n    : ( HashMap i i ) freqs ( map_new [i i] )\n    : i n ( vec_len [i] compressed )\n    : ~ i k 0\n    ~ < k n {\n        : ?i opt_id ( vec_get [i] compressed k )\n        ?? opt_id {\n            T id → {\n                : ?i prev ( map_get [i i] freqs id hi ei )\n                : ~ i count 1\n                ?? prev {\n                    T c → { = count + c 1 }\n                    F → {}\n                }\n                ( map_set [i i] freqs id count hi ei )\n            }\n            F → {}\n        }\n        = k + k 1\n    }\n    ^ freqs\n}\n\n@ report_top_word ( HashMap i i ) freqs ( Vec s ) id_to_word → v {\n    : ( @ i i i i ) pick_max \\ i acc i k i v → i {\n        ^ ? > v acc v acc\n    }\n    : i max_count ( map_fold [i i i] freqs 0 pick_max )\n\n    ( nurl_print `  most frequent ID hit count: ` )\n    ( nurl_print ( nurl_str_int max_count ) )\n    ( nurl_print ` (` )\n\n    // Find any ID matching that count and render its word. We don't have\n    // a built-in `map_find` yet, so iterate id_to_word and probe.\n    : i dict_size ( vec_len [s] id_to_word )\n    : ( @ i i ) hi \\ i x → i { ^ ( hash_int x ) }\n    : ( @ b i i ) ei \\ i a i b → b { ^ ( eq_int a b ) }\n    : ~ i id 0\n    : ~ b printed F\n    ~ & < id dict_size ! printed {\n        : ?i c ( map_get [i i] freqs id hi ei )\n        ?? c {\n            T n → ?== n max_count {\n                : ?s w ( vec_get [s] id_to_word id )\n                ?? w {\n                    T s → { ( nurl_print `'` ) ( nurl_print s ) ( nurl_print `'` ) }\n                    F → ( nurl_print `?` )\n                }\n                = printed T\n            } {}\n            F → {}\n        }\n        = id + id 1\n    }\n    ( nurl_print `)\\n` )\n}\n\n// ── Entry point ─────────────────────────────────────────────────────\n\n@ main → i {\n    ( nurl_print `NURL dict_coder — minimal vocabulary compression\\n` )\n\n    // A fragment with deliberate repetition — the closer it gets to\n    // \"lots of the same words\", the better the compression ratio.\n    : [s words [s |\n    `the` `quick` `brown` `fox` `jumps` `over` `the` `lazy` `dog`\n    `the` `dog` `barks` `the` `fox` `runs` `the` `fox` `is` `quick`\n    `the` `quick` `fox` `is` `clever` `the` `dog` `is` `loyal`\n    `the` `quick` `brown` `fox` `is` `back`\n    ]\n\n    // Construct the three working tables. They live for the whole run\n    // and are freed in reverse order at the end.\n    : ( HashMap s i ) word_to_id ( map_new [s i] )\n    : ( Vec s ) id_to_word ( vec_new [s] )\n    : ( Vec i ) compressed ( vec_new [i] )\n\n    ( banner `Encoding` )\n    : i n_input . words length\n    : i n_new ( encode words word_to_id id_to_word compressed )\n\n    // Honest size metrics: text format counts each byte plus a\n    // separator; compressed format is 4 bytes per ID *plus* the\n    // dictionary itself (each unique word's bytes + a separator).\n    : ~ i raw_bytes 0\n    ~ w words {\n        = raw_bytes + raw_bytes + ( nurl_str_len w ) 1\n    }\n    : ~ i dict_bytes 0\n    : i dict_size ( vec_len [s] id_to_word )\n    : ~ i di 0\n    ~ < di dict_size {\n        : ?s ow ( vec_get [s] id_to_word di )\n        : s entry ( opt_unwrap_or [s] ow `` )\n        = dict_bytes + dict_bytes + ( nurl_str_len entry ) 1\n        = di + di 1\n    }\n    // Pick the narrowest fixed-width integer that fits every ID — for\n    // small dictionaries one byte is plenty, and that's where dictionary\n    // coding actually starts paying off.\n    : i id_width ? > dict_size 65536 4 ? > dict_size 256 2 1\n    : i stream_bytes * ( vec_len [i] compressed ) id_width\n    : i comp_bytes + stream_bytes dict_bytes\n    : i ratio_pct ? > raw_bytes 0 / * comp_bytes 100 raw_bytes 0\n\n    ( kv_line `input tokens          ` n_input )\n    ( kv_line `unique words          ` n_new )\n    ( kv_line `compressed stream len ` ( vec_len [i] compressed ) )\n    ( kv_line `ID width (bytes)      ` id_width )\n    ( kv_line `raw bytes (text)      ` raw_bytes )\n    ( kv_line `bytes: ID stream      ` stream_bytes )\n    ( kv_line `bytes: dictionary     ` dict_bytes )\n    ( kv_line `bytes: total comp     ` comp_bytes )\n    ( kv_line `compression % of raw  ` ratio_pct )\n\n    ( banner `Decoding` )\n    ( nurl_print `  ` )\n    ( decode id_to_word compressed )\n\n    ( banner `Frequencies` )\n    : ( HashMap i i ) freqs ( tally_freqs compressed )\n    ( report_top_word freqs id_to_word )\n    ( map_free [i i] freqs )\n\n    ( banner `Cleanup` )\n    ( map_free [s i] word_to_id )\n    ( vec_free [s] id_to_word )\n    ( vec_free [i] compressed )\n    ( nurl_print `  done.\\n` )\n\n    ^ 0\n}\n","bytes":10029}