Embedded Intelligence Lab (EMILAB)

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Studying whether compact symbolic codes can be expanded by an LLM into task-relevant meaning, while exact and high-risk commitments remain protected outside the lossy channel.

N. Trukhina, V. Vashkelis | May 2026 | arXiv:2605.24541
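Figure: Semantic packet with protected and lossy channels, decoded by an LLM (labelled LLM-DM). The PROTECTED channel carries exact numbers, safety limits, source spans, and critical atoms. The SEMANTICZIP ASCII channel carries compact cues, for example "@TRIP LIS/4d/Oct.early/$mod Sintra BASE:Baixa|Chiado NO:nightlife,car OUT:d2d,transit,rain,costs", which the decoder expands into typed atoms plus task meaning.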
Diagnostic cases: 5
Formats compared: 6
Best WAR: 0.956
Balanced gain: 39.4%
Largest useful gain: 46.5%
Decoder setting: LLM round-trip

Abstract

SemanticZip studies lossy text compression for LLM systems: compact codes are not expected to reconstruct the original bytes, only to let an LLM recover the semantic commitments needed for downstream behavior. The pilot formalizes LLM-mediated decompression, compares six representation regimes over five diagnostic cases, and evaluates token gain against atom-level recoverability. The central design rule is stratified compression: protect exact or high-risk commitments, and semantically zip only predictable, low-risk context.

Evaluation Pipeline

The work treats compression as a round-trip experiment: encode context, decode it with an LLM, reconstruct typed semantic atoms, then score what survived.

01

Protect commitments

Separate exact, safety-critical, source-grounded, or numerical commitments from the lossy channel.

02

Compress predictable context

Render low-risk background context into compact symbolic codes, dictionaries, or minified CCL forms.

03

Decode with an LLM

Treat the model as a semantic decompressor rather than requiring byte-identical reconstruction.

04

Recover typed atoms

Ask an independent decoder to reconstruct canonical semantic atoms from each compressed representation.

05

Score the round trip

Measure token gain, Critical Atom Recall, Weighted Atom Recall, and precision across diagnostic cases.
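The five steps above can be sketched as a small round-trip harness. This is a minimal illustration, not the authors' implementation: the `SemanticPacket` class, the `round_trip` function, and the `toy_decoder` stand-in for an LLM call are all hypothetical names, and the budget value is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SemanticPacket:
    """Hypothetical container: protected atoms travel verbatim, the rest is zipped."""
    protected: Dict[str, str] = field(default_factory=dict)  # exact commitments
    lossy_code: str = ""                                      # compact symbolic cues


def round_trip(packet: SemanticPacket,
               decoder: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Decode the lossy channel, then overlay protected atoms so they always win."""
    atoms = dict(decoder(packet.lossy_code))  # steps 03-04: decoder proposes typed atoms
    atoms.update(packet.protected)            # step 01: protected values are never re-derived
    return atoms


def toy_decoder(code: str) -> Dict[str, str]:
    """Stand-in for an LLM call: expands whitespace-separated 'KEY:value' cues."""
    atoms = {}
    for cue in code.split():
        key, sep, value = cue.partition(":")
        if sep:
            atoms[key.lower()] = value
    return atoms


packet = SemanticPacket(
    protected={"budget_eur": "1200"},  # made-up exact value, kept out of the lossy channel
    lossy_code="DEST:LIS DAYS:4 BUDGET_EUR:approx",
)
decoded = round_trip(packet, toy_decoder)
print(decoded)
```

Overlaying the protected atoms last means a vague or wrong value recovered from the lossy channel cannot overwrite an exact commitment. Scoring (step 05) would then compare `decoded` against gold atoms.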

Compression vs Recoverability

The pilot shows a clear gradient rather than a universal frontier: natural language recovers best, minified CCL balances compression and recall, and ASCII SemanticZip gives the largest useful token reduction.

| Format | Note | o200k gain | WAR |
|---|---|---|---|
| Structured prose | Recoverability ceiling | 19.1% | 0.956 |
| CCL-Core | Typed protected format | 8.7% | 0.948 |
| JSON | Canonical but verbose | -3.4% | 0.933 |
| CCL-Min | Best middle point | 39.4% | 0.874 |
| SemanticZip ASCII | Largest useful gain | 46.5% | 0.802 |
| SemanticZip emoji | Ambiguous symbols | 34.7% | 0.698 |
Figure: Weighted Atom Recall plotted against o200k token gain vs original (%) for Structured prose, CCL-Core, JSON, CCL-Min, SemanticZip ASCII, and SemanticZip emoji.
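One way to read the gradient is to ask which formats are not beaten on both axes at once. The sketch below applies a standard non-dominated filter to the o200k gain and WAR values from the comparison above. The filter is an illustration added here, not an analysis the pilot reports, and `non_dominated` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# (o200k token gain in %, WAR), copied from the comparison above.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Structured prose": (19.1, 0.956),
    "CCL-Core": (8.7, 0.948),
    "JSON": (-3.4, 0.933),
    "CCL-Min": (39.4, 0.874),
    "SemanticZip ASCII": (46.5, 0.802),
    "SemanticZip emoji": (34.7, 0.698),
}


def non_dominated(results: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep formats for which no other format is at least as good on both
    gain and WAR and strictly better on one of them."""
    keep = []
    for name, (gain, war) in results.items():
        dominated = any(
            g >= gain and w >= war and (g > gain or w > war)
            for other, (g, w) in results.items()
            if other != name
        )
        if not dominated:
            keep.append(name)
    return keep


frontier = non_dominated(RESULTS)
print(frontier)  # ['Structured prose', 'CCL-Min', 'SemanticZip ASCII']
```

On these numbers the three surviving formats are the same ones the text singles out, and no single format wins on both gain and recall.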

Full Pilot Results

Token gain is measured relative to the original prompt or history. CAR and WAR are computed after independent LLM decompression into canonical atoms.

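| Format | o200k | cl100k | CAR | WAR | Precision |
|---|---|---|---|---|---|
| Structured prose | 19.1% | 18.8% | 0.961 | 0.956 | 0.967 |
| CCL-Core | 8.7% | 8.9% | 0.955 | 0.948 | 0.897 |
| JSON | -3.4% | 0.6% | 0.944 | 0.933 | 0.894 |
| CCL-Min | 39.4% | 40.1% | 0.878 | 0.874 | 0.933 |
| SemanticZip ASCII | 46.5% | 46.1% | 0.794 | 0.802 | 0.975 |
| SemanticZip emoji | 34.7% | 31.8% | 0.684 | 0.698 | 0.928 |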
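This page does not spell out the formulas behind CAR, WAR, and precision, so the following is only a plausible sketch under stated assumptions: CAR is recall over atoms flagged as critical, WAR is recall weighted by per-atom criticality, and precision is the share of decoded atoms that match a gold atom. The `GoldAtom` class, the atom keys, and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class GoldAtom:
    key: str        # canonical atom identifier (hypothetical naming)
    weight: float   # assumed criticality weight
    critical: bool  # assumed flag: does this atom count toward CAR?


def score(gold: List[GoldAtom], decoded: Iterable[str]) -> Dict[str, float]:
    """Compute CAR, WAR, and precision under the assumed definitions above."""
    decoded_set = set(decoded)
    hits = decoded_set & {a.key for a in gold}

    critical = [a for a in gold if a.critical]
    total_weight = sum(a.weight for a in gold)

    # Returning 1.0 for empty denominators is a convention chosen for this sketch.
    car = sum(a.key in hits for a in critical) / len(critical) if critical else 1.0
    war = sum(a.weight for a in gold if a.key in hits) / total_weight if total_weight else 1.0
    precision = len(hits) / len(decoded_set) if decoded_set else 1.0
    return {"CAR": car, "WAR": war, "precision": precision}


gold = [
    GoldAtom("dest", 3.0, True),
    GoldAtom("days", 3.0, True),
    GoldAtom("no_car", 2.0, True),
    GoldAtom("rain_plan", 1.0, False),
]
result = score(gold, ["dest", "days", "rain_plan", "nightlife"])
print(result)
```

In this toy run two of three critical atoms are recovered (CAR 2/3), recovered weight is 7 of 9 (WAR 7/9), and one of four decoded atoms is spurious (precision 3/4). Exact key matching is used here for brevity; the pilot's scoring also involves aliases and fuzzy-match thresholds.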

Key Findings

Practical takeaways from the pilot compression and decompression experiment.

01

Structured language remains the easiest decode target

Structured prose gives the highest round-trip recoverability in the pilot because the decoder already handles natural language well.

02

CCL-Min is the strongest balanced format

Minified CCL keeps enough explicit structure for recovery while reducing o200k tokens by 39.4% and retaining WAR=0.874.

03

ASCII shorthand beats emoji-heavy notation

SemanticZip ASCII nearly halves token count at WAR=0.802, while emoji-heavy compression is less compact and less recoverable in this run.

04

Lossy compression belongs behind a protected channel

Exact numbers, legal constraints, medical facts, safety boundaries, and private-data commitments should remain protected rather than zipped.
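The token gains quoted in the findings above are relative to the original text. A minimal sketch of that measurement follows, assuming gain is defined as one minus the ratio of compressed to original token counts (which would explain how a verbose format can score a negative gain). The `token_gain` helper and the whitespace counter are illustrative stand-ins. A real measurement would presumably pass a counter backed by the o200k_base or cl100k_base encodings, for instance via the tiktoken package, which is left out here to keep the block dependency-free.

```python
from typing import Callable


def whitespace_tokens(text: str) -> int:
    """Crude stand-in tokenizer: counts whitespace-separated chunks."""
    return len(text.split())


def token_gain(original: str, compressed: str,
               count: Callable[[str], int] = whitespace_tokens) -> float:
    """Fraction of tokens saved relative to the original; negative if the text grew."""
    base = count(original)
    if base == 0:
        raise ValueError("original text has no tokens")
    return 1.0 - count(compressed) / base


original = "Plan a four day trip to Lisbon in early October"
compressed = "TRIP LIS 4d Oct.early"
gain = token_gain(original, compressed)
print(f"{gain:.1%}")  # 60.0% with the whitespace counter
```

Because `count` is a parameter, the same function can be re-run with different tokenizers, which matters since symbol-heavy codes may tokenize very differently from prose.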

When SemanticZip makes sense

SemanticZip is useful when exact wording is unnecessary, the domain is familiar, the task structure repeats, and conventions can be shared. Good targets include agent memory, project-state summaries, low-risk preferences, planning templates, and compact code-generation specs.

When to keep data protected

Exact: numbers, identifiers, source quotations, constraints, and boundaries.
Risky: legal, medical, safety, privacy, and policy commitments.
Design principle: zip background context, protect commitments.
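The design principle can be illustrated as a routing step that runs before any compression. The sketch below assumes each atom already carries a type tag. The tag names, the `PROTECTED_TYPES` set, and the example atoms are hypothetical and only loosely follow the exact and risky groups listed above.

```python
from typing import Dict, List, Tuple

# Hypothetical type tags, loosely following the "exact" and "risky" groups above.
PROTECTED_TYPES = {
    "number", "identifier", "quotation", "constraint", "boundary",  # exact
    "legal", "medical", "safety", "privacy", "policy",              # risky
}


def route(atoms: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Split (type, key, value) atoms into a protected and a lossy channel."""
    channels: Dict[str, List[Tuple[str, str]]] = {"protected": [], "lossy": []}
    for atom_type, key, value in atoms:
        channel = "protected" if atom_type in PROTECTED_TYPES else "lossy"
        channels[channel].append((key, value))
    return channels


atoms = [
    ("number", "max_dose_mg", "400"),            # made-up example values
    ("preference", "tone", "friendly"),
    ("privacy", "share_email", "never"),
    ("background", "project", "dashboard refresh"),
]
channels = route(atoms)
print(channels)
```

Only the `lossy` list would be handed to a SemanticZip-style encoder; the `protected` list would be carried verbatim. How atoms get their type tags in the first place is outside this sketch.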

Diagnostic Cases

The cases are deliberately small and human-authored, but each one stresses a different kind of semantic commitment. The numbers below report the LLM round-trip for the aggressive SemanticZip ASCII representation in each case.

01
travel

Lisbon travel planning

Compress a four-day Lisbon itinerary request with lodging, transit, budget, weather, and preference constraints.

Gain 43.4% | CAR 0.889 | WAR 0.914 | Precision 1.000
SemanticZip ASCII input
TRIP:LIS/4d/Oct↑/$mod P:walk+foodL+books+views +Sintra B:Bai|Chi !far,!club,!car OUT:d2d+tr+rain+€
LLM decoded as

Trip in Lisbon for 4 days in October with moderate budget; activities include walking, local food, bookstores, viewpoints, and Sintra; base preferences are Baixa or Chiado; exclude far lodging, clubs, and car rental; output should cover day-by-day plan, transit, rain, and costs.

What happened

SemanticZip ASCII preserved the important commitments with high selectivity: 16/18 atom hits, CAR 0.889, WAR 0.914, precision 1.000.

Takeaway

Domain shorthand worked well because Lisbon, 4d, Sintra, transit, budget, and avoidance cues are stable and easy for the decoder to expand.

Best comparison: Structured prose, JSON, and CCL-Core reached full LLM recall; ASCII was the most compact useful rendering.
02
code generation

JavaScript canvas physics

Compress a standalone HTML canvas simulation spec with gravity, damping, collision handling, pointer lock, drag controls, and visual effects.

Gain 52.9% | CAR 0.813 | WAR 0.794 | Precision 1.000
SemanticZip ASCII input
CODE/🚀/1F G:m1m2/r² F:.99 PLock I:1HTML C:sun1+pl15+bh6 !libs,!assets P:grav+COL+ROT+raf U:drag→v=-Δ FX:pulse+trail+beam OUT:codeonly
LLM decoded as

Code task with gravity m1m2/r², damping .99, pointer lock, one HTML file, 1 sun, 15 planets, 6 black holes, no libraries or assets, gravity/collision/rotation/RAF physics, drag producing negative delta velocity, pulse/trail/beam effects, and code-only output.

What happened

CCL-Core recovered every atom. CCL-Min kept WAR 0.941, while SemanticZip ASCII remained usable at CAR 0.813, WAR 0.794, precision 1.000.

Takeaway

The symbolic code kept the shape of the program, but dense physics and UI abbreviations started dropping requirements.

Best comparison: CCL-Core was the safest compact representation for this technical spec.
03
data

Python data cleaning

Compress a pandas-only customer-orders script request with deduplication, date parsing, filtering, revenue computation, grouping, and output constraints.

Gain 37.3% | CAR 0.923 | WAR 0.949 | Precision 0.875
SemanticZip ASCII input
PY/pd_only orders.csv: dedup#order_id/latest; email↓; date→UTC; drop qty<0; rev=qty*price; grp:cust×month; OUT:rev+count>monthly_customer_revenue.csv; err; !services
LLM decoded as

Use pandas-only Python on orders.csv; deduplicate by order_id keeping latest; convert dates to UTC; drop qty<0; compute revenue as qty*price; group customers by month; output revenue and count to monthly_customer_revenue.csv; do not use services. One visible error: email↓ was decoded as descending sort rather than lowercase.

What happened

SemanticZip ASCII recovered 12/13 atoms with WAR 0.949. Emoji compression fell sharply to WAR 0.610.

Takeaway

ASCII anchors such as pd_only, UTC, qty<0, and rev=qty*price remained precise enough for the decoder.

Best comparison: Structured prose, JSON, and CCL-Core reached full recall; ASCII gave strong compression with one missing atom.
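The email↓ misreading suggests that a cue with more than one natural interpretation is a weak point. One possible mitigation, sketched here as an assumption and not something the pilot evaluates, is a shared codebook that maps each cue to exactly one meaning and is expanded deterministically before or alongside LLM decoding. The cue `lc` and every entry in `CODEBOOK` are invented for illustration.

```python
import re
from typing import Dict

# Hypothetical shared codebook: each cue maps to exactly one meaning.
CODEBOOK: Dict[str, str] = {
    "lc": "lowercase",
    "dedup": "deduplicate",
    "grp": "group by",
    "rev": "revenue",
}


def expand(code: str, codebook: Dict[str, str] = CODEBOOK) -> str:
    """Replace whole-word cues with their agreed expansions; leave the rest untouched."""
    if not codebook:
        return code
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, codebook)) + r")\b")
    return pattern.sub(lambda m: codebook[m.group(1)], code)


expanded = expand("dedup#order_id; email:lc; grp:cust")
print(expanded)  # deduplicate#order_id; email:lowercase; group by:cust
```

Writing `email:lc` instead of an arrow removes the sort-versus-case ambiguity at the cost of one agreed convention, which fits the observation that SemanticZip works best when conventions can be shared.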
04
UI

React SaaS dashboard

Compress a React dashboard specification with navigation, sidebar, KPI cards, SVG chart, customer table, filter, responsive styling, dark-mode friendliness, and accessibility.

Gain 56.6% | CAR 0.429 | WAR 0.411 | Precision 1.000
SemanticZip ASCII input
REACT/DASH/saas !chartlibs mock UI:topnav+side+kpi4+svgline+custTbl+dateFilt S:clean+resp+dark+a11y OUT:1comp
LLM decoded as

Decoded mostly as two broad atoms: a SaaS dashboard UI with chartlibs/mock/topnav/side/kpi4/svgline/customer table/date filter packed into one field, plus style clean/responsive/dark/a11y and one-component output. Many separate UI commitments were not recovered.

What happened

This was the failure case for aggressive SemanticZip: ASCII recovered only 6/14 atoms, CAR 0.429, WAR 0.411, despite precision 1.000.

Takeaway

UI specifications contain many independent commitments. Compressing them into a short cluster caused omissions rather than hallucinations.

Best comparison: Structured prose was strongest; CCL-Min was a safer compact compromise than ASCII for this case.
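Since the omissions in this case came from many independent commitments being packed into one cluster, a simple pre-flight check could count how many items each field carries. The following is a hypothetical lint, not part of the pilot: `packed_items`, `overpacked`, and the threshold of four items are arbitrary choices for illustration.

```python
from typing import Dict


def packed_items(code: str, joiner: str = "+") -> Dict[str, int]:
    """Count joiner-separated items in each whitespace-delimited 'FIELD:a+b+c' cue."""
    counts = {}
    for cue in code.split():
        name, sep, body = cue.partition(":")
        if sep and body:
            counts[name] = len(body.split(joiner))
    return counts


def overpacked(code: str, max_items: int = 4) -> Dict[str, int]:
    """Return fields whose item count exceeds a (hypothetical) threshold."""
    return {name: n for name, n in packed_items(code).items() if n > max_items}


code = ("REACT/DASH/saas !chartlibs mock "
        "UI:topnav+side+kpi4+svgline+custTbl+dateFilt "
        "S:clean+resp+dark+a11y OUT:1comp")
flags = overpacked(code)
print(flags)  # {'UI': 6}
```

On the input from this case the check flags the `UI` field, which is the cluster the decoder collapsed into a single atom. A flagged field could be split into separate cues or moved to a more explicit format such as CCL-Min.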
05
research

Research-outline prompt

Compress a paper-outline request about SemanticZip, its distinction from Context Codec, safety-critical channels, metrics, limitations, and future work.

Gain 54.8% | CAR 0.917 | WAR 0.940 | Precision 1.000
SemanticZip ASCII input
OUTLINE/SZip=lossyText+LLMdecomp ≠CC(commitments) SEC:mot+formal+regimes+SAFE-lossless+metrics+RW+limits+future !SOTA
LLM decoded as

Paper outline for SemanticZip as lossy text with LLM decompression; distinguish it from Context Codec commitments; include motivation, formal setup, compression regimes, safe lossless channels, metrics, related work, limitations, and future work; do not claim SOTA.

What happened

ASCII recovered 11/12 atoms with WAR 0.940 and precision 1.000. CCL-Min and structured prose reached full recovery.

Takeaway

The decoder handled compact research-outline notation well because section labels and conceptual anchors were explicit.

Best comparison: CCL-Min reached full recovery with strong compression; ASCII was nearly as recoverable and slightly shorter.
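As a consistency check, the per-case SemanticZip ASCII numbers above can be averaged and compared with the aggregate row in the full results. The page does not state the aggregation method, so the unweighted mean over cases used below is an assumption.

```python
from statistics import mean

# Per-case SemanticZip ASCII numbers as reported in the five cases above.
ASCII_CASES = {
    "travel":   {"gain": 43.4, "CAR": 0.889, "WAR": 0.914, "precision": 1.000},
    "code":     {"gain": 52.9, "CAR": 0.813, "WAR": 0.794, "precision": 1.000},
    "data":     {"gain": 37.3, "CAR": 0.923, "WAR": 0.949, "precision": 0.875},
    "ui":       {"gain": 56.6, "CAR": 0.429, "WAR": 0.411, "precision": 1.000},
    "research": {"gain": 54.8, "CAR": 0.917, "WAR": 0.940, "precision": 1.000},
}

# Unweighted (macro) mean of each metric across the five cases.
macro = {
    metric: round(mean(case[metric] for case in ASCII_CASES.values()), 3)
    for metric in ("gain", "CAR", "WAR", "precision")
}
print(macro)
```

The macro means come out at CAR 0.794, WAR 0.802, and precision 0.975, matching the aggregate SemanticZip ASCII row. The mean gain is 49.0%, which does not match the reported 46.5% (o200k) or 46.1% (cl100k). One possible explanation is that aggregate gain is computed over pooled token counts, so longer cases weigh more, but the page does not say.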

Known Limitations

Pilot scale

The study uses five author-constructed diagnostic cases, not a benchmark-scale corpus.

Single decoder setup

Round-trip recovery should be tested across multiple models, dates, prompts, and decoding configurations.

Author-defined scoring

Gold atoms, criticality weights, aliases, and fuzzy-match thresholds are defined by the authors.

No prompt-compression baselines yet

Future work must compare against LLMLingua, LongLLMLingua, Selective Context, and related systems.

Tokenizer dependence

Token gains are reported for cl100k_base and o200k_base; other tokenizers may score symbols differently.

Utility is proxied

Atom recall is informative, but task-level behavioral validation is still needed.
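To make the scoring limitation concrete, the sketch below shows how aliases and a fuzzy-match threshold can decide whether a decoded phrase counts as a hit. It uses the standard-library `difflib.SequenceMatcher`. The gold atoms, aliases, and the 0.8 threshold are hypothetical and are not the authors' actual settings.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def match_atom(candidate: str,
               gold: Dict[str, List[str]],
               threshold: float = 0.8) -> Optional[str]:
    """Return the gold atom whose name or alias best matches, or None below threshold."""
    cand = normalize(candidate)
    best_key, best_score = None, 0.0
    for key, aliases in gold.items():
        for form in [key, *aliases]:
            similarity = SequenceMatcher(None, cand, normalize(form)).ratio()
            if similarity > best_score:
                best_key, best_score = key, similarity
    return best_key if best_score >= threshold else None


# Hypothetical gold atoms mapped to their accepted aliases.
GOLD = {
    "no car rental": ["exclude car rental", "avoid renting a car"],
    "lowercase emails": ["normalize email case"],
}

hit = match_atom("Exclude car rental", GOLD)
miss = match_atom("sort emails descending", GOLD)
print(hit, miss)
```

Here the first phrase matches an alias exactly after normalization, while the second stays below the threshold and is not credited. Moving the threshold or editing the alias lists changes which decodings count as recovered, which is why author-defined scoring is listed as a limitation.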

References

[1] N. Trukhina and V. Vashkelis, "SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors," arXiv:2605.24541, 2026.
[2] N. Trukhina and V. Vashkelis, "Compress the Context, Keep the Commitments: A Context Codec for Efficient and Verifiable LLM State," arXiv:2605.01826, 2026.
[3] H. Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," EMNLP, 2023.
[4] H. Jiang et al., "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression," ACL, 2024.
[5] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.
