Pilot framework

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Studying whether compact symbolic codes can be expanded by an LLM into task-relevant meaning, while exact and high-risk commitments remain protected outside the lossy channel.

May 2026

arXiv:2605.24541


[Figure: a semantic packet with protected + lossy channels, passed through LLM-DM. The PROTECTED channel holds exact numbers, safety limits, source spans, and critical atoms. The SEMANTICZIP ASCII channel holds compact cues, shown below, which are expanded into typed atoms + task meaning.]

@TRIP LIS/4d/Oct.early/$mod
Sintra BASE:Baixa|Chiado
NO:nightlife,car
OUT:d2d,transit,rain,costs

Diagnostic cases: 5
Formats compared: 6
Best WAR: 0.956
Balanced gain: 39.4%
Largest useful gain: 46.5%
Decoder setting: LLM round-trip

Abstract

SemanticZip studies lossy text compression for LLM systems: compact codes are not expected to reconstruct the original bytes, but to let an LLM recover the semantic commitments needed for downstream behavior. The pilot formalizes LLM-mediated decompression, compares six representation regimes over five diagnostic cases, and evaluates token gain against atom-level recoverability. The central design rule is stratified compression: protect exact or high-risk commitments, and only semantically zip predictable low-risk context.

Evaluation Pipeline

The work treats compression as a round-trip experiment: encode context, decode it with an LLM, reconstruct typed semantic atoms, then score what survived.

Protect commitments

Separate exact, safety-critical, source-grounded, or numerical commitments from the lossy channel.

Compress predictable context

Render low-risk background context into compact symbolic codes, dictionaries, or minified CCL forms.

Decode with an LLM

Treat the model as a semantic decompressor rather than requiring byte-identical reconstruction.

Recover typed atoms

Ask an independent decoder to reconstruct canonical semantic atoms from each compressed representation.

Score the round trip

Measure token gain, Critical Atom Recall, Weighted Atom Recall, and precision across diagnostic cases.
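The last two pipeline steps can be sketched as a small scoring routine. This is a minimal illustration, not the study's scorer: the page does not give the exact definitions, so the sketch assumes CAR is recall over atoms flagged critical, WAR is recall weighted by a per-atom weight, and precision is the share of decoded atoms that match a gold atom. The `Atom` class, the `score_round_trip` function, and the example atoms are hypothetical, and matching is by exact key only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    key: str                # canonical atom identifier (hypothetical format)
    weight: float = 1.0     # assumed criticality weight
    critical: bool = False  # assumed flag for high-risk commitments


def score_round_trip(gold: list[Atom], decoded: set[str]) -> dict[str, float]:
    """Score decoded atom keys against a non-empty gold list (exact key match)."""
    hits = {a.key for a in gold} & decoded
    critical = [a for a in gold if a.critical]
    total_weight = sum(a.weight for a in gold)
    return {
        # Recall restricted to critical atoms; 1.0 if none are flagged.
        "CAR": sum(a.key in hits for a in critical) / len(critical) if critical else 1.0,
        # Recall where each gold atom counts in proportion to its weight.
        "WAR": sum(a.weight for a in gold if a.key in hits) / total_weight,
        # Share of decoded atoms that correspond to some gold atom.
        "precision": len(hits) / len(decoded) if decoded else 1.0,
    }


# Made-up atoms for illustration only.
gold = [
    Atom("city=Lisbon", weight=2.0, critical=True),
    Atom("days=4", weight=2.0, critical=True),
    Atom("budget=moderate"),
    Atom("no_car"),
]
decoded = {"city=Lisbon", "days=4", "no_car"}
print(score_round_trip(gold, decoded))
```

In this toy run the decoder drops one low-weight atom and invents nothing, so precision stays at 1.0 while WAR falls below 1.0. That is the omission-without-hallucination pattern the diagnostic cases report.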

Compression vs Recoverability

The pilot shows a clear gradient rather than a universal frontier: natural language recovers best, minified CCL balances compression and recall, and ASCII SemanticZip gives the largest useful token reduction.

Format	Note	o200k gain	WAR
Structured prose	Recoverability ceiling	19.1%	0.956
CCL-Core	Typed protected format	8.7%	0.948
JSON	Canonical but verbose	-3.4%	0.933
CCL-Min	Best middle point	39.4%	0.874
SemanticZip ASCII	Largest useful gain	46.5%	0.802
SemanticZip emoji	Ambiguous symbols	34.7%	0.698

[Figure: Weighted Atom Recall (axis ticks from 0.65 to 1.00) plotted against o200k token gain vs original (%), one point per format: Structured prose, CCL-Core, JSON, CCL-Min, SemanticZip ASCII, SemanticZip emoji.]
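One way to read the trade-off is to ask which formats are not beaten on both axes by any other format. The sketch below applies a plain non-dominated filter to the o200k gain and WAR values from the table above. The `non_dominated` helper is an illustrative name, not something defined by the study.

```python
# (format, o200k token gain in %, WAR), copied from the summary table.
RESULTS = [
    ("Structured prose", 19.1, 0.956),
    ("CCL-Core", 8.7, 0.948),
    ("JSON", -3.4, 0.933),
    ("CCL-Min", 39.4, 0.874),
    ("SemanticZip ASCII", 46.5, 0.802),
    ("SemanticZip emoji", 34.7, 0.698),
]


def non_dominated(rows):
    """Return formats that no other format matches or beats on both gain and WAR."""
    keep = []
    for name, gain, war in rows:
        dominated = any(
            g >= gain and w >= war and (g > gain or w > war)
            for other, g, w in rows
            if other != name
        )
        if not dominated:
            keep.append(name)
    return keep


print(non_dominated(RESULTS))
# ['Structured prose', 'CCL-Min', 'SemanticZip ASCII']
```

The three survivors are the same three formats the summary singles out. On these pilot numbers, CCL-Core and JSON are dominated by structured prose, and the emoji variant is dominated by CCL-Min.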

Full Pilot Results

Token gain is measured relative to the original prompt or history. CAR and WAR are computed after independent LLM decompression into canonical atoms.

Format	o200k gain	cl100k gain	CAR	WAR	Precision
Structured prose	19.1%	18.8%	0.961	0.956	0.967
CCL-Core	8.7%	8.9%	0.955	0.948	0.897
JSON	-3.4%	0.6%	0.944	0.933	0.894
CCL-Min	39.4%	40.1%	0.878	0.874	0.933
SemanticZip ASCII	46.5%	46.1%	0.794	0.802	0.975
SemanticZip emoji	34.7%	31.8%	0.684	0.698	0.928
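Token gain relative to the original can be sketched as one minus the ratio of compressed to original token counts, which also explains how a verbose rendering such as JSON can score a negative gain. The code below assumes that definition. The tokenizer is passed in as a callable, and the whitespace counter and example strings are stand-ins for illustration only, not the o200k_base or cl100k_base tokenizers used in the study.

```python
from typing import Callable


def token_gain(original: str, compressed: str, count_tokens: Callable[[str], int]) -> float:
    """Fractional token reduction vs the original; negative means the rendering grew."""
    n_orig = count_tokens(original)
    if n_orig == 0:
        raise ValueError("original text has no tokens")
    return 1.0 - count_tokens(compressed) / n_orig


def whitespace_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace splitting, NOT a real BPE tokenizer.
    return len(text.split())


# Hypothetical original sentence paired with a shorthand line in the page's style.
original = "Plan a four day trip to Lisbon in early October on a moderate budget"
compressed = "@TRIP LIS/4d/Oct.early/$mod"
print(f"{token_gain(original, compressed, whitespace_tokens):.1%}")
```

Swapping in a real tokenizer only means passing a different `count_tokens` callable, so the same function can be used to compare gains across tokenizers.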

Key Findings

Practical takeaways from the pilot compression and decompression experiment.

Structured language remains the easiest decode target

Structured prose gives the highest round-trip recoverability in the pilot because the decoder already handles natural language well.

CCL-Min is the strongest balanced format

Minified CCL keeps enough explicit structure for recovery while reducing o200k tokens by 39.4% and retaining WAR=0.874.

ASCII shorthand beats emoji-heavy notation

SemanticZip ASCII nearly halves token count at WAR=0.802, while emoji-heavy compression is less compact and less recoverable in this run.

Lossy compression belongs behind a protected channel

Exact numbers, legal constraints, medical facts, safety boundaries, and private-data commitments should remain protected rather than zipped.

When SemanticZip makes sense

SemanticZip is useful when exact wording is unnecessary, the domain is familiar, the task structure repeats, and conventions can be shared. Good targets include agent memory, project-state summaries, low-risk preferences, planning templates, and compact code-generation specs.

When to keep data protected

Exact: numbers, identifiers, source quotations, constraints, and boundaries.

Risky: legal, medical, safety, privacy, and policy commitments.

Design principle: zip background context, protect commitments.
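The design principle can be sketched as a routing step that runs before any compression. The example below is a hypothetical illustration: the type labels in `PROTECTED_TYPES`, the `SemanticPacket` container, and the sample atoms are all assumptions made for the sketch, and the study's own taxonomy and packet layout may differ.

```python
from dataclasses import dataclass, field

# Assumed atom types that must never enter the lossy channel.
PROTECTED_TYPES = {
    "number", "identifier", "quote", "constraint",
    "legal", "medical", "safety", "privacy", "policy",
}


@dataclass
class SemanticPacket:
    protected: list[tuple[str, str]] = field(default_factory=list)  # kept verbatim
    lossy: list[tuple[str, str]] = field(default_factory=list)      # eligible for zipping


def stratify(atoms: list[tuple[str, str]]) -> SemanticPacket:
    """Route each (type, text) atom to the protected or the lossy channel."""
    packet = SemanticPacket()
    for atom_type, text in atoms:
        channel = packet.protected if atom_type in PROTECTED_TYPES else packet.lossy
        channel.append((atom_type, text))
    return packet


# Made-up atoms for illustration only.
packet = stratify([
    ("number", "budget cap 1200 EUR"),
    ("preference", "likes bookstores and viewpoints"),
    ("safety", "severe peanut allergy"),
    ("background", "first visit to Portugal"),
])
print(len(packet.protected), len(packet.lossy))  # 2 2
```

Only the `lossy` list would then be handed to a compact encoder. The `protected` list travels unchanged, so a decoding error cannot alter an exact or high-risk commitment.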

Diagnostic Cases

The cases are deliberately small and human-authored, but each one stresses a different kind of semantic commitment. The numbers below report the LLM round-trip for the aggressive SemanticZip ASCII representation in each case.

travel

Lisbon travel planning

Compress a four-day Lisbon itinerary request with lodging, transit, budget, weather, and preference constraints.

Gain: 43.4%, CAR: 0.889, WAR: 0.914, precision: 1.000

SemanticZip ASCII input

TRIP:LIS/4d/Oct↑/$mod P:walk+foodL+books+views +Sintra B:Bai|Chi !far,!club,!car OUT:d2d+tr+rain+€

LLM decoded as

Trip in Lisbon for 4 days in October with moderate budget; activities include walking, local food, bookstores, viewpoints, and Sintra; base preferences are Baixa or Chiado; exclude far lodging, clubs, and car rental; output should cover day-by-day plan, transit, rain, and costs.

What happened

SemanticZip ASCII preserved the important commitments with high selectivity: 16/18 atom hits, CAR 0.889, WAR 0.914, precision 1.000.

Takeaway

Domain shorthand worked well because Lisbon, 4d, Sintra, transit, budget, and avoidance cues are stable and easy for the decoder to expand.

Best comparison: Structured prose, JSON, and CCL-Core reached full LLM recall; ASCII was the most compact useful rendering.

code generation

JavaScript canvas physics

Compress a standalone HTML canvas simulation spec with gravity, damping, collision handling, pointer lock, drag controls, and visual effects.

Gain: 52.9%, CAR: 0.813, WAR: 0.794, precision: 1.000

SemanticZip ASCII input

CODE/🚀/1F G:m1m2/r² F:.99 PLock I:1HTML C:sun1+pl15+bh6 !libs,!assets P:grav+COL+ROT+raf U:drag→v=-Δ FX:pulse+trail+beam OUT:codeonly

LLM decoded as

Code task with gravity m1m2/r², damping .99, pointer lock, one HTML file, 1 sun, 15 planets, 6 black holes, no libraries or assets, gravity/collision/rotation/RAF physics, drag producing negative delta velocity, pulse/trail/beam effects, and code-only output.

What happened

CCL-Core recovered every atom. CCL-Min kept WAR 0.941, while SemanticZip ASCII remained usable at CAR 0.813, WAR 0.794, precision 1.000.

Takeaway

The symbolic code kept the shape of the program, but dense physics and UI abbreviations started dropping requirements.

Best comparison: CCL-Core was the safest compact representation for this technical spec.

data

Python data cleaning

Compress a pandas-only customer-orders script request with deduplication, date parsing, filtering, revenue computation, grouping, and output constraints.

Gain: 37.3%, CAR: 0.923, WAR: 0.949, precision: 0.875

SemanticZip ASCII input

PY/pd_only orders.csv: dedup#order_id/latest; email↓; date→UTC; drop qty<0; rev=qty*price; grp:cust×month; OUT:rev+count>monthly_customer_revenue.csv; err; !services

LLM decoded as

Use pandas-only Python on orders.csv; deduplicate by order_id keeping latest; convert dates to UTC; drop qty<0; compute revenue as qty*price; group customers by month; output revenue and count to monthly_customer_revenue.csv; do not use services. One visible error: email↓ was decoded as descending sort rather than lowercase.

What happened

SemanticZip ASCII recovered 12/13 atoms with WAR 0.949. Emoji compression fell sharply to WAR 0.610.

Takeaway

ASCII anchors such as pd_only, UTC, qty<0, and rev=qty*price remained precise enough for the decoder.

Best comparison: Structured prose, JSON, and CCL-Core reached full recall; ASCII gave strong compression with one missing atom.

React SaaS dashboard

Compress a React dashboard specification with navigation, sidebar, KPI cards, SVG chart, customer table, filter, responsive styling, dark-mode friendliness, and accessibility.

Gain: 56.6%, CAR: 0.429, WAR: 0.411, precision: 1.000

SemanticZip ASCII input

REACT/DASH/saas !chartlibs mock UI:topnav+side+kpi4+svgline+custTbl+dateFilt S:clean+resp+dark+a11y OUT:1comp

LLM decoded as

Decoded mostly as two broad atoms: a SaaS dashboard UI with chartlibs/mock/topnav/side/kpi4/svgline/customer table/date filter packed into one field, plus style clean/responsive/dark/a11y and one-component output. Many separate UI commitments were not recovered.

What happened

This was the failure case for aggressive SemanticZip: ASCII recovered only 6/14 atoms, CAR 0.429, WAR 0.411, despite precision 1.000.

Takeaway

UI specifications contain many independent commitments. Compressing them into a short cluster caused omissions rather than hallucinations.

Best comparison: Structured prose was strongest; CCL-Min was a safer compact compromise than ASCII for this case.

research

Research-outline prompt

Compress a paper-outline request about SemanticZip, its distinction from Context Codec, safety-critical channels, metrics, limitations, and future work.

Gain: 54.8%, CAR: 0.917, WAR: 0.940, precision: 1.000

SemanticZip ASCII input

OUTLINE/SZip=lossyText+LLMdecomp ≠CC(commitments) SEC:mot+formal+regimes+SAFE-lossless+metrics+RW+limits+future !SOTA

LLM decoded as

Paper outline for SemanticZip as lossy text with LLM decompression; distinguish it from Context Codec commitments; include motivation, formal setup, compression regimes, safe lossless channels, metrics, related work, limitations, and future work; do not claim SOTA.

What happened

ASCII recovered 11/12 atoms with WAR 0.940 and precision 1.000. CCL-Min and structured prose reached full recovery.

Takeaway

The decoder handled compact research-outline notation well because section labels and conceptual anchors were explicit.

Best comparison: CCL-Min reached full recovery with strong compression; ASCII was nearly as recoverable and slightly shorter.
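As a consistency check, the per-case SemanticZip ASCII figures above can be averaged and compared with the aggregate row in the full results table. The sketch below uses only numbers reported on this page and assumes a simple unweighted mean across the five cases, which the page does not state explicitly.

```python
from statistics import fmean

# SemanticZip ASCII figures as reported in the five diagnostic cases.
CASES = {
    "travel":   {"gain": 43.4, "CAR": 0.889, "WAR": 0.914, "precision": 1.000},
    "code":     {"gain": 52.9, "CAR": 0.813, "WAR": 0.794, "precision": 1.000},
    "data":     {"gain": 37.3, "CAR": 0.923, "WAR": 0.949, "precision": 0.875},
    "react":    {"gain": 56.6, "CAR": 0.429, "WAR": 0.411, "precision": 1.000},
    "research": {"gain": 54.8, "CAR": 0.917, "WAR": 0.940, "precision": 1.000},
}


def case_mean(metric: str, digits: int = 3) -> float:
    """Unweighted mean of one metric across the five cases."""
    return round(fmean([case[metric] for case in CASES.values()]), digits)


print(case_mean("CAR"), case_mean("WAR"), case_mean("precision"))  # 0.794 0.802 0.975
print(case_mean("gain", digits=1))  # 49.0
```

The averaged CAR, WAR, and precision reproduce the SemanticZip ASCII row of the results table. The averaged gain (49.0%) is higher than the 46.5% o200k figure in that table. This would be consistent with the aggregate gain being computed over pooled token counts rather than as a per-case mean, although the page does not say which.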

Known Limitations

Pilot scale

The study uses five author-constructed diagnostic cases, not a benchmark-scale corpus.

Single decoder setup

Round-trip recovery should be tested across multiple models, dates, prompts, and decoding configurations.

Defined scoring

Gold atoms, criticality weights, aliases, and fuzzy-match thresholds are defined within the study.

No prompt-compression baselines yet

Future work must compare against LLMLingua, LongLLMLingua, Selective Context, and related systems.

Tokenizer dependence

Token gains are reported for cl100k_base and o200k_base; other tokenizers may score symbols differently.

Utility is proxied

Atom recall is informative, but task-level behavioral validation is still needed.
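The "Defined scoring" item mentions aliases and fuzzy-match thresholds. A minimal sketch of how such a matcher could work is shown below, using the standard-library `difflib.SequenceMatcher`. The 0.8 threshold, the normalization, and the example strings are assumptions for illustration and are not the study's settings.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def matches(decoded: str, gold: str, aliases: list[str], threshold: float = 0.8) -> bool:
    """True if the decoded atom is close enough to the gold atom or any alias."""
    target = normalize(decoded)
    return any(
        SequenceMatcher(None, target, normalize(candidate)).ratio() >= threshold
        for candidate in [gold, *aliases]
    )


# An alias lets a differently worded but equivalent decoding count as a hit.
print(matches("Lowercase  email", "email lowercased", ["lowercase email"]))  # True
# A decoding with a different meaning should stay below the threshold.
print(matches("sort emails descending", "lowercase emails", []))  # False
```

The second call mirrors the error reported in the data-cleaning case, where a lowercase cue was read as a descending sort. Character-level similarity is a crude proxy for meaning, which is part of why the threshold and alias lists affect the reported scores.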

References

[1] "SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors," arXiv:2605.24541, 2026.

[2] "Compress the Context, Keep the Commitments: A Context Codec for Efficient and Verifiable LLM State," arXiv:2605.01826, 2026.

[3] "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," EMNLP, 2023.

[4] "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression," ACL, 2024.

[5] "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.

Interested in this research?

We are open to collaborations on LLM memory, prompt compression, semantic evaluation, and efficient AI systems.
