Interactive LLMs (chat, copilots, agents) with strict latency targets; long-context reasoning (codebases, research, video) with massive KV (key-value) cache footprints; ranking and recommendation models ...
LCLMs compress LLM context before decode — 8.8x faster at 16x compression, beating every KV cache method tested. Open-sourced by NYU and Columbia.
Google researchers have published a new quantization technique called TurboQuant that compresses the key-value (KV) cache in large language models to 3.5 bits per channel, cutting memory consumption ...
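As a rough illustration of why bits per channel matters for KV cache memory, the sketch below compares a 16-bit cache with a 3.5-bit one. The model shape (32 layers, 8 KV heads, head dimension 128, 131,072-token context) and the helper `kv_cache_bytes` are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the TurboQuant work. The estimate also ignores any quantization metadata such as scales or codebooks.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_channel, batch_size=1):
    """Approximate KV cache size in bytes (hypothetical helper).

    Ignores quantization metadata (scales, zero points, codebooks).
    """
    # Keys and values are each stored per layer with shape
    # [batch, kv_heads, seq_len, head_dim], hence the factor of 2.
    channels = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return channels * bits_per_channel / 8


# Assumed model shape, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

fp16 = kv_cache_bytes(bits_per_channel=16, **cfg)
q35 = kv_cache_bytes(bits_per_channel=3.5, **cfg)

print(f"16-bit cache:  {fp16 / 2**30:.2f} GiB")  # 16.00 GiB
print(f"3.5-bit cache: {q35 / 2**30:.2f} GiB")   # 3.50 GiB
print(f"reduction:     {fp16 / q35:.2f}x")       # 4.57x
```

Under these assumptions the cache shrinks from 16 GiB to 3.5 GiB per sequence. The ratio of about 4.57x depends only on the bit widths (16 / 3.5), not on the model shape.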