Google released a paper called TurboQuant, and within 24 hours, the community had ported it to llama.cpp.


What does TurboQuant do? It compresses the KV cache of large models to 3 bits, cutting memory use roughly 6x and speeding up inference about 8x on an H100.
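The post doesn't describe TurboQuant's actual algorithm, but the general idea of low-bit KV cache quantization can be sketched generically: store each group of cache values as 3-bit integers plus a shared scale and offset. This is a minimal illustrative sketch, not TurboQuant's method; the group size and fp16 scale format are assumptions.

```python
import numpy as np

def quantize_3bit(x, group_size=32):
    """Generic group-wise 3-bit quantization sketch (NOT TurboQuant's
    actual codebook design): each group of values shares one fp16
    scale/offset pair, and values map to integers in [0, 7]."""
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 7.0, 1.0)  # 2**3 - 1 = 7 levels
    q = np.clip(np.round((g - lo) / scale), 0, 7).astype(np.uint8)
    return q, scale.astype(np.float16), lo.astype(np.float16)

def dequantize_3bit(q, scale, lo):
    # Reconstruct approximate fp32 values from the 3-bit codes.
    return q.astype(np.float32) * scale.astype(np.float32) + lo.astype(np.float32)

# Roundtrip demo on a fake KV slice
rng = np.random.default_rng(0)
kv = rng.standard_normal(1024).astype(np.float32)
q, s, lo = quantize_3bit(kv)
err = np.abs(dequantize_3bit(q, s, lo) - kv.reshape(-1, 32)).max()
print(f"max reconstruction error: {err:.3f}")
```

In a real kernel the 3-bit codes would also be bit-packed (so ~10 values per 32-bit word) rather than stored one per byte as here; the achievable compression ratio depends on that packing plus the per-group metadata overhead.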
The key point: no retraining, no fine-tuning, and no loss of accuracy. This is one reason chip stocks plummeted.
Samsung and SK Hynix dropped over 6% in Seoul, and Micron fell 6.9% in the US stock market.
The market's concern: if every model needs one-sixth the memory, doesn't that reduce demand for HBM?
But I think the market overreacted. The reason is simple. The saved memory won't go to waste. Smaller KV caches mean the same GPU can handle larger contexts and more concurrent requests. Demand won't decrease; it will just be redistributed.
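A back-of-envelope calculation shows what "redistributed" means in practice. The model dimensions and memory budget below are illustrative assumptions (a 7B-class fp16 model), not figures from the paper or the post:

```python
# Assumed 7B-class shapes: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V
# per_token = 524288 bytes = 512 KiB of cache per token at fp16

budget = 40 * 1024**3        # say 40 GiB of an 80 GiB GPU is left for cache
tokens_fp16 = budget // per_token
tokens_3bit = budget // (per_token // 6)   # ~6x smaller, per the claim
print(tokens_fp16, tokens_3bit)
# Same memory budget, roughly 6x the cacheable tokens, spendable on
# longer contexts, more concurrent requests, or both.
```

That extra token budget is exactly the "redistributed demand" argument: serving providers don't buy less HBM, they serve more traffic per card.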
This pattern has repeated throughout tech history: when CPUs get faster, software consumes the performance headroom; when bandwidth increases, video streaming fills it; when memory becomes more efficient, models grow larger and more demanding.
Discussion #20969 on llama.cpp already has a working CPU implementation (pure C, no dependencies) and CUDA kernels.
Someone has run it on Apple Silicon using Metal. This means the barrier to running models locally has dropped another level.
TurboQuant is a short-term negative for chip stocks but an efficiency dividend for the whole AI industry over the medium term. Those running local models benefit: the same Mac can now fit larger models. Chip companies need not panic: demand won't disappear, it will just be used more efficiently.