Yang Zhilin GTC 2026 Speech: Revealing Kimi's technical roadmap and discussing the "scaling bottleneck"
Sina Tech reported on the morning of March 18 that at NVIDIA GTC 2026, Kimi founder Yang Zhilin argued that continuing to push the upper limits of large-model intelligence will require a fundamental restructuring of optimizers, attention mechanisms, and residual connections.
Since the official release of Kimi K2.5 at the end of January this year, this was the first time Yang Zhilin systematically disclosed the technical roadmap behind the model. He summarized Kimi's evolution as scaling along three mutually reinforcing dimensions: token efficiency, long context, and agent swarms. In his view, scaling is no longer mere resource accumulation; it means pursuing scale effects simultaneously in computational efficiency, long-term memory, and automated collaboration. If the gains in these three areas compound multiplicatively, the model will exhibit intelligence far beyond current levels.
The core of this presentation was technological restructuring. Yang Zhilin pointed out that many industry-standard techniques used today are essentially products from eight or nine years ago and are gradually becoming bottlenecks for scaling.
Since 2014, the Adam optimizer has been regarded as the industry standard, but in ultra-large-scale training, finding more token-efficient alternatives has become a trend. The Kimi team experimentally validated the Muon optimizer's significant potential for improving token efficiency, but when scaling to the trillion-parameter K2 model, they encountered stability issues: exploding attention logits caused the model to diverge. To address this, the team developed and open-sourced the MuonClip optimizer, which combines Newton-Schulz iteration with the QK-Clip mechanism to resolve the logits-explosion problem while achieving roughly twice the computational efficiency of traditional AdamW.
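The QK-Clip idea described above can be illustrated with a minimal sketch: when the maximum attention logit exceeds a threshold, the query and key projection weights are rescaled so the logits fall back under the cap. This is an illustrative NumPy toy under stated assumptions (the threshold `tau` and the symmetric split of the rescaling factor across Q and K are assumptions, not the published K2 settings):

```python
import numpy as np

def qk_clip(W_q, W_k, x, tau=100.0):
    """Rescale Q/K projection weights if attention logits exceed tau.

    x: (n_tokens, d) token embeddings; W_q, W_k: (d_head, d) projections.
    """
    q = x @ W_q.T
    k = x @ W_k.T
    logits = q @ k.T / np.sqrt(q.shape[-1])
    max_logit = np.abs(logits).max()
    if max_logit > tau:
        # Shrink both projections by sqrt(tau / max_logit), so the Q·K
        # product (and hence every logit) shrinks by exactly tau / max_logit.
        gamma = np.sqrt(tau / max_logit)
        W_q = W_q * gamma
        W_k = W_k * gamma
    return W_q, W_k
```

Because the cap is enforced by rescaling the weights themselves rather than clipping activations, the correction persists into subsequent training steps.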
Regarding the full attention mechanism introduced in 2017, Yang Zhilin showcased Kimi Linear, built on the KDA architecture. This hybrid linear-attention architecture challenges the convention that every layer must use full attention. By optimizing recurrent-state management, it increased decoding speed by 5 to 6 times in ultra-long contexts of 128K or even 1M tokens, while maintaining strong performance across different context lengths.
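The decoding-speed advantage of linear attention comes from maintaining a fixed-size recurrent state instead of a key-value cache that grows with context length, so per-token decode cost is constant in sequence length. A minimal NumPy sketch of causal linear-attention decoding (the ReLU-based feature map here is a common textbook simplification, not the actual KDA kernel):

```python
import numpy as np

def linear_attn_decode(qs, ks, vs):
    """Causal linear attention via a fixed-size recurrent state.

    qs, ks, vs: (n_tokens, d). The state S (d x d) and normalizer z (d)
    summarize the entire prefix, so each decode step costs O(d^2)
    regardless of how long the context is.
    """
    phi = lambda x: np.maximum(x, 0) + 1e-6  # positive feature map (assumption)
    d = qs.shape[-1]
    S = np.zeros((d, d))  # accumulates phi(k) v^T over the prefix
    z = np.zeros(d)       # accumulates phi(k) for normalization
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S += np.outer(phi(k), v)
        z += phi(k)
        outs.append((phi(q) @ S) / (phi(q) @ z))
    return np.array(outs)
```

A full-attention decoder would instead re-attend over a cache of all previous keys and values, which is what makes it slow at 128K–1M tokens.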
Additionally, for residual connections, a design now a decade old, Kimi introduced the Attention Residuals scheme, which replaces the fixed additive accumulation with softmax attention over previous layers' outputs. This addresses the longstanding problem of hidden-state norms growing uncontrollably with depth, which dilutes the contribution of deeper layers; instead, each layer can selectively aggregate information based on the input content. The work prompted Andrej Karpathy, a founding member of OpenAI, to remark that our understanding of the seminal Transformer paper "Attention Is All You Need" is still insufficient. Elon Musk, founder of xAI, also commented that Kimi's work is impressive.
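The idea of replacing additive accumulation with softmax-weighted aggregation can be sketched as follows. This toy treats the current layer's representation as a query over the stack of previous layer outputs, so the result is a convex combination whose norm cannot exceed the largest previous-layer norm, unlike a plain sum, which grows with depth. The scoring rule here is an assumption for illustration, not Kimi's published formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_residual(history, query_vec):
    """Aggregate previous layer outputs by softmax attention.

    history: list of previous layer outputs, each shape (d,).
    query_vec: the current layer's representation, shape (d,).
    Returns a convex combination of the history, replacing the
    unbounded additive residual x + f1(x) + f2(x) + ...
    """
    H = np.stack(history)                      # (n_layers, d)
    scores = H @ query_vec / np.sqrt(H.shape[-1])
    weights = softmax(scores)                  # sums to 1
    return weights @ H
```

Because the weights are input-dependent, a deep layer can up-weight whichever earlier representation is most relevant rather than receiving an ever-larger fixed sum.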
In the multimodal research area, Yang Zhilin shared an important observation: in native vision-text joint pretraining, vision reinforcement learning (Vision RL) can significantly enhance text performance. Ablation experiments showed that after training with Vision RL, the model’s performance on pure text benchmarks like MMLU-Pro and GPQA-Diamond improved by about 2.1%. This indicates that enhancing spatial reasoning and visual logic can effectively translate into deeper general cognitive abilities.
At the end of his speech, Yang Zhilin delved into the expansion of agent swarms. He believes future intelligent forms will evolve from single agents to dynamically generated clusters. The Orchestrator mechanism introduced in Kimi K2.5 can decompose complex long tasks into dozens of sub-agents for parallel processing. To prevent “serial collapse” caused by single points of dependency during collaboration, the team designed a new parallel RL reward function to truly incentivize models to learn task decomposition and parallel execution.
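The decompose-then-parallelize pattern behind such an orchestrator can be sketched with a thread pool. Here `decompose`, `run_subagent`, and `merge` are hypothetical placeholders for task splitting, per-sub-agent execution, and result aggregation; the point of the sketch is that independent subtasks run concurrently, whereas a chain of dependent subtasks would force serial execution:

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task, decompose, run_subagent, merge):
    """Split a task into independent subtasks, run them in parallel, merge.

    If decompose produces subtasks that depend on each other's results,
    they cannot be mapped concurrently like this -- the "serial collapse"
    a parallelism-aware reward would penalize.
    """
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(run_subagent, subtasks))
    return merge(results)
```

Example usage: splitting a string into words, processing each independently, and rejoining the results.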
In summary, Yang Zhilin discussed the paradigm shift in AI research. He mentioned that ten years ago, research focused more on publishing new ideas, but limited by computational resources, it was difficult to verify these ideas through experiments at different scales. Now, with sufficient resources and the “Scaling Ladder,” researchers can conduct rigorous large-scale experiments, leading to more confident and reliable conclusions. This is why Kimi can extract new breakthroughs from seemingly “old” technologies. Kimi will continue to follow an open-source path, contributing foundational innovations like MuonClip, Kimi Linear, and Attention Residuals to the open-source community, building more powerful models, and promoting the democratization of AI technology. (Wen Meng)