AndyBlocker
Recent Posts
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
Updated: at 16:05 | Published: at 17:46
From the Qwen team: an analysis of how outliers arise in LLMs and what effects they have.
2025
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Published: at 15:43
Starting QAT/PTQ for SNN-LLMs, so rereading some of the activation-quantization work I'd gone through before.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Updated: at 17:46 | Published: at 15:35
NeurIPS 2025 Best Paper, from Qwen. The experiments are exceptionally solid; they clearly have the budget for it.
Nested Learning: The Illusion of Deep Learning Architectures
Updated: at 17:08 | Published: at 11:40
New work from Google, billed as a "new paradigm for deep learning". It brings up asynchrony, specifically meaning that layers closer to the input are updated more frequently than later layers, an idea reminiscent of an earlier Sakana AI paper. But the content all feels like Fast Weight Programming material, and the full arXiv paper still hasn't been posted.
Kimi Linear: An Expressive, Efficient Attention Architecture
Updated: at 19:10 | Published: at 13:55
Kimi Linear, with fairly detailed experiments and scale-up results. The finding that linear attention allows dropping RoPE is a pleasant surprise.
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Updated: at 15:15 | Published: at 17:05
Work from AI Lab surveying LLM inference acceleration in the "broad" sense, covering linear attention, sparse attention, diffusion LLMs, applications, and more.
Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2
Updated: at 00:39 | Published: at 23:32
ICLR 2025 Workshop paper: a matmul-free SNN LLM built on HAQ (though experiments only go up to 370M parameters), deployed on Loihi 2, achieving 3× throughput and 2× energy efficiency relative to a Qwen-500M model. Honestly, though, the paper barely explains its key points, and there is nothing especially exciting in it.
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
Updated: at 16:46 | Published: at 14:43
DeltaNet
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Updated: at 15:07 | Published: at 13:50
VLDB 2024, from Alibaba; the engineering looks especially solid. On LLM workloads, sparse loading of the weights alone yields a 3-4× speedup in the decode stage.