Tag: 算法
All the articles with the tag "算法".
Attention Residuals
Updated: at 17:29 · Published: at 15:19
The Kimi team's extension of residual addition. In some sense it amounts to a more complex topology; it might even have an advantage on current hardware?
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
Updated: at 16:05 · Published: at 17:46
From the Qwen team: an analysis of how outliers arise in LLMs and what effects they have.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Published: at 15:43
Starting QAT/PTQ for SNN-LLM, so I'm rereading some of the activation-quantization papers I'd looked at before.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Updated: at 17:46 · Published: at 15:35
NeurIPS 2025 Best Paper, from Qwen. The experiments are exceptionally solid; they clearly have deep pockets.
Nested Learning: The Illusion of Deep Learning Architectures
Updated: at 17:08 · Published: at 11:40
A new paper from Google, billed as a "new paradigm for deep learning." It mentions asynchrony, specifically updating the layers near the input more frequently than the later ones, an idea that resembles that earlier Sakana AI paper. But everything in the paper feels like Fast Weight Programming material, and the full text still hasn't appeared on arXiv.
Kimi Linear: An Expressive, Efficient Attention Architecture
Updated: at 19:10 · Published: at 13:55
Kimi Linear, with fairly detailed experiments and scale-up. The finding that linear attention allows dropping RoPE is a pleasant surprise.
Parallelizing Linear Transformers with the Delta Rule over Sequence Length
Updated: at 16:46 · Published: at 14:43
DeltaNet
MLP Memory: Language Modeling with Retriever-pretrained External Memory
Published: at 14:22
Uses an MLP to learn, and then replace, the probability distribution output by the kNN retriever in RAG.
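As a rough sketch of the idea (all names, shapes, and the training setup here are my own assumptions, not from the paper): a small MLP maps the LM's hidden state to the distribution the kNN retriever would have produced, so the retrieval datastore can be dropped at inference time and the MLP's output mixed with the LM's, kNN-LM style.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, hidden = 8, 16, 32  # toy sizes, purely illustrative

# Hypothetical MLP standing in for the kNN retriever's output distribution.
W1 = rng.normal(scale=0.1, size=(d_model, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, vocab))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_memory(h):
    """Predict the kNN token distribution from hidden state h."""
    return softmax(np.maximum(h @ W1, 0.0) @ W2)

def interpolate(p_lm, h, lam=0.25):
    """kNN-LM-style mixing, with the MLP replacing the real retriever."""
    return lam * mlp_memory(h) + (1.0 - lam) * p_lm
```

In the real method the MLP would be trained (e.g. with a KL loss) against the retriever's distributions; here the weights are random just to make the sketch runnable.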
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Published: at 17:47
A look at Shifted-Window Attention.
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and O(T) Complexity
Published: at 16:56
Replaces the dot product in attention with Hamming distance, avoiding the problem of misaligned spikes. The core idea is interesting, but the experiments feel mediocre: despite claiming a hardware implementation, the energy numbers are still purely algorithmic estimates, and the FPGA implementation is neither disclosed nor clearly described. Accuracy doesn't surpass the ANN2SNN SOTA. The key takeaway remains that operators unsuited to SNNs need to be replaced with alternatives.
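The core substitution can be sketched in a few lines (my own toy reconstruction, not the paper's code; shapes, naming, and the use of softmax over similarity scores are assumptions): for binary spike vectors, score similarity as dimension minus Hamming distance instead of a dot product.

```python
import numpy as np

def hamming_attention(Q, K, V):
    """Toy attention where scores are (d - Hamming distance) between
    binary spike vectors, instead of the usual dot product.
    Q, K: binary (T, d); V: (T, dv). All shapes are illustrative."""
    d = Q.shape[1]
    # Hamming distance = number of mismatched bits (XOR count).
    dist = (Q[:, None, :] != K[None, :, :]).sum(axis=-1)  # (T, T)
    scores = d - dist  # higher = more similar
    # Softmax over the similarity scores (an assumption for this sketch).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the distance counts mismatches directly, two spike trains that fire the same bits get the maximum score even when a dot product between sparse binary vectors would be near zero, which is the "misaligned spikes" issue the paper targets.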