Articles

List of all articles.

Kimi Linear: An Expressive, Efficient Attention Architecture

Updated: 4 Nov, 2025

Kimi Linear, with fairly detailed experiments and scale-up results. The conclusion that RoPE can be dropped once Linear Attention is used is a pleasant surprise.

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Updated: 16 Oct, 2025

A survey from AI Lab on LLM inference acceleration in the "broad" sense, covering Linear Attention, Sparse Attention, Diffusion LLM, Applications, and more.

Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

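Updated: 29 Sep, 2025

ICLR 2025 Workshop. A MatMul-free SNN LLM implemented on top of HAQ (although the experiments only cover 370M parameters) is deployed on Loihi 2, reaching 3× the throughput and 2× the energy efficiency of a Qwen-500M model. Honestly, though, the paper barely explains its key points, and nothing in it is particularly exciting.

Parallelizing Linear Transformers with the Delta Rule over Sequence Length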

Updated: 26 Sep, 2025

DeltaNet

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

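Updated: 24 Sep, 2025

VLDB 2024, from Alibaba; the engineering looks very solid. On LLM workloads, sparse loading of the weights alone yields a 3-4x speedup in the decode phase.

SpikingBrain-瞬息 1.0 Technical Report: A Native, Domestically Developed, Independently Controllable Brain-Inspired Spiking Large Model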

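Updated: 15 Sep, 2025

Technical report on new work from Prof. Li Guoqi's group. Honestly, I don't think this is a proper SNN-LLM work; it feels like it has become entirely a domestic localization of Linear Attention. Hard to comment on.

MLP Memory: Language Modeling with Retriever-pretrained External Memory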

Updated: 25 Aug, 2025

Uses an MLP to learn, and then replace, the probability distribution output by kNN in RAG.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

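Updated: 14 Aug, 2025

ACL 2025 Best Paper, new work from DeepSeek. A hierarchical KV cache raises sparsity and improves performance in both training and inference.

Sparse Acceleration of SNNs on GPU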

Updated: 14 Jul, 2025

A summary of my recent work on sparse acceleration of SNNs on GPU, even though it was not very successful.

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Updated: 7 Jul, 2025

T-MAC uses LUTs to accelerate the BitNet family of models and runs on CPU. A follow-up work, T-MAN, runs LUT acceleration on the NPU in Qualcomm mobile processors.
