Archive

All archived posts.

2026 ⁷

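June ²

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers

Updated: 16 Jun, 2026

The authors argue that softmax works because it keeps the Frobenius norm of the attention matrix at O(sqrt(N)), which stabilizes training, so they propose replacing softmax with polynomial activations that achieve similar norm control in expectation. After working through the theory I found out the paper was rejected, with ICLR 2026 scores of 2/2/2/2, and immediately lost interest in reading further. Neither the experiments nor the theory seem very strong.

AsyncT vLLM Adaptation and Acceleration Notes (Part 3)

Updated: 5 Jun, 2026

The final part. It covers the final Hopper-specialized version of the AsyncT kernel, a breakdown of the final results, and some analysis of possible follow-up work. The next step is to optimize training.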

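May ²

AsyncT vLLM Adaptation and Acceleration Notes (Part 2)

Updated: 26 May, 2026

The second acceleration post, mainly about further CUDA kernel optimizations and a reflection on earlier benchmarking problems.

AsyncT vLLM Adaptation and Acceleration Notes (Part 1)

Updated: 25 May, 2026

Part one of the notes, covering some preliminaries, the basic vLLM integration flow, a simple Triton kernel implementation, and the most basic version of the CUDA kernel.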

March ¹

Attention Residuals

Updated: 18 Mar, 2026

An extension of residual addition from the Kimi team. In a sense it looks like a more complex topology; maybe it has an advantage on current hardware?

February ¹

A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Updated: 2 Mar, 2026

From the Qwen team: an analysis of how outliers arise in LLMs and what effects they have.

January ¹

2025

Updated: 19 Jan, 2026

2025.

2025 ⁴²

December ²

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Updated: 30 Dec, 2025

I am starting on QAT/PTQ for SNN-LLMs, so I am rereading some activation quantization papers I had read before.

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Updated: 3 Dec, 2025

NIPS 2025 Best Paper, from Qwen. The experiments are extremely solid; they really have the budget.

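November ²

Nested Learning: The Illusion of Deep Learning Architectures

Updated: 10 Nov, 2025

New work from Google, billed as a "new paradigm for deep learning". It mentions asynchrony, specifically updating the parts of the model near the input more frequently than the later parts, an idea somewhat similar to the earlier Sakana AI paper. But everything in the paper feels like Fast Weight Programming, and the full text has still not been posted on arXiv.

Kimi Linear: An Expressive, Efficient Attention Architecture

Updated: 4 Nov, 2025

Kimi Linear, with fairly detailed experiments and scale-up. The finding that RoPE can be dropped when linear attention is present is a pleasant surprise.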

October ¹

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Updated: 16 Oct, 2025

A survey from AI Lab on LLM inference acceleration in a broad sense, covering linear attention, sparse attention, diffusion LLMs, applications, and more.

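September ⁴

Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2

Updated: 29 Sep, 2025

ICLR 2025 Workshop. A MatMul-free SNN LLM implemented with HAQ (though experiments only go up to 370M parameters) is deployed on Loihi 2, achieving 3× the throughput and 2× the energy efficiency of a Qwen-500M model. Honestly, the paper barely explains its key points, and there is nothing particularly exciting in it.

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Updated: 26 Sep, 2025

DeltaNet.

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Updated: 24 Sep, 2025

VLDB 2024, from Alibaba; the engineering looks very solid. On LLM workloads, sparse loading of the weights alone yields a 3-4× speedup in the decode phase.

SpikingBrain-瞬息 1.0 Technical Report: A Native Domestic, Independently Controllable Brain-Inspired Spiking Large Model

Updated: 15 Sep, 2025

Technical report for new work from Prof. Guoqi Li's group. Honestly, I do not consider this a proper SNN-LLM work; it feels like it is entirely about domestic localization of linear attention. Hard to evaluate.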

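August ²

MLP Memory: Language Modeling with Retriever-pretrained External Memory

Updated: 25 Aug, 2025

Uses an MLP to learn and replace the probability distribution output by kNN in RAG.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Updated: 14 Aug, 2025

ACL 2025 Best Paper, new work from DeepSeek. A hierarchical KV cache increases sparsity and improves performance in both training and inference.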

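July ²

Sparse Acceleration of SNNs on GPUs

Updated: 14 Jul, 2025

A summary of my recent work on sparse acceleration of SNNs on GPUs, although it was not very successful.

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Updated: 7 Jul, 2025

T-MAC uses LUTs to accelerate BitNet-family models on CPUs. There is a follow-up called T-MAN that runs LUT acceleration on the NPU in Qualcomm mobile chips.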

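June ⁹

HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches

Updated: 25 Jun, 2025

ISCA 2025, on tiling for sparse dataflows. I did not have the energy to read the second half; my current work has not gotten to sparse encoding yet.

SNN on GPU

Updated: 24 Jun, 2025

I am about to start working on SNN inference acceleration on GPUs, so I am writing some notes to organize my thoughts.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

23 Jun, 2025

A look at shifted-window attention.

SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and O(T) Complexity

17 Jun, 2025

Replaces the dot product in attention with Hamming distance to avoid the case where spikes are misaligned. The method itself is fairly interesting, but the experiments feel mediocre: despite claiming a hardware implementation, the energy numbers are still computed purely at the algorithm level, and the FPGA implementation details are not disclosed, or at least not explained clearly. Accuracy does not surpass the ANN2SNN SOTA. The main takeaway is still that operators ill-suited to SNNs need to be replaced with other operations.

Sparse Spiking Neural Network: Exploiting Heterogeneity in Timescales for Pruning Recurrent SNN

11 Jun, 2025

ICLR 2024 Spotlight. Uses Lyapunov Noise for SNN pruning.

Prosperity: Accelerating Spiking Neural Networks via Product Sparsity

Updated: 11 Jun, 2025

An SNN accelerator paper under submission to HPCA. Its "Product Sparsity" is essentially about eliminating repeated computation on identical content, which is a different concept from sparsity as usually discussed.

Towards Scalable GPU-Accelerated SNN Training via Temporal Fusion

10 Jun, 2025

Not clear what the point is: it implements LIF in a layer-by-layer fashion and has no other contribution. Published at a conference called ICANN. The amount of work is just too small.

Recurrent Residual Module for Fast Inference in Videos

Updated: 9 Jun, 2025

CVPR 2018. DiffEncode plus sparse acceleration, but it feels too dated.

Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models

Updated: 9 Jun, 2025

A fairly influential NIPS 2022 paper on accelerating inference for GANs and diffusion models. It proposes Spatially Sparse Inference, which applies convolution filters sparsely only on the edited regions while reusing cached features for the unedited regions.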

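May ⁸

SlowFast Networks for Video Recognition

Updated: 30 May, 2025

A multi-branch CNN. Could some branches learn more similar inter-frame changes?

DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos

Updated: 23 May, 2025

Exploits the "linearity" of CNN layers to take feature differences between frames, with CUDA acceleration. Almost the same idea as ViStream; could it solve our current problem?

Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks

Updated: 21 May, 2025

ISCA 2025, an SNN accelerator based on structured sparsity. Storing everything directly in a LUT could require too many sparse patterns and too much memory, so a level of "structured sparsity" is calibrated in advance, splitting the online spike activations into an L1 sparse part that can be computed entirely with LUTs and an L2 sparse part with very high sparsity. Could the idea be adapted to GPUs?

Temporal Flexibility in Spiking Neural Networks: Towards Generalization Across Time Steps and Deployment Friendliness

21 May, 2025

ICLR 2025 Poster. It also seems to be doing elastic inference?

A Simple Framework for Contrastive Learning of Visual Representations

20 May, 2025

The SimCLR contrastive learning paper. Can contrastive learning align the features of every layer?

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Updated: 8 May, 2025

QKFormer, NIPS 2024 Spotlight. It pushes Direct Training SNN results on ImageNet and CIFAR very high; any future work in this area will probably have to deal with it.

Transformers without Normalization

7 May, 2025

New work from Kaiming He. It replaces normalization with DyT, turning a synchronizing operation into an element-wise one. My new paper uses it, so I am studying it.

Visualizing and Understanding the Effectiveness of BERT

Updated: 6 May, 2025

While working on SNN training recently, I have been looking into how to visualize the loss during training and wondering whether the newly added method affects the model's loss landscape. Papers on visualizing loss landscapes generally cite this paper's analysis and approach.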

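April ³

One-Minute Video Generation with Test-Time Training

Updated: 22 Apr, 2025

The TTT video generation work whose demo has been popular lately; it can generate long videos on the order of 60 s. I am studying TTT; could SNN on-chip learning be combined with TTT?

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

Updated: 21 Apr, 2025

I have been working on SNN training these past few days and need to verify the accuracy of the surrogate gradient in use. My advisor suggested reading this paper and using Evolution Strategies to check the accuracy of the current gradient estimate.

SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute

Updated: 15 Apr, 2025

SparTA, a DNN compiler with sparsity optimizations. It treats tensor sparsity as an important attribute during compilation to generate efficient code.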

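March ³

Scalable Diffusion Models with Transformers

Updated: 16 Mar, 2025

Diffusion Transformer.

A First Look at AI Infra

Updated: 11 Mar, 2025

Taking the opportunity of my recent internship search to study and summarize the scattered knowledge I had picked up about model inference and training acceleration, along with some CUDA programming and architecture topics.

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Updated: 8 Mar, 2025

Replaces self-attention with large-kernel DS convolution. From ByteDance Singapore.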

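February ²

SpikeCV: Open a Continuous Computer Vision Era

Updated: 8 Mar, 2025

An open-source framework for event cameras.

Neuromorphic computing at scale

Updated: 8 Mar, 2025

A review published in Nature discussing the problems and challenges currently facing the SNN/neuromorphic computing community, along with some possible directions.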

January ⁴

Titans: Learning to Memorize at Test Time

Updated: 8 Mar, 2025

A new architecture derived from TTT that tries to improve the model's memory through a TTT-style approach.

Segment Anything

Updated: 8 Mar, 2025

Meta's SAM.

SDiT: Spiking Diffusion Model with Transformer

Updated: 8 Mar, 2025

A spiking Diffusion Transformer whose Transformer structure comes from RWKV.

2024

Updated: 8 Mar, 2025

2024.

2024 ³⁴

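December ⁴

ConvUNeXt:An efficient convolution neural network for medical image segmentation

Updated: 8 Mar, 2025

ConvNext + UNet, published in a C-ranked journal. I am reading it for reference while thinking about how to design my own module.

Rethinking the Membrane Dynamics and Optimization Objectives of Spiking Neural Networks

Updated: 8 Mar, 2025

NIPS 2024. Mainly studies how the initial membrane potential set before inference affects accuracy on static tasks.

ConvNext V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Updated: 8 Mar, 2025

The ConvNext follow-up, introducing MAE.

A ConvNet for the 2020s

Updated: 8 Mar, 2025

CVPR 2022, from Meta. At a time when ViT-related work dominates vision, it redesigns a pure convolutional network and achieves very good results.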

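November ¹

LoCC Work Summary

Updated: 8 Mar, 2025

From my advisor coming up with the idea to submitting the paper took only two weeks. This was the first time I followed a paper through from start to finish.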

October ³

Were RNNs All We Needed?

Updated: 8 Mar, 2025

Improves RNNs to make them easier to scale up.

SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference

Updated: 8 Mar, 2025

Generates matrix-multiplication kernels on GPUs, using load balancing and sparsity for acceleration; PTX code is generated from the model.

VPRTempo: A Fast Temporally Encoded Spiking Neural Network for Visual Place Recognition

Updated: 8 Mar, 2025

An ICRA 2024 paper that uses a temporally encoded SNN trained directly with STDP for place recognition. Too simple.

August ³

Memory-Efficient Reversible Spiking Neural Networks

Updated: 8 Mar, 2025

A design that speeds up training and reduces GPU memory usage.

SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding

Updated: 8 Mar, 2025

SNN + Mamba for temporal video grounding (TVG). From Harbin Institute of Technology and Peking University.

Integer-Valued Training and Spike-Driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection

Updated: 8 Mar, 2025

SpikeYOLO, from the Institute of Automation, Chinese Academy of Sciences. ECCV 2024 Oral.

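July ³

Survey of SNN Video Streaming Tasks

Updated: 7 May, 2025

A look at some work on tasks over video streams, with a rough plan for follow-up work.

SpikeZIP-TF: Conversion is All You Need for Transformer-based SNN

Updated: 8 Mar, 2025

Work by my senior labmate Kang You: an ANN2SNN Transformer.

SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence

Updated: 8 Mar, 2025

SpikingJelly from Peking University, a highly influential SNN framework that covers the full pipeline from data encoding and dataset integration to training and hardware deployment; a torch-level piece of work for SNNs. Published in Science Advances.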

June ²

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Updated: 8 Mar, 2025

An integer-only PTQ work for LLMs.

Programming Language Theory Notes

Updated: 8 Mar, 2025

Review notes for the programming language theory course.

May ²

The Minimum Equivalent DNF Problem and Shortest Implicants

Updated: 8 Mar, 2025

Proves a completeness result for the MIN-DNF problem.

I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference

Updated: 8 Mar, 2025

Integer-only quantization for ViT, W8A8. From the Chinese Academy of Sciences, ICCV 2023.

March ¹⁶

Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Updated: 8 Mar, 2025

EAGL. Claims to quantize ResNet in under 3 seconds using only a CPU, far more efficient than traditional methods such as HAWQ.

Towards spike-based machine intelligence with neuromorphic computing

Updated: 8 Mar, 2025

A review of SNNs in Nature.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Updated: 8 Mar, 2025

Flash Attention, an algorithm that exploits the hardware structure to speed up attention computation and reduce memory usage. The core ideas are tiling, online softmax, and kernel fusion.

WWW: What, When, Where to Compute-in-Memory

Updated: 8 Mar, 2025

Some validation of and reflections on compute-in-memory.

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Updated: 8 Mar, 2025

From Google; the first work to get a complete integer-only quantized inference pipeline running end to end.

SpikeSim: An end-to-end Compute-in-Memory Hardware Evaluation Tool for Benchmarking Spiking Neural Networks

Updated: 8 Mar, 2025

A hardware design or evaluation benchmark for SNN deployment.

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Updated: 8 Mar, 2025

From IPADS. Uses a model to predict which MoE experts or neurons in the LLM need to be activated, reducing resource consumption.

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

Updated: 8 Mar, 2025

An introduction to GEMM data mapping, mainly for accelerators based on systolic arrays.

HAWQ: Hessian Aware Quantization of Neural Networks with Mixed-Precision

Updated: 8 Mar, 2025

A classic model quantization method based on the Hessian matrix, that is, a quantization method that uses second-order information.

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Updated: 8 Mar, 2025

BISMO optimizations.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Updated: 8 Mar, 2025

TVM.

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures

Updated: 8 Mar, 2025

The Roofline model, which describes whether a system's performance is memory-bound or compute-bound.

A Comprehensive Survey on Electronic Design Automation and Graph Neural Networks: Theory and Applications

Updated: 8 Mar, 2025

A survey of graph neural network applications in EDA.

A Hardware-Software Blueprint for Flexible Deep Learning Specialization

Updated: 8 Mar, 2025

VTA.

BISMO: A Scalable Bit Serial Matrix Multiplication Overlay for Reconfigurable Computing

Updated: 8 Mar, 2025

BISMO.

Code Transpilation for Hardware Accelerators

Updated: 8 Mar, 2025

Based on Metalift; still far from complete.
