Tag: Inference Acceleration (推理加速)
All articles with the tag "Inference Acceleration" (推理加速).
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
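Updated: at 15:07 | Published: at 13:50
VLDB 2024, work from Alibaba; the engineering looks very solid. On LLM workloads, loading the weights in sparse form is enough on its own to yield a 3-4x speedup in the decode phase.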
Sparse Acceleration of SNNs on GPUs
Updated: at 11:09 | Published: at 14:11
A summary of my recent work on sparse acceleration of SNNs on GPUs, although it was not very successful.
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
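Published: at 16:23
T-MAC uses lookup tables (LUTs) to accelerate the BitNet family of models and runs on CPUs. A follow-up work, T-MAN, runs LUT-based acceleration on the NPU inside Qualcomm mobile SoCs.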
HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches
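Published: at 16:27
ISCA 2025, on tiling for sparse dataflows. I did not have the energy to read the second half closely, and my current work does not involve sparse encoding yet.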
SNN on GPU
Published: at 11:48
I am about to start working on SNN inference acceleration on GPUs, so I am writing some notes to organize my thoughts.
Prosperity: Accelerating Spiking Neural Networks via Product Sparsity
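Published: at 16:52
An SNN accelerator paper under submission to HPCA. Its “Product Sparsity” is essentially about avoiding repeated computation of identical content, which is a different concept from sparsity as it is usually discussed.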
Recurrent Residual Module for Fast Inference in Videos
Published: at 15:25
CVPR 2018. DiffEncode + sparse acceleration, but it feels too dated.
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
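Published: at 14:18
A fairly influential NeurIPS 2022 paper on inference acceleration for GANs and diffusion models. It proposes Spatially Sparse Inference, which applies convolutional filters sparsely, only on the edited regions, while reusing cached features for the unedited regions.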
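A First Look at AI Infra
Updated: at 18:30 | Published: at 16:04
While looking for an internship, I am taking the chance to study and summarize what I had previously picked up piecemeal about model inference and training acceleration, along with some material on CUDA programming and GPU architecture.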
SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference
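Updated: at 15:06 | Published: at 14:18
Operator generation for matrix multiplication (MM) on GPUs. It uses load balancing and sparsity for acceleration and generates PTX code according to the model.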