Tag: 推理加速 (Inference Acceleration)
All the articles with the tag "推理加速".
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Updated: at 15:06 · Published: at 13:27 — FlashAttention, an algorithm that exploits the hardware memory hierarchy to speed up attention computation and reduce memory usage. Its core techniques are Tiling, Online Softmax, and Kernel Fusion.
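The Online Softmax trick mentioned above can be illustrated in isolation: instead of a two-pass softmax (find the max, then normalize), a running maximum and a rescaled running sum are maintained in a single pass. This is a minimal NumPy sketch of the idea only, not FlashAttention's actual tiled CUDA kernel; the function name is illustrative.

```python
import numpy as np

def online_softmax(scores):
    # Single-pass softmax: keep a running max m and a running sum s of
    # exp(x - m); when m grows, rescale s so all terms stay consistent.
    m = float("-inf")  # running maximum seen so far
    s = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    # Final normalization uses the global max and the accumulated sum.
    return np.array([np.exp(x - m) / s for x in scores])
```

The rescaling factor `exp(m - m_new)` is what lets FlashAttention process attention scores block by block without ever materializing the full score matrix.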
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Updated: at 15:06 · Published: at 18:33 — From Google; the first work to run a complete integer-only quantized inference pipeline end to end.
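The affine quantization scheme at the heart of this line of work maps a real value x to an 8-bit integer q via x ≈ scale · (q − zero_point). A minimal sketch, assuming per-tensor uint8 quantization; the helper names (`quantize`, `dequantize`, `calibrate`) are illustrative, not the paper's API.

```python
import numpy as np

def calibrate(x_min, x_max):
    # Derive (scale, zero_point) from an observed real-valued range
    # so that [x_min, x_max] maps onto the uint8 range [0, 255].
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    # Affine (asymmetric) quantization: q = round(x / scale) + Z
    q = np.round(np.asarray(x) / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Recover an approximate real value: x ≈ scale * (q - Z)
    return scale * (q.astype(np.float32) - zero_point)
```

The round-trip error is bounded by one quantization step (`scale`), which is why calibrating `x_min`/`x_max` tightly to the actual activation range matters.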
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Updated: at 15:06 · Published: at 18:32 — From IPADS; uses a predictor model to forecast which MoE experts or neurons an LLM will activate, cutting resource consumption during serving.