Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
PDF: https://attention-survey.github.io/files/Attention_Survey.pdf
Paper webpage: https://attention-survey.github.io
This paper provides a comprehensive survey of Efficient Attention Methods, categorizing them into four classes.
- [2025/8/19] 🎉 Our survey paper is now publicly available on GitHub! If you find our resources helpful, please cite our paper.
💡 Core Idea: Accelerate attention by leveraging hardware characteristics.
📝 Overall Formulations:
Hardware-efficient attention for the prefilling stage can be formulated as:

$$O = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Where $Q, K, V \in \mathbb{R}^{N \times d}$ are the query, key, and value matrices for a sequence of length $N$.

Hardware-efficient attention for the decoding stage can be formulated as:

$$o_t = \mathrm{Softmax}\!\left(\frac{q_t K_{1:t}^\top}{\sqrt{d}}\right)V_{1:t}$$

Where $q_t \in \mathbb{R}^{1 \times d}$ is the current query, and $K_{1:t}, V_{1:t} \in \mathbb{R}^{t \times d}$ are the cached keys and values.
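The prefilling and decoding formulations above can be sketched in NumPy; the sequence length, head dimension, and function names below are illustrative assumptions, not the paper's notation:

```python
# Minimal sketch (illustrative shapes, non-causal) of the prefilling and
# decoding formulations of attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefill_attention(Q, K, V):
    """All N queries at once: O = softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def decode_attention(q, K_cache, V_cache):
    """One query against the KV cache: o = softmax(qK^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ K_cache.T / np.sqrt(d)) @ V_cache

rng = np.random.default_rng(0)
N, d = 8, 16
Q, K, V = rng.normal(size=(3, N, d))

# Decoding the last token against the full cache reproduces the last row
# of the prefilling output.
o_prefill = prefill_attention(Q, K, V)[-1]
o_decode = decode_attention(Q[-1:], K, V)[0]
assert np.allclose(o_prefill, o_decode)
```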
An example is FlashAttention, which tiles $Q$, $K$, and $V$ into blocks that fit in on-chip SRAM and computes the softmax online, so the full $N \times N$ attention matrix is never materialized in high-bandwidth memory.
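A minimal sketch of the tiling-plus-online-softmax idea behind FlashAttention follows; the block size and non-causal setting are assumptions for illustration, and the real kernel fuses these loops on-chip:

```python
# Tiled attention with an online softmax: only one (N x B) score tile
# exists at a time, never the full N x N matrix.
import numpy as np

def tiled_attention(Q, K, V, B=4):
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)            # running row-wise max
    l = np.zeros(N)                    # running softmax normalizer
    for j in range(0, N, B):           # iterate over K/V tiles
        Kj, Vj = K[j:j+B], V[j:j+B]
        S = Q @ Kj.T / np.sqrt(d)      # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)      # rescale previous partial results
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))

# Matches the untiled reference softmax(QK^T / sqrt(d)) V.
S = Q @ K.T / np.sqrt(16)
ref = np.exp(S - S.max(1, keepdims=True))
ref = ref / ref.sum(1, keepdims=True) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```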
The Table below summarizes various hardware-efficient attention methods. 👇
💡 Core Idea: Compress the KV cache of attention via weight sharing or low-rank decomposition while keeping the computational cost unchanged relative to a full-sized KV cache.
📝 Overall Formulations:
The Table below summarizes various compact attention approaches. 👇
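As one concrete instance of the weight-sharing route, grouped-query attention (GQA) lets several query heads share one key/value head; the head counts and shapes below are hypothetical:

```python
# Sketch of KV-cache compression by weight sharing: H query heads share
# G < H key/value heads, shrinking the cache by H/G while each head's
# attention computation is unchanged.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gqa(Q, K, V):
    """Q: (H, N, d); K, V: (G, N, d) with H divisible by G."""
    H, N, d = Q.shape
    G = K.shape[0]
    group = H // G                            # query heads per shared KV head
    out = np.empty_like(Q)
    for h in range(H):
        Kg, Vg = K[h // group], V[h // group] # shared K/V for this head
        out[h] = softmax(Q[h] @ Kg.T / np.sqrt(d)) @ Vg
    return out

rng = np.random.default_rng(0)
H, G, N, d = 8, 2, 6, 16
Q = rng.normal(size=(H, N, d))
K, V = rng.normal(size=(2, G, N, d))
O = gqa(Q, K, V)   # the KV cache holds G=2 heads instead of H=8
```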
💡 Core Idea: Selectively performing a subset of computations in attention while omitting others.
📝 Overall Formulations:
The Table below summarizes various sparse attention methods. 👇
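One simple member of this family is sliding-window attention, sketched below; the window size is a hypothetical choice, and a real kernel would skip the masked blocks entirely rather than compute and discard them:

```python
# Sparse attention via a causal sliding-window mask: each query attends
# only to the W most recent keys, so only a subset of the score matrix
# contributes to the output.
import numpy as np

def sliding_window_attention(Q, K, V, W=3):
    N, d = Q.shape
    S = Q @ K.T / np.sqrt(d)
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    mask = (j <= i) & (j > i - W)          # causal, last W positions only
    S = np.where(mask, S, -np.inf)
    P = np.exp(S - S.max(-1, keepdims=True))
    P = P / P.sum(-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
O = sliding_window_attention(Q, K, V)
```

The first token can only attend to itself, so its output is exactly its own value vector.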
💡 Core Idea: Redesigning the computational formulation of attention to achieve $\mathcal{O}(N)$ time complexity.
📝 Overall Formulations:
Linear Attention can be implemented in three forms: parallel, recurrent, and chunkwise.
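The parallel and recurrent forms compute identical outputs; a minimal sketch (unnormalized, identity feature map assumed, chunkwise form omitted) makes the equivalence explicit:

```python
# Causal linear attention: o_t = q_t^T * sum_{s<=t} k_s v_s^T, computed
# either all at once (parallel) or with a running d x d_v state (recurrent).
import numpy as np

def linear_attn_parallel(Q, K, V):
    A = np.tril(Q @ K.T)        # causal mask on the q.k scores
    return A @ V                # O(N^2), like softmax attention

def linear_attn_recurrent(Q, K, V):
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))       # running state: sum of k v^T outer products
    out = np.empty((Q.shape[0], dv))
    for t in range(Q.shape[0]):
        S += np.outer(K[t], V[t])
        out[t] = Q[t] @ S       # O(1) state update per decoded token
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
assert np.allclose(linear_attn_parallel(Q, K, V),
                   linear_attn_recurrent(Q, K, V))
```

The chunkwise form interpolates between the two: parallel within a chunk, recurrent across chunks.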
Many linear attention methods incorporate forget gates and select gates.
Based on the presence of these gates, we can classify linear attention methods as follows:
- Naive Linear Attention (No Gates)

📝 The Table below summarizes naive linear attention methods. 👇
- Linear Attention with a Forget Gate

📝 This Table compares methods that use a forget gate. 👇
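The forget-gate idea can be sketched as a decay applied to the recurrent state before each update; the scalar, data-independent decay below is a simplifying assumption, whereas real methods typically learn per-head or data-dependent gates:

```python
# Linear attention with a forget gate: S_t = gamma * S_{t-1} + k_t v_t^T,
# so older key-value pairs are exponentially down-weighted.
import numpy as np

def gated_linear_attn(Q, K, V, gamma=0.9):
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))
    out = np.empty((Q.shape[0], dv))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])   # forget, then write
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
O = gated_linear_attn(Q, K, V)
```

With `gamma = 1.0` the gate is inactive and this reduces to ungated causal linear attention.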
- Linear Attention with Forget and Select Gates

📝 This Table compares methods that utilize both the forget gate and the select gate. 👇
A unique approach, Test-Time Training (TTT), treats the hidden state of linear attention as learnable parameters that are updated by gradient descent on each incoming token.
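A minimal sketch of that idea follows; the squared-error reconstruction loss, plain SGD inner loop, and all shapes are illustrative assumptions rather than the exact TTT recipe:

```python
# Test-Time-Training-style layer: the "state" is a weight matrix W that
# takes one gradient step per token on an inner-loop reconstruction loss,
# and the output applies the updated W to the query.
import numpy as np

def ttt_layer(Q, K, V, lr=0.1):
    d, dv = K.shape[1], V.shape[1]
    W = np.zeros((d, dv))                 # hidden state = learnable weights
    out = np.empty((Q.shape[0], dv))
    for t in range(Q.shape[0]):
        # inner loss l(W) = 0.5 * ||K_t W - V_t||^2 on the current token
        grad = np.outer(K[t], K[t] @ W - V[t])
        W = W - lr * grad                 # one gradient step per token
        out[t] = Q[t] @ W
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 16))
O = ttt_layer(Q, K, V)
```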
If you find our work helpful, please cite our paper:
@article{zhangsurvey,
title={Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention},
author={Zhang, Jintao and Su, Rundong and Liu, Chunyu and Wei, Jia and Wang, Ziteng and Zhang, Pengle and Wang, Haoxu and Jiang, Huiqiang and Huang, Haofeng and Xiang, Chendong and others}
}