This is the official code for the COLM 2025 paper Stuffed Mamba: Oversized States Lead to the Inability to Forget, by Yingfa Chen*, Yutong Wu*, Chenyang Song, Zhen Leng Thai, Xingyu Shen, Xu Han, Zhiyuan Liu, Maosong Sun.
To contact the authors, please email chenyingfa1999@gmail.com or wuyutong_yuna@163.com.
RNNs/SSMs are much more efficient than Transformers in processing long contexts, but they fail to extrapolate beyond the training context length. Why? 🤔
→ They do not know how to forget past information when the state is overloaded with memories.
This is surprising since models such as Mamba (selective SSM) have a built-in forgetting mechanism (i.e., the forget gate). However, because their recurrent state is overparameterized (too large), these models can achieve good language modeling performance without learning how to forget past information.
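To make the role of the forget gate concrete, here is a minimal sketch of a scalar gated recurrence. It is a toy simplification written for illustration, not Mamba's actual parameterization: the function name `residual_memory` and the update rule `h_t = forget_gate * h_{t-1} + x_t` are assumptions. It shows that when the gate stays close to 1, the first token's contribution barely decays, so the state keeps accumulating old information.

```python
def residual_memory(forget_gate: float, num_later_tokens: int) -> float:
    """Return how much of the first token remains in the state.

    Toy scalar recurrence (an assumption, not Mamba's real parameterization):
        h_t = forget_gate * h_{t-1} + x_t
    The first input is 1.0 and every later input is 0.0, so the final state
    equals the surviving share of the first token.
    """
    state = 0.0
    inputs = [1.0] + [0.0] * num_later_tokens
    for x in inputs:
        state = forget_gate * state + x
    return state


if __name__ == "__main__":
    # Compare a gate that forgets quickly with gates that barely forget.
    for gate in (0.9, 0.99, 0.999):
        remaining = residual_memory(gate, 1000)
        print(f"gate={gate}: residual after 1000 tokens = {remaining:.4g}")
```

In this toy setting the residual is simply the gate raised to the number of later tokens, so a model that never learns to lower its gate values has no way to clear the state.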
Based on the above hypothesis and supporting experimental analyses, we arrived at the following finding:
Let
We call
- The analysis directory contains the source code for the analysis experiments (Section 4 in the paper). It also contains the code for computing perplexity as a function of context length.
- The passkey-retrieval directory contains the source code for evaluating Mamba-2, RWKV-5, and RWKV-6 in passkey retrieval.
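As a rough illustration of what "perplexity as a function of context length" can mean, the sketch below computes perplexity over a sliding window of per-token negative log-likelihoods ending at each requested position. This is not the repository's implementation: the function name `windowed_perplexity`, the windowing scheme, and the default window size are assumptions made for the example.

```python
import math
from typing import Dict, Sequence


def windowed_perplexity(
    token_nll: Sequence[float],
    positions: Sequence[int],
    window: int = 256,
) -> Dict[int, float]:
    """Perplexity of the last `window` tokens ending at each position.

    token_nll[i] is the negative log-likelihood (natural log) of token i
    given all previous tokens. Each position counts tokens from 1.
    """
    results = {}
    for pos in positions:
        if pos <= 0 or pos > len(token_nll):
            raise ValueError(f"position {pos} is outside 1..{len(token_nll)}")
        start = max(0, pos - window)
        chunk = token_nll[start:pos]
        # Perplexity is the exponential of the mean negative log-likelihood.
        results[pos] = math.exp(sum(chunk) / len(chunk))
    return results
```

A model that fails to extrapolate would show this value rising once the position exceeds the training context length, whereas a cumulative average over the whole prefix would smooth that rise out.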
For instructions on running the code to reproduce the results in the paper, check out the README file inside each of the above directories.
The code is tested with:
- PyTorch 2.3
- Python 3.12
- Transformers 4.45.2
Other required libraries are listed in requirements.txt.
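If you want to compare your environment against the tested versions above, a small check along the following lines may help. It is only a convenience sketch and is not part of the repository: the helper names `installed_versions` and `matches_tested` are hypothetical, and it assumes the packages are installed under the distribution names `torch` and `transformers`.

```python
import sys
from importlib import metadata

# Versions listed as tested in this README.
TESTED_PYTHON = (3, 12)
TESTED_PACKAGES = {"torch": "2.3", "transformers": "4.45.2"}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def matches_tested(installed, tested):
    """True if the installed version starts with the tested version components."""
    if installed is None:
        return False
    want = tested.split(".")
    # Drop local build tags such as "+cu121" before comparing.
    have = installed.split("+")[0].split(".")
    return have[: len(want)] == want


if __name__ == "__main__":
    python_ok = sys.version_info[:2] == TESTED_PYTHON
    print(f"python {sys.version_info[0]}.{sys.version_info[1]}: "
          f"{'ok' if python_ok else 'differs from tested 3.12'}")
    for name, version in installed_versions(TESTED_PACKAGES).items():
        status = "ok" if matches_tested(version, TESTED_PACKAGES[name]) else "differs"
        print(f"{name} {version}: {status} (tested with {TESTED_PACKAGES[name]})")
```

A mismatch does not necessarily mean the code will fail; it only tells you that your setup differs from the one the code was tested with.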
To run HGRN-2, you also need to install FlashLinearAttention from https://github.com/sustcsonglin/flash-linear-attention.
To perform the analysis experiments, you also need the model checkpoints, which can be downloaded from their respective official repositories. For instance, for Mamba-2, download the official checkpoints from https://huggingface.co/state-spaces. You also need the corresponding tokenizers. Then specify the paths to the checkpoints and tokenizers through command-line arguments (see the README in each subdirectory for more details).
To evaluate on synthetic data (such as passkey retrieval and the "newlines" prompt), you don't need to download any data. For training and evaluation on real data (such as RedPajama and SlimPajama), download the data from the official sources and specify the path through command-line arguments.
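To show why passkey retrieval needs no downloaded data, here is a minimal sketch of how such a prompt can be generated on the fly. It is an illustration only and may differ from the code in the passkey-retrieval directory: the function names, the prompt wording, and measuring length in filler repetitions rather than tokens are all assumptions.

```python
import random

# Irrelevant filler sentence that is repeated to pad the context.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def make_passkey_prompt(num_filler: int, depth: float, seed: int = 0):
    """Build a passkey-retrieval prompt and return (prompt, passkey).

    num_filler: total number of filler repetitions around the passkey.
    depth: relative position of the passkey in the filler, from 0.0 to 1.0.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    before = int(num_filler * depth)
    after = num_filler - before
    prompt = (
        "There is an important piece of information hidden inside a lot of "
        "irrelevant text. Find it and memorize it. "
        + FILLER * before
        + needle
        + FILLER * after
        + "What is the pass key? The pass key is"
    )
    return prompt, passkey


def is_correct(model_output: str, passkey: str) -> bool:
    """Count a generation as correct if it contains the passkey."""
    return passkey in model_output
```

Increasing `num_filler` lengthens the context, and varying `depth` moves the passkey between the start and the end of the filler, which is the usual way to probe how retrieval accuracy changes with context length and position.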
- The checkpoints of Mamba-2 trained on longer sequences will be released soon.
- Add the implementation of the training-free SC mitigation methods.
A large part of the code is copied from the following sources:
- https://www.github.com/OpenBMB/InfiniteBench: For passkey evaluation.
- https://github.com/state-spaces/mamba: For the Mamba-2 implementation.
- The HuggingFace Transformers library: For the RWKV implementation.
To cite us, please use the following BibTeX.
@inproceedings{chen2025stuffed-mamba,
title={Stuffed Mamba: Oversized States Lead to the Inability to Forget},
author={Yingfa Chen and Yutong Wu and Chenyang Song and Zhen Leng Thai and Xingyu Shen and Xu Han and Zhiyuan Liu and Maosong Sun},
year={2025},
booktitle={Second Conference on Language Modeling},
url={https://arxiv.org/abs/2410.07145},
}

