This is part 1 of my new multi-part series, *Towards spatial Mamba state models for images, videos, and time series*.
Is Mamba all you need? People have certainly long thought that about the Transformer architecture introduced by A. Vaswani et al. in *Attention Is All You Need* (2017). Since then, the Transformer has revolutionized the field of deep learning again and again. Its general-purpose architecture can be easily adapted to various data modalities, such as text, images, videos, and time series, and the more computational resources and data you give the Transformer, the better it seems to perform.
However, the Transformer's attention mechanism has a major drawback: its complexity is O(N²), meaning it scales quadratically with the length of the sequence. The longer the input sequence, the more computational resources it needs, which often makes working with very long sequences infeasible.
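To make that quadratic cost concrete, here is a minimal, illustrative sketch of plain scaled dot-product attention (the function name `naive_attention` and the sizes are my own choices for illustration, not code from this series). The intermediate score matrix of shape (N, N) is what drives the O(N²) growth in compute and memory.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, d_model)
    # The score matrix has shape (seq_len, seq_len): its memory and compute
    # cost grow quadratically with the sequence length.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

for n in (1_024, 2_048, 4_096):
    q = k = v = torch.randn(n, 64)
    out = naive_attention(q, k, v)
    # Every time n doubles, the (n, n) score matrix quadruples in size.
    print(n, tuple(out.shape), f"score matrix holds {n * n:,} entries")
```

Doubling the sequence length quadruples the number of attention scores, which is exactly the scaling behaviour that makes long sequences expensive.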
- What is this series about?
- Why do we need a new model?
- Structured state space models