Understand how SSM and Mamba work, as well as how to start implementing them in Keras and TensorFlow.
Submitted to arXiv on December 1, 2023, the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" proposed an interesting approach to sequence modeling. The authors, Albert Gu and Tri Dao, introduced 'Mamba', which uses 'selective' state space models (SSMs) to achieve results that rival the performance of the now ubiquitous Transformer model.
Transformers have recently gained popularity with the emergence of large language models (LLMs) such as LLaMA-2, GPT-4, Claude, and Gemini, but they suffer from the context window problem. The problem lies at their core: the multi-head attention mechanism.
The main problem with multi-head attention is that for an input sequence of length n, both time and space complexity grow as O(n²). This limits the length of an LLM's context window: to increase it 10 times, we need to scale the hardware requirements (mostly GPU VRAM) roughly 100 times.
Mamba, on the other hand, scales as O(n), that is, linearly.
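To see where the quadratic cost comes from, here is a minimal sketch (not the paper's code, and using plain NumPy rather than Keras) of the attention score computation: the score matrix has one entry per pair of positions, so an input of length n produces an n × n matrix.

```python
import numpy as np

def attention_scores(q, k):
    """Scaled dot-product attention scores.

    q, k: (n, d) arrays of queries and keys.
    Returns an (n, n) matrix -- memory and compute grow as n * n.
    """
    return q @ k.T / np.sqrt(k.shape[-1])

n, d = 8, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))

scores = attention_scores(q, k)
print(scores.shape)  # (8, 8)

# Doubling the sequence length quadruples the score matrix:
scores_2n = attention_scores(rng.normal(size=(2 * n, d)),
                             rng.normal(size=(2 * n, d)))
print(scores_2n.size / scores.size)  # 4.0
```

This is why a 10x longer context needs roughly 100x the memory for the attention matrices alone.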
This linear scaling is what has led researchers to speculate that Mamba could be the future of sequence modeling.
The core of the Mamba model comes from the concept of State Space Models. Like Transformers and RNNs, state space models process sequences of information, such as text, audio signals, video frames, DNA sequences, etc.
State space models arise from the idea of describing a physical system as a set of inputs, outputs, and state variables. The model is parameterized by four matrices: A, B, C, and D.
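As a minimal sketch of this idea (a generic discrete-time linear state space model, not the Mamba paper's selective variant), the matrices play the following roles: A evolves the hidden state, B injects the input into the state, C reads the state out, and D connects the input directly to the output. The recurrence is x[k+1] = A·x[k] + B·u[k] and y[k] = C·x[k] + D·u[k]:

```python
import numpy as np

def ssm_step(A, B, C, D, x, u):
    """One step of a discrete-time linear state space model.

    x[k+1] = A x[k] + B u[k]   (state update)
    y[k]   = C x[k] + D u[k]   (output)
    """
    x_next = A @ x + B @ u
    y = C @ x + D @ u
    return x_next, y

def run_ssm(A, B, C, D, inputs):
    """Scan the recurrence over a sequence of inputs -- O(n) in length."""
    x = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        x, y = ssm_step(A, B, C, D, x, u)
        outputs.append(y)
    return np.array(outputs)

# Tiny example: 2-dimensional state, scalar input and output.
A = 0.5 * np.eye(2)          # state decays by half each step
B = np.ones((2, 1))          # input is added to both state components
C = np.ones((1, 2))          # output sums the state
D = np.zeros((1, 1))         # no direct input-to-output path
ys = run_ssm(A, B, C, D, [np.array([1.0]), np.array([0.0])])
print(ys)  # [[0.], [2.]] -- the impulse shows up one step later via the state
```

Note that each step touches only the fixed-size state x, which is why this recurrence scales linearly with sequence length, in contrast to the quadratic attention matrix.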