Mamba - Linear-Time Sequence Modeling with Selective State Spaces

Mamba is a novel architecture that addresses Transformers’ computational inefficiency on long sequences.

Albert Gu, Tri Dao

Introduction

Foundation models (FMs) are large models pre-trained on massive data and then adapted for downstream tasks. FMs often employ sequence models as their backbone and are currently predominantly based on the Transformer.

The efficacy of self-attention in the Transformer is attributed to its ability to route information densely within a context window.

However, this property also brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length.

Recently, structured state space sequence models (SSMs) were proposed as a new class of architectures for sequence modeling. These models can be interpreted as a combination of RNNs and CNNs, with inspiration from classical state space models.

The results obtained from SSMs are very promising: they not only scale linearly or near-linearly with respect to sequence length but also demonstrate an outstanding ability to model long-range dependencies, dominating benchmarks such as the Long Range Arena.

In addition, many variations of SSMs have been successful in continuous signal domains such as audio and vision. However, SSMs have been less effective at modeling discrete, information-dense data such as text.

Therefore, the authors propose a new class of selective SSMs that achieves the performance of Transformers while scaling linearly in sequence length.

Selection Mechanism

Previous SSMs lack the ability to select data in an input-dependent manner. A new selection mechanism is proposed to allow the model to filter out irrelevant information and remember relevant information indefinitely.

Hardware-aware Algorithm

To overcome the requirement that all prior models be time- and input-invariant in order to be computationally efficient, the authors propose a hardware-aware algorithm. Instead of computing the model convolutionally, the algorithm computes it recurrently with a scan and does not materialize the expanded state, avoiding IO between different levels of the GPU memory hierarchy.

Architecture

The authors simplify prior SSM architectures by combining them with the MLP block of Transformers into a single block (Mamba).

Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties such as

  1. High Quality for dense modalities such as language and genomics.
  2. Fast Training and Inference, as training computation and memory scale linearly in sequence length and autoregressive inference requires constant time per step.
  3. Long Context, as the model demonstrates quality and efficiency improvements on sequences up to length 1M.

State Space Models

Structured state space models (S4) are a recent class of sequence models for deep learning that are related to RNNs, CNNs, and classical state space models. They are defined by four parameters $(\Delta, A, B, C)$ and are inspired by a continuous system

$$ \begin{aligned} h^\prime(t) &= Ah(t) + Bx(t) && \text{(1a)} \\ y(t) &= Ch(t) && \text{(1b)} \end{aligned} $$

that maps a 1-dimensional function or sequence $x(t)\in \R \mapsto y(t) \in \R$ through an implicit latent state $h(t) \in \R^N$.

S4 models use the four parameters to define a sequence-to-sequence transformation in two stages:

  1. Discretization - the first stage transforms the “continuous parameters” $(\Delta, A, B)$ into “discrete parameters” $(\bar{A}, \bar{B})$ through fixed formulas $\bar{A} = f_A(\Delta, A)$ and $\bar{B} = f_B(\Delta, A, B)$, where $(f_A, f_B)$ is a discretization rule. Various rules can be used; below is an example rule using the zero-order hold (ZOH):

$$ \bar{A} = \exp(\Delta A) \qquad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B $$

  2. Computation - after the parameters are discretized, the model can be computed in two ways: as a linear recurrence (2) or as a global convolution (3). The model uses the convolutional mode (3) for efficient, parallelizable training and the recurrent mode (2) for efficient autoregressive inference; a minimal numerical sketch of both modes follows the equations below.

$$ \begin{aligned} h_t &= \bar{A}h_{t-1} + \bar{B}x_t && \text{(2a)} \qquad & \bar{K} &= (C\bar{B}, C\bar{A}\bar{B}, \dots, C\bar{A}^k\bar{B}, \dots) && \text{(3a)} \\ y_t &= Ch_t && \text{(2b)} \qquad & y &= x * \bar{K} && \text{(3b)} \end{aligned} $$
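
To make the two modes concrete, here is a minimal NumPy sketch (illustrative only, not the paper's code; all sizes and variable names are arbitrary) of ZOH discretization followed by the recurrent mode (2) and the convolutional mode (3) for a single input channel with a diagonal $A$:

```python
import numpy as np

# Minimal 1-D SSM sketch (illustrative, not the paper's implementation).
# State size N, sequence length L; A is diagonal, as in structured SSMs.
N, L = 4, 32
rng = np.random.default_rng(0)
A = -np.abs(rng.standard_normal(N))    # diagonal of A (negative for stability)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                            # step size Δ
x = rng.standard_normal(L)             # input sequence

# Zero-order hold discretization (for diagonal A the formulas act elementwise)
A_bar = np.exp(delta * A)
B_bar = (A_bar - 1.0) / A * B          # (ΔA)^{-1}(exp(ΔA) - I) · ΔB, elementwise

# Recurrent mode (2): h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# Convolutional mode (3): y = x * K_bar with K_bar_k = C A_bar^k B_bar
K_bar = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
y_conv = np.array([np.dot(K_bar[: t + 1], x[t::-1]) for t in range(L)])

assert np.allclose(y_rec, y_conv)      # both modes compute the same output
```

The assertion at the end checks that the two modes produce identical outputs, which is what allows LTI SSMs to train convolutionally and decode recurrently.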

Limitations of State Space Models

Linear Time Invariance

A key property of equations (1), (2), and (3) is Linear Time Invariance (LTI). This means that $(\Delta, A, B, C)$ and $(\bar{A}, \bar{B})$ are fixed for all time steps. However, LTI models have fundamental limitations in modeling certain types of data. Thus, the authors propose a method that removes the LTI constraint while overcoming the efficiency bottlenecks.

Efficiency Bottleneck

In addition, structured SSMs are so named because computing them efficiently requires imposing structure on the $A$ matrix. Most commonly, $A$ is constrained to be diagonal. In this case, $A\in \R^{N\times N}$, $B \in \R^{N\times 1}$, and $C \in \R^{1\times N}$ can each be represented by $N$ numbers.

Consider an input sequence $x$ of batch size $B$ and length $L$ with $D$ channels; the SSM is applied independently to each channel. The total hidden state then has dimension $DN$ per input, and computing it over the sequence length requires $O(BLDN)$ time and memory. This is the root of the efficiency bottleneck.
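
As a rough, hypothetical illustration: with $D = 2048$ channels and state size $N = 16$, each position of the sequence carries a hidden state of $DN = 32{,}768$ values, so materializing these states across a long sequence quickly dominates memory.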

Selective State Space Models

This section explains the selection mechanism, its motivation, and its implementation.

Motivation: Selection as a Means of Compression

The fundamental problem of sequence modeling is compressing context into a smaller state. This problem can be viewed as a tradeoff between effectiveness and efficiency. For example, attention is effective but inefficient because it explicitly does not compress context at all. On the other hand, recurrent models are efficient because they have a finite state, but their effectiveness is limited by how well that state has compressed the context.

To understand the tradeoff, we focus on two synthetic tasks: the Selective Copying task and the Induction Heads task.

The above tasks reveal the failure mode of LTI models: their constant dynamics cannot select the correct information from the context or affect the hidden state in an input-dependent way.

In summary, the efficiency vs. effectiveness tradeoff can be characterized by how well the model compresses its state.

To tackle this tradeoff, the authors propose a fundamental principle called selectivity: the ability to focus on relevant information and filter out irrelevant information.

Improving SSMs with Selection

The authors incorporate a selection mechanism into the model by letting the parameters that affect interactions along the sequence be input-dependent. Therefore, the main difference is simply making the parameters $\Delta, B, C$ functions of the input.

In particular, these parameters now have a length dimension $L$, meaning that the model has changed from time-invariant to time-varying.
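
As a rough sketch of what "input-dependent" means in practice, the following NumPy snippet projects each input $x_t$ to its own $B_t$, $C_t$, and $\Delta_t$, following the parameterization described in the paper ($s_B(x) = \text{Linear}_N(x)$, $s_C(x) = \text{Linear}_N(x)$, $s_\Delta(x) = \text{Broadcast}_D(\text{Linear}_1(x))$ with $\tau_\Delta = \text{softplus}$); the weight names and sizes are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

# Illustrative shapes only; weight names are not from the paper.
L, D, N = 16, 8, 4                          # sequence length, channels, state size
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))

W_B  = rng.standard_normal((D, N)) * 0.1    # s_B(x) = Linear_N(x)
W_C  = rng.standard_normal((D, N)) * 0.1    # s_C(x) = Linear_N(x)
w_dt = rng.standard_normal((D, 1)) * 0.1    # s_Δ(x) = Broadcast_D(Linear_1(x))

B_t     = x @ W_B                           # (L, N): B now varies with each x_t
C_t     = x @ W_C                           # (L, N): C now varies with each x_t
delta_t = softplus(x @ w_dt)                # (L, 1): τ_Δ = softplus, broadcast over D

print(B_t.shape, C_t.shape, delta_t.shape)  # (16, 4) (16, 4) (16, 1)
```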

Efficient Implementation of Selective SSMs

The core limitation in using selective SSMs is their computational efficiency, which is also the reason why previous models were restricted to LTI (non-selective) forms.

Motivation of Prior Models

Overview of Selective Scan: Hardware-Aware State Expansion

The authors address the computation problem of SSMs with three classical techniques: kernel fusion, parallel scan, and recomputation. First, the authors make two main observations:

  1. Recurrent computation uses $O(BLDN)$ FLOPs while convolutional computation uses $O(BLD\log(L))$ FLOPs. Therefore, for long sequences and a not-too-large state dimension $N$, the recurrent mode actually uses fewer FLOPs.
  2. The two challenges are the sequential nature of recurrence and the large memory usage. To address the latter, the model simply does not materialize the full state $h$.
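
As a rough, hypothetical illustration of the first observation: at sequence length $L \approx 1\text{M}$, $\log_2(L) \approx 20$, so a recurrence with a modest state size such as $N = 16$ performs fewer FLOPs than the $O(BLD\log(L))$ convolution (ignoring constant factors).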

The main idea is to materialize the state $h$ only in the more efficient levels of the memory hierarchy, leveraging properties of modern accelerators. In particular, most operations, including the scan, are bounded by memory bandwidth. Kernel fusion is used to reduce the amount of memory IO, significantly speeding up the implementation.

More specifically, instead of preparing the scan input $(\bar{A}, \bar{B})$ of size $(B, L, D, N)$ in GPU HBM, the SSM parameters $(\Delta, A, B, C)$ are loaded from slow HBM into fast SRAM, the discretization and recurrence are performed in SRAM, and only the final outputs of size $(B, L, D)$ are written back to HBM. The sequential recurrence is parallelized with a work-efficient parallel scan, and recomputation avoids storing the intermediate states needed for backpropagation by recomputing them in the backward pass.
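
To illustrate the computation that the fused kernel performs, here is a naive, sequential reference implementation of the selective scan in NumPy (an illustrative sketch, not the authors' fused CUDA kernel; the shapes loosely follow the paper but omit the batch dimension, and the real kernel keeps $h$ in SRAM, uses a parallel scan, and never writes the intermediate states to HBM):

```python
import numpy as np

def selective_scan_reference(x, delta, A, B, C):
    """Naive sequential reference for the selective scan (illustrative only).

    x:     (L, D)  input sequence
    delta: (L, D)  input-dependent step sizes
    A:     (D, N)  diagonal state matrix per channel (time-invariant)
    B:     (L, N)  input-dependent B_t
    C:     (L, N)  input-dependent C_t
    Returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.empty((L, D))
    for t in range(L):
        # Discretize per time step (ZOH for A, simplified Euler-style rule for B)
        A_bar = np.exp(delta[t][:, None] * A)          # (D, N)
        B_bar = delta[t][:, None] * B[t][None, :]      # (D, N)
        h = A_bar * h + B_bar * x[t][:, None]          # (D, N), never leaves fast memory in the fused kernel
        y[t] = h @ C[t]                                # (D,)
    return y

# Tiny smoke test with arbitrary shapes (all names and values are illustrative).
rng = np.random.default_rng(0)
L, D, N = 8, 4, 3
y = selective_scan_reference(
    x=rng.standard_normal((L, D)),
    delta=np.abs(rng.standard_normal((L, D))) * 0.1,
    A=-np.abs(rng.standard_normal((D, N))),
    B=rng.standard_normal((L, N)),
    C=rng.standard_normal((L, N)),
)
print(y.shape)  # (8, 4)
```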

A Simplified SSM Architecture

Similar to other SSMs, selective SSMs are standalone sequence transformations that can be flexibly incorporated into neural networks. Previously, the most well-known SSM architecture was the H3 architecture, which consists of a block inspired by linear attention interleaved with an MLP block. This work simplifies that architecture by combining the two components into one.

The architecture expands the model dimension $D$ by a controllable expansion factor $E$. Most of the parameters ($3ED^2$) are in the linear projections, while the number of SSM parameters is much smaller in comparison.
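
As a rough, hypothetical example: with model dimension $D = 1024$ and $E = 2$, the linear projections of a single block account for about $3 \cdot 2 \cdot 1024^2 \approx 6.3\text{M}$ parameters, dwarfing the SSM parameters ($\Delta$, $B$, $C$ projections and the diagonal $A$), which scale with $D$ and $N$ rather than $D^2$.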

The Mamba block is repeated, interleaved with standard normalization and residual connections, to form the Mamba architecture. In the experiments, the authors fix $E=2$ and use two stacks of the block. The SiLU / Swish activation function and LayerNorm are chosen for the activation function and the optional normalization layer, respectively.

Properties of Selection Mechanisms

Connection to Gating Mechanism

An article helpful for understanding gating can be found here.

Note that the connection between RNN gating and the discretization of continuous-time systems is well established. Theorem 1 essentially states that an input-dependent gate can be viewed as the ZOH discretization.

Theorem 1. When $N=1, A=-1, B=1, s_\Delta = \text{Linear}(x), \tau_\Delta = \text{softplus}$, then the selective SSM recurrence takes the form

$$ \begin{aligned} g_t &= \sigma(\text{Linear}(x_t)) \\ h_t &= (1-g_t)h_{t-1}+g_t x_t \end{aligned} $$
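
A small numeric check of this correspondence (an illustrative sketch; the variable names are not from the paper): with $N=1$, $A=-1$, $B=1$, and ZOH discretization, each selective SSM step matches the gated recurrence exactly, because $1 - \exp(-\text{softplus}(z)) = \sigma(z)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

rng = np.random.default_rng(0)
z = rng.standard_normal(16)          # z_t = Linear(x_t), arbitrary pre-activations
x = rng.standard_normal(16)          # scalar inputs
h_ssm = h_gate = 0.0

for z_t, x_t in zip(z, x):
    # Selective SSM step with N=1, A=-1, B=1 and ZOH discretization
    delta = softplus(z_t)            # τ_Δ = softplus
    A_bar = np.exp(-delta)           # exp(ΔA) with A = -1
    B_bar = 1.0 - A_bar              # (ΔA)^{-1}(exp(ΔA) - I)·ΔB with A = -1, B = 1
    h_ssm = A_bar * h_ssm + B_bar * x_t

    # Gated recurrence from Theorem 1
    g = sigmoid(z_t)                 # g_t = σ(Linear(x_t))
    h_gate = (1.0 - g) * h_gate + g * x_t

    assert np.isclose(h_ssm, h_gate)  # 1 - exp(-softplus(z)) == σ(z)
```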

Interpretation of Selection Mechanisms

Variable Spacing Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest, which is exemplified by the Selective Copying task.

Filtering Context Many sequence models do not improve with longer context due to their inability to effectively ignore irrelevant context. On the other hand, selective models can simply reset their state at any time to remove extraneous history, improving the performance with longer context length.

Boundary Resetting When multiple independent sequences are stitched together, Transformers can keep them separate by instantiating a particular attention mask, whereas LTI models will bleed information between the sequences. Selective SSMs can simply reset their state at the boundaries.

Interpretation of $\Delta$ In general, $\Delta$ controls the balance between how much to focus on or ignore the current input: a large $\Delta$ resets the state and focuses on the current input, while a small $\Delta$ preserves the state and ignores it.

Interpretation of $A$ The selectivity in $\Delta$ is enough to ensure selectivity in $(\bar{A}, \bar{B})$. Making $A$ selective as well is hypothesized to yield similar performance, so it is left out for simplicity.

Interpretation of $B$ and $C$ Modifying $B$ and $C$ allows fine-grained control over whether to let an input $x_t$ into the state $h_t$ or the state $h_t$ into the output $y_t$. These can be interpreted as allowing the model to modulate the recurrent dynamics based on the content (input) and the context (hidden states), respectively.

Additional Information

Previously, most SSMs used complex numbers in the state $h$. However, it has been empirically observed that completely real-valued SSMs seem to work fine, and possibly better in some settings. The authors hypothesize that the complex-real tradeoff is related to the continuous-discrete spectrum of data modalities, where complex numbers are helpful for continuous modalities (e.g., audio, video) but not for discrete modalities (e.g., text, DNA).