tokens that consistently absorb a lot of attention from other tokens, without sending much attention out.

Paper

  1. Efficient Streaming Language Models with Attention Sinks

How to run inference beyond the hardware limit?

  1. shift the entire window forward
    • invalidating the KV cache
    • keep the cache anyway?
    • window attention doesn't work? (baselines sketched below)
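A minimal sketch of the baseline cache strategies these questions are about, assuming a generic decoder with a fixed cache budget; the function names and the `forward_fn` helper are hypothetical, not from the paper's code:

```python
# Hypothetical sketch of the baseline strategies; `forward_fn` recomputes a
# fresh KV cache for whatever tokens it is given.

def dense_attention(cache, new_kv):
    # keep every KV entry: quadratic cost, and quality degrades once the
    # sequence grows past the pretraining length
    return cache + [new_kv]

def window_attention(cache, new_kv, window):
    # keep only the most recent `window` entries: cheap, but the first
    # tokens eventually get evicted and quality collapses
    return (cache + [new_kv])[-window:]

def sliding_window_with_recomputation(tokens, window, forward_fn):
    # rebuild the cache from scratch over the last `window` tokens at every
    # step: stable, but very expensive
    return forward_fn(tokens[-window:])
```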

The Problem

  • The models learn to allocate the extra attention they don't need in the softmax to the first few tokens?
  • Overmixing?
  • The higher the layers, the more attention is thrown towards the front (see the sketch after this list).
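A rough way to check this on a real model (my own sketch, not from the paper): measure how much attention mass each layer puts on the very first token, using the Hugging Face transformers library and the small gpt2 checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape [batch, heads, query, key]
for layer, attn in enumerate(out.attentions):
    # average attention that queries (excluding the first, which can only
    # attend to itself) put on key position 0
    mass_on_first = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on token 0 = {mass_on_first:.3f}")
```

If the note above holds, the printed mass should climb well past the uniform 1/seq_len baseline in the deeper layers.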

The Answer

  • sliding window with re-computation (high compute)
  • StreamingLLM
    • as we go up the layers, more and more attention is allocated to the first / earlier tokens
    • so just keep the attention sinks when sliding the window (see the figure and the sketch below)
Figure 4: The KV cache of StreamingLLM. The attention-sink tokens at the front are always kept, the rolling KV cache (sliding window) holds the most recent tokens, and the oldest non-sink tokens are evicted as each new token is generated.
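Below is a minimal sketch of the eviction policy the figure illustrates (my own code, not the official streaming-llm repo): always keep the first `n_sink` KV entries plus a rolling window of the most recent ones. The paper keeps 4 initial tokens as sinks.

```python
def evict(kv_cache, n_sink=4, window=4):
    """kv_cache: list of per-token KV entries, oldest first."""
    if len(kv_cache) <= n_sink + window:
        return kv_cache                  # budget not exceeded, keep everything
    sinks  = kv_cache[:n_sink]           # attention sinks, kept forever
    recent = kv_cache[-window:]          # rolling window of recent tokens
    return sinks + recent

# With n_sink=4 and window=4: after generating token 9 the cache holds
# tokens [0, 1, 2, 3] + [6, 7, 8, 9]; tokens 4 and 5 have been evicted.
print(evict(list(range(10))))            # [0, 1, 2, 3, 6, 7, 8, 9]
```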

What's so special about the attention sinks?

Just the fact that they are at the beginning. Being in front naturally makes them the stabilizers that other tokens throw their excess attention at.

  • Create a special dedicated learnable sink token at the start that is present in every training sample (a minimal sketch follows).
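A minimal sketch of that idea, assuming a standard PyTorch embedding pipeline; the module name SinkPrepender and the d_model argument are my own, not from the paper's code:

```python
import torch
import torch.nn as nn

class SinkPrepender(nn.Module):
    """Prepends one dedicated, trainable 'sink' embedding to every sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.sink = nn.Parameter(torch.zeros(1, 1, d_model))  # learned sink token

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: [batch, seq, d_model] -> [batch, 1 + seq, d_model]
        batch = token_embeds.size(0)
        return torch.cat([self.sink.expand(batch, -1, -1), token_embeds], dim=1)
```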

The positions used are positions within the cache (relative), not positions in the original text (absolute). Assume tokens 4 and 5 have been evicted and the model has to predict the next token: it acts as if nothing was evicted, so the kept tokens are re-indexed (6→4, 7→5, 8→6) and the cache is simply concatenated, meaning the jump from token 3 to the next cached token is 3→6 in absolute positions but only 3→4 in the positions the model actually sees.
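In code the rule is just "number the cache slots, not the original tokens" (tiny illustration, my own):

```python
def cache_positions(kept_token_indices):
    # positions are assigned by slot in the cache, not by original index
    return list(range(len(kept_token_indices)))

# tokens 4 and 5 evicted: 6 -> position 4, 7 -> 5, 8 -> 6
print(cache_positions([0, 1, 2, 3, 6, 7, 8]))  # [0, 1, 2, 3, 4, 5, 6]
```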

Can this be done when attention looks both ways, i.e. with bidirectional attention?

  1. BigBird / BERT?

Intuition

  1. Transformers need somewhere "safe" to put attention
    • an evicted token can hold the majority of the attention weight, causing an unstable shift when the window is moved → spiking perplexity (see the demo after this list)
  2. Attention anchors?
    • optimizer / weight stabilizer
  3. Prevent overmixing: later tokens don't have to mix too many concepts together and average the meaning of the useful tokens out to 0.
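A small numerical demo of intuition 1, with made-up logits: softmax weights must sum to 1, so when the token that was soaking up most of the mass is evicted, the remaining weights shift drastically after renormalization.

```python
import torch

logits = torch.tensor([6.0, 1.0, 1.5, 0.5])   # token 0 acts as the sink
full    = torch.softmax(logits, dim=-1)
evicted = torch.softmax(logits[1:], dim=-1)    # evict token 0, renormalize

print(full)     # token 0 absorbs ~98% of the mass, the rest stay near zero
print(evicted)  # the small weights blow up to roughly 0.31 / 0.51 / 0.19
```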