tokens that consistently absorb a lot of attention from other tokens, without sending much attention out.

Paper

  1. Efficient Streaming Language Models with Attention Sinks

How to run inference beyond the hardware limit?

  1. shift the entire window forward
    • invalidating the KV cache
    • keep the cache anyway?
    • window attention doesn't work? (baselines sketched below)
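A minimal sketch of the baseline cache strategies these questions are about, assuming a generic decoder with a fixed cache budget; the function names and the `forward_fn` helper are hypothetical, not from the paper's code:

```python
# Hypothetical sketch of the baseline strategies; `forward_fn` recomputes a
# fresh KV cache for whatever tokens it is given.

def dense_attention(cache, new_kv):
    # keep every KV entry: quadratic cost, and quality degrades once the
    # sequence grows past the pretraining length
    return cache + [new_kv]

def window_attention(cache, new_kv, window):
    # keep only the most recent `window` entries: cheap, but the first
    # tokens eventually get evicted and quality collapses
    return (cache + [new_kv])[-window:]

def sliding_window_with_recomputation(tokens, window, forward_fn):
    # rebuild the cache from scratch over the last `window` tokens at every
    # step: stable, but very expensive
    return forward_fn(tokens[-window:])
```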

The Problem

  • The models learn to allocate the extra attention they don't need in the softmax to the first few tokens?
  • Overmixing?
  • The higher the layers, the more attention is thrown towards the front (see the sketch after this list).
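A rough way to check this on a real model (my own sketch, not from the paper): measure how much attention mass each layer puts on the very first token, using the Hugging Face transformers library and the small gpt2 checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape [batch, heads, query, key]
for layer, attn in enumerate(out.attentions):
    # average attention that queries (excluding the first, which can only
    # attend to itself) put on key position 0
    mass_on_first = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on token 0 = {mass_on_first:.3f}")
```

If the note above holds, the printed mass should climb well past the uniform 1/seq_len baseline in the deeper layers.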

The Answer

  • sliding window with re-computation (high compute)
  • StreamingLLM
    • as we go up the layers, more and more attention is allocated to the first / earlier tokens
    • so just keep the attention sinks when sliding the window (see the figure and the sketch below)
Figure 4: The KV cache of StreamingLLM. The attention-sink tokens at the front are always kept, the rolling KV cache (sliding window) holds the most recent tokens, and the oldest non-sink tokens are evicted as each new token is generated.
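Below is a minimal sketch of the eviction policy the figure illustrates (my own code, not the official streaming-llm repo): always keep the first `n_sink` KV entries plus a rolling window of the most recent ones. The paper keeps 4 initial tokens as sinks.

```python
def evict(kv_cache, n_sink=4, window=4):
    """kv_cache: list of per-token KV entries, oldest first."""
    if len(kv_cache) <= n_sink + window:
        return kv_cache                  # budget not exceeded, keep everything
    sinks  = kv_cache[:n_sink]           # attention sinks, kept forever
    recent = kv_cache[-window:]          # rolling window of recent tokens
    return sinks + recent

# With n_sink=4 and window=4: after generating token 9 the cache holds
# tokens [0, 1, 2, 3] + [6, 7, 8, 9]; tokens 4 and 5 have been evicted.
print(evict(list(range(10))))            # [0, 1, 2, 3, 6, 7, 8, 9]
```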

What's so special about the attention sinks?

Just the fact that they are at the beginning. Being in front naturally makes them the stabilizers that other tokens throw their excess attention at.

  • Create a special dedicated learnable sink token at the start that is present in every training sample (a minimal sketch follows).
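A minimal sketch of that idea, assuming a standard PyTorch embedding pipeline; the module name SinkPrepender and the d_model argument are my own, not from the paper's code:

```python
import torch
import torch.nn as nn

class SinkPrepender(nn.Module):
    """Prepends one dedicated, trainable 'sink' embedding to every sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.sink = nn.Parameter(torch.zeros(1, 1, d_model))  # learned sink token

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: [batch, seq, d_model] -> [batch, 1 + seq, d_model]
        batch = token_embeds.size(0)
        return torch.cat([self.sink.expand(batch, -1, -1), token_embeds], dim=1)
```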

The positions used are positions within the cache (relative), not positions in the original text (absolute). Assume tokens 4 and 5 have been evicted and the model has to predict the next token: it acts as if nothing was evicted, so the kept tokens are re-indexed (6→4, 7→5, 8→6) and the cache is simply concatenated, meaning the jump from token 3 to the next cached token is 3→6 in absolute positions but only 3→4 in the positions the model actually sees.
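In code the rule is just "number the cache slots, not the original tokens" (tiny illustration, my own):

```python
def cache_positions(kept_token_indices):
    # positions are assigned by slot in the cache, not by original index
    return list(range(len(kept_token_indices)))

# tokens 4 and 5 evicted: 6 -> position 4, 7 -> 5, 8 -> 6
print(cache_positions([0, 1, 2, 3, 6, 7, 8]))  # [0, 1, 2, 3, 4, 5, 6]
```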

Can this be done when attention looks both ways, i.e. with bidirectional attention?

  1. BigBird / BERT?

Intuition

  1. Transformers need somewhere "safe" to put attention
    • an evicted token can hold the majority of the attention weight, causing an unstable shift when the window is moved → spiking perplexity (see the demo after this list)
  2. Attention anchors?
    • optimizer / weight stabilizer
  3. Prevent overmixing: later tokens don't have to mix too many concepts together and average the meaning of the useful tokens out to 0.
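A small numerical demo of intuition 1, with made-up logits: softmax weights must sum to 1, so when the token that was soaking up most of the mass is evicted, the remaining weights shift drastically after renormalization.

```python
import torch

logits = torch.tensor([6.0, 1.0, 1.5, 0.5])   # token 0 acts as the sink
full    = torch.softmax(logits, dim=-1)
evicted = torch.softmax(logits[1:], dim=-1)    # evict token 0, renormalize

print(full)     # token 0 absorbs ~98% of the mass, the rest stay near zero
print(evicted)  # the small weights blow up to roughly 0.31 / 0.51 / 0.19
```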