Muon
Muon = MomentUm Orthogonalized by Newton-Schulz. Applies to 2D (matrix) parameters, e.g. the weight matrices of hidden layers.
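def newtonschulz5(G, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2D tensor G with a quintic Newton-Schulz iteration.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # coefficients of the quintic polynomial
    X = G.bfloat16()
    # Normalize by the Frobenius norm so every singular value is at most 1.
    # Done out of place so G is not modified when it is already bfloat16.
    X = X / (X.norm() + eps)
    if G.size(0) > G.size(1):
        X = X.T  # work in the wide orientation so X @ X.T is the smaller product
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X  # X <- a*X + b*(X X^T) X + c*(X X^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T  # undo the transpose
    return X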
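Two vectors are orthogonal when their dot product is 0, i.e. they are at 90 degrees to each other. A matrix is (semi-)orthogonal when its rows or columns are orthonormal, which means all of its singular values equal 1. Orthogonalizing the momentum replaces it with the nearest such matrix: if the momentum has SVD U S V^T, the update becomes U V^T. Newton-Schulz only approximates this, so after a few steps the singular values are near 1 rather than exactly 1.
The effect is to remove the reinforcing component and give weight to directions that are genuinely different from the dominant ones. Without orthogonalization, if the momentum's singular values are [500, 0.0005], almost all of the update goes into the already-large first direction and the second is effectively ignored.
In short: don't let the momentum update be biased towards the directions that are already large. Instead, give the smaller, independent directions a comparable share of each step.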
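To make the effect concrete, here is a minimal sketch that checks the singular values before and after the iteration. It is a NumPy float64 port of the newtonschulz5 function above (not the bfloat16 PyTorch original), and the test matrix, its singular values [10, 5, 1, 0.5], and the name newton_schulz are assumptions chosen for illustration.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """NumPy port of the quintic Newton-Schulz iteration (illustrative only)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm, so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
# Build a hypothetical 4x6 "momentum" matrix with known singular values.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
S = np.array([10.0, 5.0, 1.0, 0.5])
M = U @ np.diag(S) @ V[:4, :]

before = np.linalg.svd(M, compute_uv=False)
after = np.linalg.svd(newton_schulz(M), compute_uv=False)
print("before:", before)  # spread over a factor of 20
print("after: ", after)   # all pulled close to 1
```

With this input the singular values should end up roughly between 0.7 and 1.2: not exactly 1, but close enough that no single direction dominates the update. A much wider initial spread (such as the [500, 0.0005] example) would need more than 5 steps for the smallest direction to catch up.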
Alternatives
- SVD
- Coupled Newton iteration
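For comparison, the SVD alternative can be sketched in a few lines of NumPy. This computes the exact orthogonalization U V^T that Newton-Schulz approximates; the function name orthogonalize_svd and the random test matrix are assumptions for illustration. The usual reason to prefer Newton-Schulz is cost: it needs only matrix multiplications, which run well in low precision on accelerators, whereas SVD is comparatively slow.

```python
import numpy as np

def orthogonalize_svd(G):
    """Return U @ Vh: G with every singular value replaced by 1."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))
O = orthogonalize_svd(G)

# The rows of O are orthonormal, so O @ O.T is the 4x4 identity up to rounding.
print(np.allclose(O @ O.T, np.eye(4)))
```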
MuonClip
QK-Clip: constrains attention scores by rescaling the Q/K (query/key projection) matrices. After each Muon step, check the maximum QK score and rescale the matrices if it exceeds a threshold.
QK scores: the raw attention scores computed before applying softmax in the attention mechanism. They determine how much each position in the sequence should "attend to" every other position.
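The check-and-rescale step can be sketched for a single attention head as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function names, the threshold value tau, the 1/sqrt(d) score scaling, and the choice to split the correction evenly as sqrt(gamma) on each matrix are all assumptions.

```python
import numpy as np

def max_qk_score(W_q, W_k, X):
    """Largest raw (pre-softmax) attention score for inputs X."""
    d = W_q.shape[1]
    Q, K = X @ W_q, X @ W_k
    return ((Q @ K.T) / np.sqrt(d)).max()

def qk_clip(W_q, W_k, X, tau):
    """If the max QK score exceeds tau, shrink W_q and W_k so it equals tau."""
    s_max = max_qk_score(W_q, W_k, X)
    if s_max > tau:
        gamma = tau / s_max
        # Each matrix gets sqrt(gamma), so the product Q @ K.T scales by gamma.
        W_q = W_q * np.sqrt(gamma)
        W_k = W_k * np.sqrt(gamma)
    return W_q, W_k

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))           # 8 positions, model width 16
W_q = 3.0 * rng.standard_normal((16, 16))  # deliberately large weights
W_k = 3.0 * rng.standard_normal((16, 16))
tau = 10.0                                 # hypothetical threshold

s_before = max_qk_score(W_q, W_k, X)
W_q, W_k = qk_clip(W_q, W_k, X, tau)
s_after = max_qk_score(W_q, W_k, X)
print(s_before, s_after)  # s_after is capped at tau
```

Because the scores are bilinear in W_q and W_k, scaling both by sqrt(gamma) scales every score by gamma, so the largest score lands exactly on the threshold while the relative ordering of scores is unchanged.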
Extra
- NanoGPT speedrun? CIFAR-10 speedrun?
- MuonClip by Kimi