ML Compiler
Radical Simplification Through Search
- Matmuls
- Adds
- Muls
- Few Elementwise Ops
11 Main
Unary
- Exp2
- Log2
- Sin
- Recip
- Sqrt
Binary
- Add
- Mul
- Mod
- LessThan
Reduction
- SumReduce
- MaxReduce
Ideas
- Going static
- Compilers slow by default
- Complex operation needs many simple operations
- ML Compilers
- Very Large Instruction Width
- Kernel Search
- Use Bend2 ?
Kernel Search
- Represent models as graph
- Convert graphs into egglog expressions
- Use simple rewrite rules to transform the expression
- Profile the discovered kernels and choose the fastest
- Use MCTS to only profile a small subset.
Question
- Tolerance for different kernels, some kernels have more multiplies and their floats becomes different numerically. How to maintain stability ? Do it symbolically ?
- egraphs-good.github.io ?
- Equality Saturation ?
Kernel Fusion
Most power is for moving data. Best to move stuff less = faster compute.
Naive vs Flash Attention Flash Attention 2 ? Flash Attention 3 ?
Post Processing
- Buffer Reuse
- Queueing Kernels
Pruning
- Alpha Beta Pruning
- MCTS
- Other types ?
- Use search to determine which basics ops to keep ?
- Collective Ops ?