🪴 Berwin Gan

❯

❯

Machine Learning 🤖

❯

❯

Luminal - Search-Based Deep Learning Compilers

Luminal - Search-Based Deep Learning Compilers

Jul 13, 20251 min read

compiler
search
kernels

ML Compiler

Radical Simplification Through Search

Matmuls
Adds
Muls
Few Elementwise Ops

11 Main

Unary

Exp2
Log2
Sin
Recip
Sqrt

Binary

Add
Mul
Mod
LessThan

Reduction

SumReduce
MaxReduce

Ideas

Going static
Compilers slow by default
Complex operation needs many simple operations
ML Compilers
1. Very Large Instruction Width
Kernel Search
Use Bend2 ?

Kernel Search

Represent models as graph
Convert graphs into egglog expressions
Use simple rewrite rules to transform the expression
Profile the discovered kernels and choose the fastest
Use MCTS to only profile a small subset.

Question

Tolerance for different kernels, some kernels have more multiplies and their floats becomes different numerically. How to maintain stability ? Do it symbolically ?
egraphs-good.github.io ?
Equality Saturation ?

Kernel Fusion

Most power is for moving data. Best to move stuff less = faster compute.

Naive vs Flash Attention Flash Attention 2 ? Flash Attention 3 ?

Post Processing

Buffer Reuse
Queueing Kernels

Pruning

Alpha Beta Pruning
MCTS
Other types ?
Use search to determine which basics ops to keep ?
Collective Ops ?

Graph View

Radical Simplification Through Search
11 Main
Unary
Binary
Reduction
Ideas
Kernel Search
Question
Kernel Fusion
Post Processing
Pruning

Created with Quartz v4.4.0 © 2025

GitHub
Discord Community