Nov 13, 2025

PolyBlocks is another interesting ML compiler, built using MLIR. It comes out of a startup incubated at IISc Bangalore and run by Uday Bondhugula, who co-authored a paper on compiler optimizations for GPGPUs back in 2008 (17 years ago)!

Some of the compiler passes to keep in mind:

  • fusion
  • tiling
  • use hardware acceleration (like tensor cores)
  • constant folding
  • perform redundant computation to avoid global memory accesses where profitable
  • pack into buffers
  • loop transformation
  • unroll-and-jam (essentially register tiling; see the sketch after this list)
  • vectorization
  • reorder execution for better spatial, temporal, and group reuse
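
To make a couple of these concrete, here's a hand-written sketch (not PolyBlocks output; the tile size and the 2x2 unroll factor are arbitrary choices) of tiling plus unroll-and-jam applied to a plain matmul loop nest:

```
// Naive loop nest: C[i][j] += A[i][k] * B[k][j], row-major, C assumed initialized.
void matmul_naive(const float* A, const float* B, float* C, int N) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Tiled + unroll-and-jammed version (assumes N is a multiple of the tile and
// unroll factors, just to keep the sketch short).
void matmul_tiled_uaj(const float* A, const float* B, float* C, int N) {
  const int T = 32;  // tile size: keeps the working set cache-resident
  for (int ii = 0; ii < N; ii += T)
    for (int jj = 0; jj < N; jj += T)
      for (int kk = 0; kk < N; kk += T)
        // Unroll i and j by 2 and "jam" the bodies together so that four
        // accumulators stay in registers across the whole k loop
        // (this is why unroll-and-jam is often described as register tiling).
        for (int i = ii; i < ii + T; i += 2)
          for (int j = jj; j < jj + T; j += 2) {
            float c00 = C[i * N + j],       c01 = C[i * N + j + 1];
            float c10 = C[(i + 1) * N + j], c11 = C[(i + 1) * N + j + 1];
            for (int k = kk; k < kk + T; ++k) {
              float a0 = A[i * N + k], a1 = A[(i + 1) * N + k];
              float b0 = B[k * N + j], b1 = B[k * N + j + 1];
              c00 += a0 * b0;  c01 += a0 * b1;
              c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * N + j] = c00;       C[i * N + j + 1] = c01;
            C[(i + 1) * N + j] = c10; C[(i + 1) * N + j + 1] = c11;
          }
}
```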

Scheduling approaches:

  • greedy heuristics
  • ILP
  • dynamic programming
  • analytical cost models (toy sketch below)
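
These show up, for example, when deciding whether fusing a producer into its consumer is profitable. As a toy illustration (entirely my own, not how PolyBlocks does it), an analytical cost model could compare the global-memory time saved by not materializing the intermediate against the time spent on redundant computation:

```
// Toy analytical cost model for a producer -> consumer fusion decision.
// The structure and the decision rule are illustrative, not PolyBlocks'.
struct FusionCandidate {
  double intermediate_bytes;  // size of the producer's output tensor
  double recompute_flops;     // extra work if the producer is recomputed per use
};

struct Machine {
  double bytes_per_sec;  // sustained global-memory bandwidth
  double flops_per_sec;  // sustained compute throughput
};

// Fuse if the time saved by skipping the write + read-back of the
// intermediate through global memory exceeds the recompute time.
bool should_fuse(const FusionCandidate& c, const Machine& m) {
  double mem_time_saved = 2.0 * c.intermediate_bytes / m.bytes_per_sec;
  double recompute_time = c.recompute_flops / m.flops_per_sec;
  return mem_time_saved > recompute_time;
}
```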

For fusion, PolyBlocks uses a polyhedral slicing-based approach in MLIR's affine dialect passes. This seems to perform better than the simpler fusion done by XLA and TorchInductor. Need to read about this some more.
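
Whatever drives the decision, the payoff of fusion itself is the usual one. Here's a hand-written CUDA sketch (kernel names are mine) of two elementwise ops and their fused equivalent, which never materializes the intermediate tensor in global memory:

```
// Unfused: y = relu(x * s) takes two launches, and the intermediate t makes
// a full round trip through global memory.
__global__ void scale(const float* x, float* t, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) t[i] = x[i] * s;
}

__global__ void relu(const float* t, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = fmaxf(t[i], 0.0f);
}

// Fused: one launch, no intermediate buffer, roughly half the global memory traffic.
__global__ void scale_relu_fused(const float* x, float* y, float s, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = fmaxf(x[i] * s, 0.0f);
}
```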

Important optimizations for matrix multiplication kernels, to get really close to cuBLAS performance (a kernel sketch follows the list):

  • Shared Memory Tiling
  • Register Tiling
  • Padding (to avoid Shared Memory Bank conflicts)
  • Load/Store Vectorization (from global to shared memory)
  • Double buffering: fetch the data for the next loop iteration while computing on the current one
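
A minimal CUDA sketch of shared memory tiling with the +1 padding trick; register tiling, vectorized float4 loads, and double buffering are only marked in comments, since a full cuBLAS-class kernel layers all of them together:

```
#define TILE 32

// C = A * B for row-major N x N matrices; N is assumed to be a multiple of
// TILE just to keep the sketch short.
// Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
  // The +1 padding is the standard trick to dodge shared-memory bank
  // conflicts on column-wise tile accesses (harmless here, and needed once
  // transposed loads or register tiling are added).
  __shared__ float As[TILE][TILE + 1];
  __shared__ float Bs[TILE][TILE + 1];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;  // register tiling would keep a small 2D block of accumulators per thread

  for (int kk = 0; kk < N; kk += TILE) {
    // Cooperative, coalesced load of one tile of A and one tile of B.
    // A real kernel would use vectorized float4 loads here, and prefetch the
    // next tile while computing on the current one (double buffering).
    As[threadIdx.y][threadIdx.x] = A[row * N + kk + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(kk + threadIdx.y) * N + col];
    __syncthreads();

    for (int k = 0; k < TILE; ++k)
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
  }
  C[row * N + col] = acc;
}
```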

Some other random notes:

  • There are a fixed number of physical registers in each SM/multiprocessor, divided (logically) between the threads resident on that SM. So the number of threads that can run at a time is bounded by the total number of physical registers divided by the registers each thread needs (alongside other occupancy limits like shared memory). For example, an SM with 65,536 32-bit registers running a kernel that needs 64 registers per thread can keep at most 65,536 / 64 = 1,024 threads resident.
  • This means there's no one right answer for kernel fusion. Jobs dominated by slow memory transfers benefit from smaller, lighter kernels: lower register use per thread lets more threads stay resident in parallel, hiding the memory latency. Jobs that are light on memory transfers can instead afford heavier-but-fewer fused kernels.
  • User-facing API options: TensorRT-style AOT-compiled engine files, Torch/Mojo/PolyBlocks-style JIT compilers inside Python, or something in between (e.g. TensorRT-RTX).
  • For the host-side code (i.e. the code that talks to the driver), it might be a good idea to generate C++ that power users can compile themselves (see the sketch after this list). But that adds more hoops for the user to jump through, so it should probably be just an option.
  • Quantization hardware-awareness in the compiler is important, so that it can be factored into tiling and memory-layout decisions.
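
As a rough idea of what that generated host-side C++ could look like, here is a minimal CUDA Driver API sketch; the module file, kernel name, and sizes are placeholders, and error checking is omitted. A real generated file would also handle host-device copies, streams, and cleanup on failure.

```
#include <cuda.h>

int main() {
  cuInit(0);

  CUdevice dev;
  cuDeviceGet(&dev, 0);
  CUcontext ctx;
  cuCtxCreate(&ctx, 0, dev);

  // Load the compiled GPU code and look up the kernel entry point.
  CUmodule mod;
  cuModuleLoad(&mod, "kernels.cubin");        // placeholder file name
  CUfunction fn;
  cuModuleGetFunction(&fn, mod, "matmul_tiled");

  // Device buffers (sizes are illustrative).
  int N = 1024;
  size_t bytes = (size_t)N * N * sizeof(float);
  CUdeviceptr dA, dB, dC;
  cuMemAlloc(&dA, bytes);
  cuMemAlloc(&dB, bytes);
  cuMemAlloc(&dC, bytes);

  // One 32x32 thread block per 32x32 tile of C.
  void* args[] = { &dA, &dB, &dC, &N };
  cuLaunchKernel(fn,
                 N / 32, N / 32, 1,   // grid
                 32, 32, 1,           // block
                 0, nullptr,          // dynamic shared memory, stream
                 args, nullptr);
  cuCtxSynchronize();

  cuMemFree(dA); cuMemFree(dB); cuMemFree(dC);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}
```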