Nov 5, 2025
This post concludes (for now) my ongoing deep-dive into ML compilers while researching for sdkit v3. I’ve linked (at the end) to some of the papers I read on graph execution on GPUs.
Some final takeaways:
- ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
- A single compiler is unlikely to fit every scenario.
- The scheduler needs to be grounded in truth.
- Simulators might be worth exploring more.
ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)
It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.
And thankfully, this will also finally make AMD’s consumer GPUs more accessible to developers (by codifying the immense tribal knowledge about running various ROCm versions on those GPUs).
Hand-written kernels could go the way of hand-written assembly code. This was always going to happen eventually, but I think it’s pretty close now.
General-purpose ML compilers are still far from good, but the infrastructure and know-how are finally coming together. I don’t see anything fundamentally blocking that from happening (other than lots of hard work). The good news is that the recent widespread use of ML models (on all kinds of devices) and the combinatorial explosion of operator × data-type × hardware combinations will naturally force ML compilers to become good.
A single compiler is unlikely to fit every scenario
Compiling a graph automatically into executable GPU code is going to be an entire field in itself, with sub-fields specializing in particular aspects.
Reading more papers on GPU scheduling (listed below) further reinforced the view that this is a manufacturing logistics problem (where you need to figure out the best plan for coordinating your manufacturing operations across cities and factories). There’s ample scope for specialization.
There are a lot of factors to consider:
- There can be many possible ‘overall’ goals: latency, throughput, energy. These ‘overall goals’ might even change dynamically, e.g. when a laptop switches from charging to battery, or when a datacenter runs some models at maximum performance while running others for higher throughput or energy efficiency (see the sketch after this list).
- The operating architecture (and constraints) change significantly from multi-rack GPU clusters, to laptop GPUs, to SoCs on phones.
- We also have a diverse set of accelerator types. GPUs are very different architecturally from TPUs, which are very different from Tenstorrent devices. So the assumptions made by the compiler will vary significantly based on the accelerator type.
- This is an early field. There will be newer algorithms and papers for graph optimizations, scheduling (ML models vs classic optimization), code generation, etc.
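As a toy illustration of the first point (all plan names and numbers below are made up), the same candidate execution plans rank differently depending on which ‘overall’ goal is currently active:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    latency_ms: float       # time for one request
    throughput_qps: float   # requests served per second
    energy_j: float         # energy per request

def rank(candidates, goal):
    # Lower is better for latency and energy; higher is better for throughput.
    keys = {
        "latency": lambda c: c.latency_ms,
        "energy": lambda c: c.energy_j,
        "throughput": lambda c: -c.throughput_qps,
    }
    return sorted(candidates, key=keys[goal])

plans = [
    Candidate("fuse-aggressively", latency_ms=3.1, throughput_qps=900, energy_j=0.8),
    Candidate("batch-heavily", latency_ms=9.5, throughput_qps=2400, energy_j=0.5),
]

for goal in ("latency", "throughput", "energy"):
    print(goal, "->", rank(plans, goal)[0].name)
```

A scheduler that can re-rank plans like this when the goal flips (charging vs battery, latency vs throughput tier) is doing something quite different from one that compiles once for a single fixed objective.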
So it’s unlikely that a single compiler can fit every problem. While there are a lot of shared techniques, each application will need to take a “whole-picture” view of their operating constraints while compiling the graph into machine code.
That’s why the idea of MLIR is pretty useful, where different operating constraints can be baked into different ‘dialects’, producing different domain-specific compilers. But it’s still very early days.
The scheduler needs to be grounded in truth
It’s also clear that the actual performance numbers from the GPUs matter when building the cost model. For example (a small measurement sketch follows this list):
- the memory bandwidth and transfer speed between different levels of the memory hierarchy.
- the actual raw performance of compute units.
- the kernel dispatch time.
- the actual execution time of generated kernels.
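Here’s a minimal sketch of grounding a few of these numbers in measurements from the actual device, using PyTorch’s CUDA event timers. The specific probes and tensor sizes are my own assumptions for illustration, not a prescribed methodology:

```python
import torch

def time_cuda(fn, warmup=10, iters=100):
    # Warm up first (lazy init, autotune caches, etc.), then time with CUDA events.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
tiny = torch.ones(1, device="cuda")

matmul_ms = time_cuda(lambda: a @ b)            # actual execution time of a real kernel
copy_ms = time_cuda(lambda: a.clone())          # rough probe of global memory bandwidth
launch_ms = time_cuda(lambda: tiny + 1)         # ~kernel dispatch/launch overhead

bytes_moved = 2 * a.numel() * a.element_size()  # the clone reads + writes the tensor
print(f"matmul: {matmul_ms:.3f} ms/call")
print(f"bandwidth ≈ {bytes_moved / (copy_ms * 1e-3) / 1e9:.0f} GB/s")
print(f"dispatch ≈ {launch_ms * 1e3:.0f} µs")
```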
An incorrect or simplistic cost model of the GPU can result in inefficient execution plans. For example, we can end up with unexpected bottlenecks if the generated kernels take longer (or shorter) than their predicted execution time. Driver updates can also be a source of surprises, e.g. changes in cache eviction policies, warp scheduling algorithms, etc.
Taking the manufacturing logistics analogy, you need to be aware of what’s actually happening in the factories, regardless of the ‘ideal’ manufacturing plan.
This is well understood - that’s why autotuners are used widely. But a bunch of papers downplay or ignore these factors, presumably because they’re looking at just one slice of the overall problem (e.g. graph optimization). So it is important not to optimize a graph while ignoring the hardware, e.g. by fusing operations into kernels that are too big or too small for a particular device (in the context of the overall graph).
Simulators might be worth exploring more
A GPU simulator would model the compute cores, memory hierarchy sizes and performance, memory transfer dynamics, etc. for each given GPU model. While it wouldn’t give cycle-accurate performance predictions, it could help find a decent first-approximation execution plan. This plan can then be tuned just-in-time using autotuners.
It may not even need to actually run the code, as long as it produces correct tensor shapes, because the goal is to model the performance characteristics, not to emulate a GPU in software.
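As a sketch of what I mean, here’s a toy roofline-style estimator that predicts a kernel’s time from its FLOPs, bytes moved, and a fixed launch overhead - enough to compare candidate plans without running any code. The device numbers are illustrative placeholders, not real specs:

```python
from dataclasses import dataclass

@dataclass
class DeviceModel:
    name: str
    peak_tflops: float         # peak compute, TFLOP/s
    mem_bandwidth_gbps: float  # global memory bandwidth, GB/s
    launch_overhead_us: float  # per-kernel dispatch overhead, µs

def estimate_kernel_us(dev, flops, bytes_moved):
    compute_us = flops / (dev.peak_tflops * 1e12) * 1e6
    memory_us = bytes_moved / (dev.mem_bandwidth_gbps * 1e9) * 1e6
    # The kernel is bound by whichever resource it saturates, plus launch cost.
    return max(compute_us, memory_us) + dev.launch_overhead_us

def estimate_plan_us(dev, kernels):
    # A "plan" here is just a sequence of (flops, bytes_moved) pairs.
    return sum(estimate_kernel_us(dev, f, b) for f, b in kernels)

laptop_gpu = DeviceModel("laptop-gpu", peak_tflops=10,
                         mem_bandwidth_gbps=300, launch_overhead_us=8)

# Compare one fused kernel vs. two smaller kernels for the same work.
fused = [(2e9, 4e7)]
split = [(1e9, 3e7), (1e9, 3e7)]
print("fused:", estimate_plan_us(laptop_gpu, fused), "µs")
print("split:", estimate_plan_us(laptop_gpu, split), "µs")
```

Even a crude model like this ranks fusion decisions differently on a device with high launch overhead than on one with scarce memory bandwidth, which is exactly the kind of first approximation an autotuner can then refine.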
A simulator (with a library of lots of popular GPUs) might also help train ML schedulers by simulating a diverse range of operating constraints. Like a “GPU Dojo”.
There have been attempts like GPGPU-Sim, Accel-Sim, and MGPUSIM, and it is a hard problem. Again, this isn’t a new idea. CPU compilers have hand-written models of various hardware targets.
A simulator will never be perfect - ask any engineer in Formula 1. But I don’t think any Formula 1 team today would get rid of their simulation software just because they can’t model reality to 100% accuracy.
Random ideas
Random ideas that may or may not be viable:
- Use “Kernel Splitting” instead of “Kernel Fusion”: a bottom-up approach of “inlining” everything first, and then splitting up the overall task into new kernels that suit the operating constraints.
- Work directly with PyTorch modules (instead of “lower-level” graphs like ONNX). This gives the compiler access to richer information, e.g. higher-level grouping of operators, looping over blocks etc., instead of working directly with a flattened graph of operators. Mojo and torch.compile benefit from this knowledge (see the sketch after this list).
- Functional Programming approaches can help in terms of thinking about graph optimizations.
- Cache Oblivious algorithms might not map well to GPUs, because kernel launch overhead is very high (compared to the overhead of recursion on CPUs). But there might be some ideas worth borrowing.
- Load the main weights of the model (into CPU memory and GPU Global Memory) in the desired layout, based on the tiling used etc., instead of converting the layout at runtime while launching the kernel.
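For the PyTorch-modules idea above, here’s a minimal sketch of module-level compilation with torch.compile (the toy model is hypothetical). The compiler gets to see the module structure, rather than a pre-flattened operator graph:

```python
import torch
import torch.nn as nn

class SmallBlock(nn.Module):  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))

    def forward(self, x):
        # The compiler sees the whole forward() as a unit, including the residual add.
        return x + self.net(x)

model = SmallBlock()
compiled = torch.compile(model)       # wraps the module; compiles its forward lazily
out = compiled(torch.randn(8, 256))   # first call triggers capture + code generation
```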