Nov 5, 2025

Tags: easydiffusion, sdkit, compilers

This post concludes (for now) my deep-dive into ML compilers, carried out while researching for sdkit v3. I’ve linked (at the end) to some of the papers I read on graph execution on GPUs.

Some final takeaways:

  1. ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
  2. A single compiler is unlikely to fit every scenario.
  3. The scheduler needs to be grounded in truth.
  4. Simulators might be worth exploring more.

ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)

It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets NVIDIA’s own GPUs. Once the machine code generated by cross-vendor ML compilers is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.
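
To make that concrete, here’s a minimal sketch of what using an ML compiler looks like from user code, via PyTorch’s torch.compile. Its default Inductor backend emits fused Triton kernels for the target GPU, and Triton runs on both NVIDIA (CUDA) and AMD (ROCm) cards, which is exactly the cross-vendor story above. The toy model below is a placeholder for a real diffusion UNet.

```python
import torch

# A stand-in for a real diffusion UNet; any nn.Module compiles the same way.
model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, kernel_size=3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, kernel_size=3, padding=1),
).eval().to("cuda")  # "cuda" also means an AMD GPU on the ROCm build of PyTorch

# torch.compile traces the model into a graph and hands it to a compiler
# backend (Inductor by default), which generates fused kernels for this GPU
# instead of dispatching one pre-built vendor kernel per op.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 4, 64, 64, device="cuda")
with torch.no_grad():
    out = compiled(x)  # the first call triggers compilation; later calls are fast
```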

Oct 10, 2025

Tags: easydiffusion, sdkit, compilers

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary

The current state is:

  1. Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, OpenVINO for Intel (see the sketch after this list).
  2. Cross-vendor compilers, e.g. TVM, IREE, XLA, are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.
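
For the vendor-specific path, here’s a rough sketch of building an engine from an ONNX export with the classic TensorRT Python API. (TensorRT-RTX is the newer RTX-focused variant with its own SDK; I’m showing plain TensorRT here because that’s the API I know, and the file names are placeholders.)

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# "unet.onnx" is a placeholder for a real exported model.
with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # half precision suits consumer RTX cards

# Builds (and auto-tunes) a serialized engine specific to this exact GPU.
engine = builder.build_serialized_network(network, config)
with open("unet.engine", "wb") as f:
    f.write(engine)
```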

Cross-vendor compilers seem to focus on either datacenter hardware or embedded devices; their performance on desktops and laptops is pretty poor. Mojo doesn’t target this category at all (and doesn’t support Windows). This is probably because datacenters and embedded devices are where the attention (and money) currently is.
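
For comparison, the cross-vendor path looks roughly like this with TVM’s Relay API (older TVM releases; newer TVM is moving to the Relax API). The file name, input name and shape are placeholders; the point is that one graph can be lowered to several targets by changing a string.

```python
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# "unet.onnx" and the input name/shape are placeholders for a real export.
onnx_model = onnx.load("unet.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"sample": (1, 4, 64, 64)})

# The same graph can target "cuda", "rocm", "vulkan", "metal" or "llvm" (CPU);
# this retargetability is the whole appeal of cross-vendor compilers.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("sample", tvm.nd.array(np.random.randn(1, 4, 64, 64).astype("float32"), dev))
module.run()
out = module.get_output(0).numpy()
```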