Nov 13, 2025
PolyBlocks is another interesting ML compiler, written using MLIR. It comes out of a startup incubated at IISc Bangalore, run by Uday Bondhugula, who co-authored a paper on compiler optimizations for GPGPUs back in 2008 (17 years ago)!
Some of the compiler passes to keep in mind (a toy tiling sketch follows this list):
- fusion
- tiling
- use hardware acceleration (like tensor cores)
- constant folding
- perform redundant computation to avoid global memory accesses where profitable
- pack into buffers
- loop transformation
- unroll-and-jam (register tiling?)
- vectorization
- reorder execution for better spatial, temporal and group reuse
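As a rough illustration of what tiling does (a toy NumPy sketch, not anything PolyBlocks actually emits; the tile size of 64 is arbitrary): the loops of a matmul are restructured into blocks, so each block of the inputs gets reused while it's still in fast memory (cache / shared memory / registers).

```python
import numpy as np

def matmul_tiled(A, B, tile=64):
    """Toy loop-tiled matmul: each (tile x tile) block of A and B is reused
    while it is still hot in fast memory, instead of streaming whole rows/columns."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # NumPy slicing handles edge tiles when the sizes aren't multiples of `tile`
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(matmul_tiled(A, B), A @ B, atol=1e-3)
```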
Scheduling approaches:
Nov 7, 2025
Tags: ml, compiler, onnx, ggml, sdkit, worklog
Wrote a simple script to convert ONNX to GGML. It auto-generates C++ code that calls the corresponding ggml functions (for each ONNX operator). This file can then be compiled and run like a normal C++ ggml program, and will produce the same results as the original model in PyTorch.
The generated file can work on multiple backends (CPU, CUDA, ROCm, Vulkan, Metal, etc.) by providing the correct compiler flags during cmake -B, e.g. -D GGML_CUDA=1 for CUDA.
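A minimal sketch of the kind of conversion the script does (not the actual script; the operator mapping and emit format here are simplified and partly made up for illustration), using the onnx Python package to walk the graph and print one ggml call per node:

```python
import onnx

# Hypothetical mapping from ONNX op types to ggml C++ call templates.
# The real script covers far more operators and also handles weights, shapes,
# name sanitization, and the transposition conventions of ggml_mul_mat.
OP_TEMPLATES = {
    "Add":    "ggml_add(ctx, {0}, {1})",
    "Mul":    "ggml_mul(ctx, {0}, {1})",
    "MatMul": "ggml_mul_mat(ctx, {0}, {1})",
    "Relu":   "ggml_relu(ctx, {0})",
}

def emit_cpp(onnx_path):
    model = onnx.load(onnx_path)
    lines = []
    for node in model.graph.node:
        template = OP_TEMPLATES.get(node.op_type)
        if template is None:
            raise NotImplementedError(f"unsupported op: {node.op_type}")
        call = template.format(*node.input)
        lines.append(f"struct ggml_tensor * {node.output[0]} = {call};")
    return "\n".join(lines)

print(emit_cpp("model.onnx"))  # hypothetical path
```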
Nov 5, 2025
Tags: easydiffusion, sdkit
Following up on the deep-dive into ML compilers:
sdkit v3 won’t use general-purpose ML compilers. They aren’t yet ready for sdkit’s target platforms, and need a lot of work (well beyond sdkit v3’s scope). But I’m quite certain that sdkit v4 will use them, and sdkit v3 will start making steps in that direction.
For sdkit v3, I see two possible paths:
- Use an array of vendor-specific compilers (like TensorRT-RTX, MiGraphX, OpenVINO, etc.), one for each target platform.
- Auto-generate ggml code from onnx (or pytorch), and beat it on the head until it meets sdkit v3’s performance goals. Hand-tune kernels, contribute to ggml, and take advantage of ggml’s multi-backend kernels.
Both approaches provide a big step-up from sdkit v2 in terms of install size and performance. So it makes sense to tap into these first, and leave ML compilers for v4 (as another leap forward).
Nov 5, 2025
Tags: easydiffusion, sdkit, compilers
This post concludes (for now) my ongoing deep-dive into ML compilers, while researching for sdkit v3. I’ve linked (at the end) to some of the papers that I read related to graph execution on GPUs.
Some final takeaways:
- ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
- A single compiler is unlikely to fit every scenario.
- The scheduler needs to be grounded in truth.
- Simulators might be worth exploring more.
ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)
It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.
Oct 27, 2025
A possible intuition for understanding GPU memory hierarchy (and the performance penalty for data transfer between various layers) is to think of it like a manufacturing logistics problem:
- CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is like the “headquarters”, and contains a mega-sized warehouse of parts (think football field sizes), also known as ‘Host memory’.
- Each GPU is like a different city, containing its own warehouse outside the city, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (CPU).
- Each SM/Core/Tile is a factory located in a different area of the city. Each factory contains a small warehouse (shed) for stockpiling whatever inventory it needs, also known as ‘Shared Memory’.
- Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. There’s a tray next to each machine, also known as ‘Registers’. This tray is used for keeping stuff temporarily for each stamping process.
This analogy can help in understanding the scale of (and the performance penalty for) data transfers between these layers.
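To put rough numbers on the travel times, here's a small sketch (assuming a CUDA-capable machine with PyTorch installed; the tensor size and iteration count are arbitrary) that times a host-to-device copy against an on-device elementwise op over the same tensor:

```python
import time
import torch

def timeit(fn, iters=20):
    # warm up once, then time; synchronize so queued GPU work is included
    fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x_cpu = torch.randn(4096, 4096)   # lives in the "headquarters" warehouse (host memory)
x_gpu = x_cpu.cuda()              # shipped once to the city warehouse (global memory)

copy_time = timeit(lambda: x_cpu.cuda())   # host -> device over PCIe (unpinned, for simplicity)
op_time = timeit(lambda: x_gpu * 2.0)      # stays entirely on-device

print(f"H2D copy:        {copy_time * 1e3:.2f} ms")
print(f"on-device x*2.0: {op_time * 1e3:.2f} ms")
```

On typical desktop GPUs, the PCIe copy tends to be several times slower than the on-device op, even though both touch the same number of bytes.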
Oct 24, 2025
Tags: mlir, easydiffusion, sdkit
Good post on using MLIR to compile ML models for GPUs. It gives a broad overview of GPU architecture, and how MLIR fits into that. The overall series looks pretty interesting too!
Making a note here for future reference - https://www.stephendiehl.com/posts/mlir_gpu/
Oct 22, 2025
Tags: easydiffusion, samplers, c++
Wrote a fresh implementation of most of the popular samplers and schedulers used for image generation (Stable Diffusion and Flux) at https://github.com/cmdr2/samplers.cpp. A few other schedulers (like Align Your Steps) have been left out for now, but are pretty easy to implement.
It’s still a work-in-progress, and not ready for public use. The algorithmic port has been completed, and the next step is to test the output values against reference values (from another implementation, e.g. Forge WebUI). After that, I’ll translate it to C++.
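For reference, the core of the simplest of these (an Euler sampler in the k-diffusion style) is only a few lines. A rough sketch of the idea, not the code from the repo:

```python
import torch

def sample_euler(denoise, x, sigmas):
    """Minimal Euler sampler (k-diffusion convention): `denoise(x, sigma)` returns
    the model's prediction of the clean image at noise level sigma, and `sigmas`
    is a decreasing noise schedule ending at 0."""
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        denoised = denoise(x, sigma)
        d = (x - denoised) / sigma          # derivative dx/dsigma
        x = x + d * (sigma_next - sigma)    # one Euler step toward sigma_next
    return x
```

Most of the other samplers are variations on this loop: higher-order steps (Heun, the DPM family), ancestral noise injection, or different sigma schedules.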
Oct 10, 2025
Tags: easydiffusion, sdkit, compilers
Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.
tl;dr summary
The current state is:
- Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, OpenVINO for Intel.
- Cross-vendor compilers (e.g. TVM, IREE, XLA) are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.
The focus of cross-vendor compilers seems to be either datacenter hardware or embedded devices; performance on desktops and laptops is pretty poor. Mojo doesn’t target this category either (and doesn’t support Windows). That’s probably because datacenters and embedded devices are where the attention (and money) currently is.
Oct 10, 2025
Tags: easydiffusion, sdkit, engine
The design constraints for Easy Diffusion’s next engine (i.e. sdkit v3) are:
- Lean: Install size of < 200 MB uncompressed (excluding models).
- Fast: Performance within 10% of the best-possible speed on that GPU for that model.
- Capable: Supports Stable Diffusion 1.x, 2.x, 3.x, XL, Flux, Chroma, ControlNet, LoRA, Embeddings, VAE. Supports loading custom model weights (from civitai etc.), and memory offloading (for smaller GPUs).
- Targets: Desktops and Laptops, Windows/Linux/Mac, NVIDIA/AMD/Intel/Apple.
I think it’s possible, using ML compilers like TensorRT-RTX (and similar compilers for other platforms). See: Some notes on ML compilers.
Sep 1, 2025
Tags: easydiffusion, admin, worklog
Cleared the backlog of stale issues on ED’s GitHub repo. This brought down the number of open issues from ~350 to 74.
A number of those suggestions and issues are already being tracked on my task board. The others had either been fixed already, or were too old to be worth replying to.
While I’d genuinely have liked to resolve all of those issues, I was on a break from this project for nearly 1.5 years, so unfortunately it is what it is.
Aug 25, 2025
Tags: tensorRT, torch, easydiffusion, ggml, cuda, vulkan
Experimented with TensorRT-RTX (a new library offered by NVIDIA).
The first step was a tiny toy model, just to get the build and test setup working.
The reference model in PyTorch:
```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)
```

I ran this on an NVIDIA 4060 8 GB (Laptop) for 10K iterations, on Windows and WSL-with-Ubuntu, with float32 data.
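The timing loop was roughly of this shape (a minimal sketch, not the exact benchmark script; the input shape and warmup count are assumptions):

```python
import time
import torch

model = TinyCNN().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")  # assumed input shape

with torch.no_grad():
    for _ in range(10):               # warmup, so CUDA init isn't counted
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(10_000):           # 10K iterations, as in the test above
        model(x)
    torch.cuda.synchronize()          # wait for queued kernels before stopping the clock
    elapsed = time.perf_counter() - start

print(f"{elapsed / 10_000 * 1e3:.3f} ms per iteration")
```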
Jun 17, 2025
Tags: easydiffusion, blog
Development update for Easy Diffusion - It’s chugging along in starts and stops. Broadly, there are three tracks:
- Maintenance: The past few months have seen increased support for AMD, Intel and integrated GPUs. This includes AMD on Windows. Added support for the new AMD 9060/9070 cards last week, and the new NVIDIA 50xx cards in March.
- Flux to the main branch / release v3.5 to stable: Right now, Flux / v3.5 still requires you to enable ED beta first, and then install Forge. Last week I got Flux working in our main engine (with decent rendering speed). It still needs more work to support all the different model formats for Flux. Using Forge was a temporary arrangement, until Flux worked in our main engine.
Mar 4, 2025
Tags: easydiffusion
Upgraded the default Python version of Easy Diffusion to 3.9. Newer versions of torch don’t support Python 3.8, so this became urgent after the release of NVIDIA’s 50xx series GPUs.
I chose 3.9 as a temporary fix (instead of a newer Python version), since it had the fewest package conflicts. The future direction of Easy Diffusion’s backend is unclear right now - there are a bunch of possible paths. So I didn’t want to spend too much time on this. I also wanted to minimize the risk to existing users.
Feb 10, 2025
Tags: easydiffusion, sdkit, amd, torchruntime, windows, intel, integrated, directml
Easy Diffusion (and sdkit) now also support AMD on Windows automatically (using DirectML), thanks to integrating with torchruntime. It also supports integrated GPUs (Intel and AMD) on Windows, making Easy Diffusion faster on PCs without dedicated graphics cards.
Feb 10, 2025
Tags: easydiffusion, torchruntime, sdkit
Spent the last week or two getting torchruntime fully integrated into Easy Diffusion, and making sure that it handles all the edge-cases.
Easy Diffusion now uses torchruntime to automatically install the best-possible version of torch (on the users’ computer) and support a wider variety of GPUs (as well as older GPUs). And it uses a GPU-agnostic device API, so Easy Diffusion will automatically support additional GPUs when they are supported by torchruntime.