
Jan 8, 2026

Tags: easydiffusion, sdkit, worklog

For Z-Image, the stock version of chromaForge performs worse than sd.cpp, mainly because chromaForge can’t run the smaller gguf-quantized models that sd.cpp can (chromaForge fails with the errors I was fixing yesterday).

If I really want to push through with this, it would be good to fix the remaining gguf issues in chromaForge. Only then can the performance be compared properly (in order to decide whether to release this in ED 3.5). I want to compare the performance of the smaller gguf models, because that’s what ED’s users will typically run.

Jan 7, 2026

Tags: easydiffusion, sdkit, worklog

Worked on fixing Z-Image support in ED’s fork of chromaForge (a fork of Forge WebUI). Fixed a number of integration issues. It’s now crashing on a matrix multiplication error, which looks like an incorrectly transposed matrix (most likely from reading the weights in the wrong order).
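
For intuition, here’s a toy numpy sketch (not chromaForge’s actual code) of the classic way a wrong weight-loading order shows up as a matmul error:

```python
import numpy as np

# A linear layer computes y = x @ W, with x shaped [batch, in_features]
# and W shaped [in_features, out_features].
x = np.random.randn(1, 768)

# But checkpoints often store linear weights as [out_features, in_features]
# (the PyTorch convention). If the loader forgets to transpose:
W = np.random.randn(3072, 768)  # stored as [out, in]

try:
    y = x @ W  # (1, 768) @ (3072, 768) -> shape mismatch
except ValueError as e:
    print("matmul error:", e)

y = x @ W.T  # (1, 768) @ (768, 3072) -> (1, 3072), as intended
print(y.shape)
```

The nastier variant is when the two dimensions happen to be equal: the shapes line up, and you silently get garbage outputs instead of a crash.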

I’ll try to install a stock version of chromaForge to see its raw performance with Z-Image (and whether it’s worth pursuing the integration), and also use it to help investigate the matrix multiplication error (and any future errors).

Dec 25, 2025

Tags: worklog, easydiffusion

Collecting the worklog over the past few weeks.

  • Enabled Flash-Attention and CPU offloading by default in sdkit3 (i.e. Easy Diffusion v4).
  • Added optional VAE tiling (and VAE tile size configuration) via config.yaml in Easy Diffusion v4.
  • Created Easy Diffusion’s fork of Forge WebUI, in order to apply the patches required to run with ED, and to try adding new features like Z-Image (which are missing from the seemingly-abandoned main Forge repo).
  • Improved the heuristics used for killing and restarting the backend child process, since /ping requests are unreliable when the backend is under heavy load (see the sketch after this list).
  • Merged a few PRs (1 2) for torchruntime that improve support for pinning pre-cu128 torch versions and fix the order of detection of DirectML and CUDA (prefers CUDA).
  • Added progress bars when downloading v4 backend artifacts.
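
A minimal sketch of that restart heuristic (illustrative only, not ED’s actual code; the URL, thresholds, and the backend_is_busy/restart_backend hooks are hypothetical):

```python
import time
import requests

PING_URL = "http://localhost:7860/ping"  # hypothetical health endpoint
PING_INTERVAL = 2    # seconds between health checks
MAX_MISSED = 5       # consecutive misses before a restart is considered
BUSY_GRACE = 120     # extra patience (seconds) while a job is running

def ping_ok() -> bool:
    try:
        return requests.get(PING_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def supervise(backend_is_busy, restart_backend):
    missed, last_ok = 0, time.monotonic()
    while True:
        time.sleep(PING_INTERVAL)
        if ping_ok():
            missed, last_ok = 0, time.monotonic()
            continue
        missed += 1
        # /ping is unreliable under heavy load, so a busy backend gets a
        # much longer grace period before being declared dead.
        grace = BUSY_GRACE if backend_is_busy() else MAX_MISSED * PING_INTERVAL
        if missed >= MAX_MISSED and (time.monotonic() - last_ok) > grace:
            restart_backend()
            missed, last_ok = 0, time.monotonic()
```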

Dec 8, 2025

Tags: sdkit, easydiffusion

The new engine that’ll power Easy Diffusion’s upcoming v4 release (i.e. sdkit3) has now been integrated into Easy Diffusion. It’s available for testing by selecting the v4 engine in the Settings tab (after enabling Beta). Please press Save and restart Easy Diffusion after selecting this.

It uses stable-diffusion.cpp and ggml under the hood, and produces optimized, lightweight builds for the target hardware.

The main benefits of Easy Diffusion’s new engine are:

  1. Very lightweight - Less than 100 MB install footprint, compared to 3 GB+ for Forge and other PyTorch-based engines.
  2. Much better for AMD/Intel/Integrated users - avoids the hot mess of ROCm and DirectML, by using a reliable Vulkan backend (that’s also used in llama.cpp).
  3. Opportunity for even faster image generation in the future - this currently uses stock sd.cpp, which has room for further optimization.
  4. Support for older GPUs - Vulkan supports older GPUs, especially older AMD GPUs unsupported by ROCm/PyTorch.


Nov 27, 2025

Tags: sdkit, v3

Managed to get stable-diffusion.cpp integrated into sdkit v3 and Easy Diffusion.

sdkit v3 wraps stable-diffusion.cpp with an API server. For now, the API server exposes an API compatible with Forge WebUI. This saves me time, and allows Easy Diffusion to work out-of-the-box with the new C++ based sdkit.
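
As a usage sketch, any A1111/Forge-style client call should work against it. The route and fields below are the standard A1111 API ones (the port, and the exact set of fields sdkit v3 honors, are assumptions):

```python
import base64
import requests

resp = requests.post(
    "http://localhost:7860/sdapi/v1/txt2img",  # standard A1111/Forge route
    json={
        "prompt": "a photo of an astronaut riding a horse",
        "steps": 20,
        "width": 512,
        "height": 512,
        "sampler_name": "Euler a",
    },
    timeout=600,
)
resp.raise_for_status()

# The API returns images as base64-encoded strings.
with open("out.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```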

It compiles and runs quite well. Ran it with Easy Diffusion’s UI. Tested with Vulkan and CUDA, on Windows.

There are a few feature gaps (e.g. gfpgan, more controlnet models, more controlnet filters, more schedulers/samplers, reloading specific models instead of everything), but stable-diffusion.cpp has come a long way over the past year. The performance is reasonable: not as fast as Forge or diffusers, but respectable. I haven’t spent any time on performance optimizations yet.

Nov 19, 2025

Tags: sdkit, ggml, compiler

Following up on the previous post on sdkit v3’s design:

The initial experiments with generating ggml from onnx models were promising, and it looks like a fairly solid path forward. It produces numerically identical results, and there’s a clear path to performance parity with stable-diffusion.cpp after a few basic optimizations (since both will eventually generate the same underlying ggml graph).

But I think it’s better to use the simpler option first, i.e. use stable-diffusion.cpp directly. It mostly meets the design goals for sdkit v3 (after a bit of performance tuning). Everything else is premature optimization and scope bloat.

Nov 18, 2025

Tags: ml, compiler, sdkit, onnx, ggml

Successfully compiled the VAE of Stable Diffusion 1.5 using graph-compiler.

The compiled model is terribly slow because I haven’t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don’t need any clever optimizations to get to decent performance, just basic ones.

It’s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I was able to just compile the exported ONNX model and get the same output values as the original PyTorch implementation (given the same input and weights). I could compile to any platform supported by ggml (CPU, CUDA, ROCm, Vulkan, Metal, etc.) by just changing one flag.

Nov 13, 2025

Tags: ml, compiler, sdkit

PolyBlocks is another interesting ML compiler, built using MLIR. It’s from a startup incubated at IISc Bangalore, run by Uday Bondhugula, who co-authored a paper on compiler optimizations for GPGPUs back in 2008 (17 years ago)!

Some of the compiler passes to keep in mind:

  • fusion
  • tiling
  • use hardware acceleration (like tensor cores)
  • constant folding
  • perform redundant computation to avoid global memory accesses where profitable
  • pack into buffers
  • loop transformation
  • unroll-and-jam (register tiling?)
  • vectorization
  • reorder execution for better spatial, temporal and group reuse
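
To make the first of these concrete, here’s a toy numpy sketch of what fusion buys (nothing PolyBlocks-specific is assumed): the unfused version materializes a full-size intermediate tensor in memory, while a fused kernel computes the result in a single pass.

```python
import numpy as np

a, b, c = (np.random.randn(10_000) for _ in range(3))

# Unfused: two "kernels", with a full-size intermediate written to
# memory and then read back.
tmp = a * b
out_unfused = tmp + c

# Fused: one conceptual kernel, no intermediate buffer. (numpy won't fuse
# this loop for you; a compiler emits the equivalent single GPU kernel.)
out_fused = np.empty_like(a)
for i in range(a.size):
    out_fused[i] = a[i] * b[i] + c[i]

assert np.allclose(out_unfused, out_fused)
```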


Nov 7, 2025

Tags: ml, compiler, onnx, ggml, sdkit, worklog

Wrote a simple script to convert ONNX to GGML. It auto-generates C++ code that calls the corresponding ggml functions (for each ONNX operator). This file can then be compiled and run like a normal C++ ggml program, and will produce the same results as the original model in PyTorch.

The generated file can work on multiple backends: CPU, CUDA, ROCm, Vulkan, Metal etc, by providing the correct compiler flags during cmake -B, e.g. -D GGML_CUDA=1 for CUDA.
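
The script isn’t reproduced here, but the core idea fits in a few lines. This simplified sketch walks the graph and emits one ggml call per node (a real converter also has to sanitize tensor names for C++, and handle transposes for ops like MatMul, since ggml_mul_mat’s operand convention differs from ONNX’s):

```python
import onnx

# A tiny subset of the ONNX-operator -> ggml-function mapping.
OP_MAP = {
    "Add": "ggml_add",
    "Mul": "ggml_mul",
    "MatMul": "ggml_mul_mat",
    "Relu": "ggml_relu",
}

def emit_ggml_calls(model_path: str):
    model = onnx.load(model_path)
    for node in model.graph.node:
        fn = OP_MAP.get(node.op_type)
        if fn is None:
            raise NotImplementedError(f"no ggml mapping for {node.op_type}")
        args = ", ".join(node.input)
        # Each node becomes one line of the generated C++ file.
        print(f"struct ggml_tensor * {node.output[0]} = {fn}(ctx, {args});")
```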

Nov 5, 2025

Tags: easydiffusion, sdkit

Following up on the deep-dive on ML compilers:

sdkit v3 won’t use general-purpose ML compilers. They aren’t yet ready for sdkit’s target platforms, and need a lot of work (well beyond sdkit v3’s scope). But I’m quite certain that sdkit v4 will use them, and sdkit v3 will start making steps in that direction.

For sdkit v3, I see two possible paths:

  1. Use an array of vendor-specific compilers (like TensorRT-RTX, MiGraphX, OpenVINO etc), one for each target platform.
  2. Auto-generate ggml code from onnx (or pytorch), and beat it on the head until it meets sdkit v3’s performance goals. Hand-tune kernels, contribute to ggml, and take advantage of ggml’s multi-backend kernels.

Both approaches provide a big step-up from sdkit v2 in terms of install size and performance. So it makes sense to tap into these first, and leave ML compilers for v4 (as another leap forward).

Nov 5, 2025

Tags: easydiffusion, sdkit, compilers

This post concludes (for now) my ongoing deep-dive into ML compilers, while researching for sdkit v3. I’ve linked (at the end) to some of the papers that I read related to graph execution on GPUs.

Some final takeaways:

  1. ML compilers might break CUDA’s moat (and fix AMD’s ROCm support).
  2. A single compiler is unlikely to fit every scenario.
  3. The scheduler needs to be grounded in truth.
  4. Simulators might be worth exploring more.

ML compilers might break CUDA’s moat (and fix AMD’s ROCm support)

It’s pretty clear that ML compilers are going to be a big deal. NVIDIA’s TensorRT is also an ML compiler, but it only targets their GPUs. Once the generated machine code (from cross-vendor ML compilers) is comparable in performance to hand-tuned kernels, these compilers are going to break the (in)famous moat of CUDA.

Oct 27, 2025

Tags: gpu, ai, sdkit

A possible intuition for understanding GPU memory hierarchy (and the performance penalty for data transfer between various layers) is to think of it like a manufacturing logistics problem:

  1. CPU (host) to GPU (device) is like travelling overnight between two cities. The CPU city is like the “headquarters”, and contains a mega-sized warehouse of parts (think football field sizes), also known as ‘Host memory’.
  2. Each GPU is like a different city, containing its own warehouse outside the city, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (CPU).
  3. Each SM/Core/Tile is a factory located in different areas of the city. Each factory contains a small warehouse for stockpiling whatever inventory it needs, also known as ‘Shared Memory’.
  4. Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. There’s a tray next to each machine, also known as ‘Registers’. This tray is used for keeping stuff temporarily for each stamping process.

This analogy can help in understanding the scale of the performance penalties involved in these data transfers.

Oct 22, 2025

Tags: easydiffusion, samplers, c++

Wrote a fresh implementation of most of the popular samplers and schedulers used for image generation (Stable Diffusion and Flux) at https://github.com/cmdr2/samplers.cpp. A few other schedulers (like Align Your Steps) have been left out for now, but are pretty easy to implement.

It’s still a work-in-progress, and isn’t ready for public use. The algorithmic port is complete; the next step is to test the output values against reference values from another implementation (e.g. Forge WebUI). After that, I’ll translate it to C++.
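
For flavor, here’s the shape of the simplest sampler/scheduler pair in that family: the Karras noise schedule and a plain Euler sampler, sketched in numpy (this is the textbook k-diffusion formulation, not the repo’s actual code; the sigma defaults are the usual SD 1.x values):

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.0292, sigma_max=14.6146, rho=7.0):
    """Karras et al. (2022) schedule: interpolate in sigma^(1/rho) space."""
    ramp = np.linspace(0, 1, n)
    min_inv, max_inv = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    sigmas = (max_inv + ramp * (min_inv - max_inv)) ** rho
    return np.append(sigmas, 0.0)  # end at sigma = 0 (fully denoised)

def sample_euler(denoise, x, sigmas):
    """Euler sampler: follow d = (x - denoised) / sigma between sigma steps."""
    for i in range(len(sigmas) - 1):
        denoised = denoise(x, sigmas[i])         # model's estimate of the clean image
        d = (x - denoised) / sigmas[i]           # derivative dx/dsigma
        x = x + d * (sigmas[i + 1] - sigmas[i])  # step to the next (lower) sigma
    return x
```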

Oct 10, 2025

Tags: easydiffusion, sdkit, compilers

Some notes on machine-learning compilers, gathered while researching tech for Easy Diffusion’s next engine (i.e. sdkit v3). For context, see the design constraints of the new engine.

tl;dr summary

The current state is:

  1. Vendor-specific compilers are the only performant options on consumer GPUs right now, e.g. TensorRT-RTX for NVIDIA, MiGraphX for AMD, OpenVINO for Intel.
  2. Cross-vendor compilers (e.g. TVM, IREE, XLA) are just not performant enough right now for Stable Diffusion-class workloads on consumer GPUs.

The focus of cross-vendor compilers seems to be either on datacenter hardware or on embedded devices; their performance on desktops and laptops is pretty poor. Mojo doesn’t target this category (and doesn’t support Windows). This is probably because datacenters and embedded devices are where the attention (and money) currently is.