Nov 19, 2025

Following up on the previous post on sdkit v3’s design:

The initial experiments with generating ggml from onnx models were promising, and it looks like a fairly solid path forward. It produces numerically identical results, and there’s a clear path to reaching performance parity with stable-diffusion.cpp with a few basic optimizations (since both will eventually generate the same underlying ggml graph).

But I think it’s better to use the simpler option first, i.e. use stable-diffusion.cpp directly. It mostly meets the design goals for sdkit v3 (after a bit of performance tuning). Everything else is premature optimization and scope bloat.

Here’s a possible roadmap instead:

  1. sdkit v3 - Use stable-diffusion.cpp, and change whatever’s necessary to support Easy Diffusion’s requirements. Upstream the changes where possible.
  2. sdkit v4 - Develop graph-compiler further to generate ggml automatically from onnx (for the required models). This will bypass the need for hand-written ports of models like in sd.cpp, and enable further high-level optimizations that’ll run automatically on the graph.
  3. sdkit v5 - Add automatic GPU kernel code-generation in graph-compiler, bypassing the need for hand-written kernels like in ggml. This will enable further low-level optimizations around tiling, fusion, memory transfers etc.
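
To make "high-level optimizations that’ll run automatically on the graph" a bit more concrete, here is a toy sketch of one such pass: fusing a `Conv` followed by a `Relu` into a single node. This is not graph-compiler's actual design. The `Node` class, the `fuse_conv_relu` function and the fused `ConvRelu` op name are all hypothetical, and the pass assumes the node list is already topologically sorted.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical graph node: an op name, input tensor names, one output name."""
    op: str
    inputs: list
    output: str


def fuse_conv_relu(nodes):
    """Fuse adjacent Conv -> Relu pairs into one hypothetical 'ConvRelu' node.

    Assumes `nodes` is topologically sorted. A pair is fused only if the
    Conv output has exactly one consumer (the Relu), so no other node
    loses access to the intermediate tensor.
    """
    # Count how many nodes read each tensor.
    consumers = {}
    for node in nodes:
        for name in node.inputs:
            consumers[name] = consumers.get(name, 0) + 1

    fused = []
    i = 0
    while i < len(nodes):
        cur = nodes[i]
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        can_fuse = (
            cur.op == "Conv"
            and nxt is not None
            and nxt.op == "Relu"
            and nxt.inputs == [cur.output]
            and consumers.get(cur.output, 0) == 1
        )
        if can_fuse:
            # The fused node takes the Conv inputs and produces the Relu output.
            fused.append(Node("ConvRelu", cur.inputs, nxt.output))
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


# A tiny made-up graph: Conv -> Relu -> Conv -> Add (residual).
graph = [
    Node("Conv", ["x", "w1"], "t1"),
    Node("Relu", ["t1"], "t2"),
    Node("Conv", ["t2", "w2"], "t3"),
    Node("Add", ["t3", "x"], "y"),
]
optimized = fuse_conv_relu(graph)
print([n.op for n in optimized])
```

A real optimizer would match patterns on the graph structure rather than on list adjacency, but the idea is the same: the rewrite is written once and then applies to every model that passes through the compiler, instead of being hand-applied per model port.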

The benefits are:

  1. Saves time. For example, I don’t need to reimplement LoRA, ControlNet etc. right away, or write my own GPU kernel code-generator.
  2. Keeps delivering value to users ASAP. I don’t need to wait for massive projects to finish before delivering value.
  3. Helps gain experience progressively. For example, the experience of manually optimizing the Conv2D operator bottleneck (and the overall graph) in stable-diffusion.cpp will be useful later (when building an automatic optimizer).