Nov 19, 2025

Tags: sdkit, ggml, compiler

Following up on the previous post about sdkit v3’s design:

The initial experiments with generating ggml from ONNX models were promising, and it looks like a fairly solid path forward. It produces numerically identical results, and there’s a clear path to reaching performance parity with stable-diffusion.cpp with a few basic optimizations (since both will eventually generate the same underlying ggml graph).

But I think it’s better to use the simpler option first, i.e. use stable-diffusion.cpp directly. It mostly meets the design goals for sdkit v3 (after a bit of performance tuning). Everything else is premature optimization and scope bloat.

Nov 18, 2025

Tags: ml, compiler, sdkit, onnx, ggml

Successfully compiled the VAE of Stable Diffusion 1.5 using graph-compiler.

The compiled model is terribly slow because I haven’t written any performance optimizations, and it (conservatively) converts a lot of intermediate tensors to contiguous copies. But we don’t need any clever optimizations to get to decent performance, just basic ones.

It’s pretty exciting because I was able to bypass the need to port the model to C++ manually. Instead, I just compiled the exported ONNX model and got the same output values as the original PyTorch implementation (given the same input and weights). I could target any platform supported by ggml by changing one flag (e.g. CPU, CUDA, ROCm, Vulkan, Metal).

Nov 7, 2025

Tags: ml, compiler, onnx, ggml, sdkit, worklog

Wrote a simple script to convert ONNX to GGML. It auto-generates C++ code that calls the corresponding ggml function for each ONNX operator. The generated file can then be compiled and run like a normal C++ ggml program, and produces the same results as the original model in PyTorch.
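The core of such a converter is just a table from ONNX op types to ggml builder functions, plus a line emitter. A minimal sketch (the op table and emit_node function are my illustration, not the actual script, though the ggml function names are real):

```python
# Hypothetical sketch of the ONNX -> ggml code generation step.
# Maps ONNX operator types to the corresponding ggml graph-builder calls.
ONNX_TO_GGML = {
    "Relu": "ggml_relu",
    "Add": "ggml_add",
    "MatMul": "ggml_mul_mat",
    "Conv": "ggml_conv_2d",
    "Sigmoid": "ggml_sigmoid",
}

def emit_node(op_type: str, out_name: str, inputs: list) -> str:
    """Emit one C++ line that adds the ggml node for an ONNX node."""
    fn = ONNX_TO_GGML[op_type]
    args = ", ".join(["ctx"] + inputs)
    return f"struct ggml_tensor * {out_name} = {fn}({args});"

# Example: an ONNX Relu node applied to tensor 'conv_0'
print(emit_node("Relu", "relu_0", ["conv_0"]))
# -> struct ggml_tensor * relu_0 = ggml_relu(ctx, conv_0);
```

Walking the ONNX graph in topological order and emitting one such line per node yields a complete C++ function that rebuilds the model as a ggml compute graph.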

The generated file works on multiple backends (CPU, CUDA, ROCm, Vulkan, Metal, etc.) by passing the appropriate compiler flag during cmake -B, e.g. -D GGML_CUDA=1 for CUDA.
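A typical out-of-source build might look like this (a config sketch; only GGML_CUDA is taken from the text above, the alternative flags follow ggml's CMake option naming):

```shell
# Configure for the CUDA backend; swap the flag for other backends,
# e.g. -D GGML_VULKAN=1 or -D GGML_METAL=1.
cmake -B build -D GGML_CUDA=1
cmake --build build --config Release
```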

Aug 25, 2025

Tags: tensorRT, torch, easydiffusion, ggml, cuda, vulkan

Experimented with TensorRT-RTX (a new library offered by NVIDIA).

The first step was a tiny toy model, just to get the build and test setup working.

The reference model in PyTorch:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

I ran this on an NVIDIA 4060 8 GB (Laptop) GPU for 10K iterations, on Windows and on WSL with Ubuntu, with float32 data.
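The timing harness was roughly of this shape (a sketch with my own names, not the original script; the model is repeated so the snippet runs standalone, and the iteration count is reduced here):

```python
import time
import torch
import torch.nn as nn

class TinyCNN(nn.Module):  # same model as above, repeated for a self-contained snippet
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))

model = TinyCNN().eval()
x = torch.randn(4, 3, 32, 32)  # float32 batch; spatial size is arbitrary

with torch.no_grad():
    model(x)  # warm-up pass before timing
    start = time.perf_counter()
    for _ in range(100):  # the real run used 10K iterations
        out = model(x)
    elapsed = time.perf_counter() - start

print(f"{elapsed / 100 * 1e3:.3f} ms/iter, output shape {tuple(out.shape)}")
```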