Aug 25, 2025

Experimented with TensorRT-RTX (NVIDIA's new inference library for RTX GPUs).

The first step was a tiny toy model, just to get the build and test setup working.

The reference model in PyTorch:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)  # 3 -> 8 channels, 3x3 kernel
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))  # global average pool
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

I ran this for 10K iterations with float32 data on an NVIDIA RTX 4060 Laptop GPU (8 GB), on both Windows and Ubuntu under WSL.
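Roughly, the timing loop for the plain-torch case looks like this (a minimal sketch; the batch size and input resolution here are placeholders, not the exact values behind the numbers below):

import time
import torch

def benchmark(model, iters=10_000, batch=1, size=64, device="cuda"):
    # Assumed input shape; the run above doesn't fix batch size or resolution.
    x = torch.randn(batch, 3, size, size, device=device, dtype=torch.float32)
    model = model.to(device).eval()

    with torch.no_grad():
        # Warm-up so one-time setup (kernel loading, autotuning) is not timed.
        for _ in range(50):
            model(x)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
        return time.perf_counter() - start

print(f"plain torch: {benchmark(TinyCNN()):.2f}s")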

I ported this model to plain torch, torch.compile, TensorRT, TensorRT-RTX, a fused CUDA kernel, a fused Vulkan compute shader, ggml + CUDA, and ggml + Vulkan.
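The torch.compile port is a one-liner, and the TensorRT / TensorRT-RTX ports typically start from an ONNX export of the same model. A sketch of that starting point (file name, input shape, and opset are illustrative, not necessarily what the builds below used):

import torch

model = TinyCNN().eval().cuda()
example = torch.randn(1, 3, 64, 64, device="cuda")  # assumed input shape

# torch.compile variant (Triton backend on Linux)
compiled = torch.compile(model)
with torch.no_grad():
    compiled(example)  # first call triggers compilation

# ONNX export, the usual input format for the TensorRT and TensorRT-RTX builders
torch.onnx.export(
    model,
    example,
    "tinycnn.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=17,
)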

I’ve included the performance numbers below, but they shouldn’t be taken too seriously: the model is too small (in computational complexity and data size) to paint a true picture. The intent is just to verify that the different test setups are working sanely.
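Part of that sanity check is simply that all the ports produce (near-)identical logits for the same input. A minimal sketch of such a comparison, assuming each backend's output has been copied back to the host as a float32 numpy array (the helper and tolerances are my own, not part of the harness above):

import numpy as np

def check_outputs(reference, candidate, name, atol=1e-4, rtol=1e-4):
    # reference/candidate: float32 numpy arrays of logits from two backends
    max_diff = np.max(np.abs(reference - candidate))
    if np.allclose(reference, candidate, atol=atol, rtol=rtol):
        print(f"{name}: OK (max abs diff {max_diff:.2e})")
    else:
        print(f"{name}: MISMATCH (max abs diff {max_diff:.2e})")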

For 10k iterations:

Time   Framework                     Environment
1.6s   plain torch                   Ubuntu Linux (WSL)
1.9s   plain torch                   Windows
2.6s   torch.compile() with Triton   Ubuntu Linux (WSL)
1.7s   TensorRT RTX                  Windows
1.6s   TensorRT                      Windows
1.6s   fused CUDA kernel             Windows
5.1s   fused Vulkan shader           Windows
2.3s   ggml + CUDA                   Windows
5.3s   ggml + Vulkan                 Windows

It’s interesting that torch.compile() (2.6s on Ubuntu Linux/WSL) was slower than plain torch on both Windows (1.9s) and Ubuntu Linux (1.6s). And plain torch on Windows was pretty close to TensorRT and TensorRT RTX.

Maybe the model (and data) is too small? I’ll pick a more representative model next: the UNet from Stable Diffusion 1.5.