Aug 25, 2025

Experimented with TensorRT-RTX (NVIDIA's new inference library for RTX GPUs).

The first step was a tiny toy model, just to get the build and test setup working.

The reference model in PyTorch:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, stride=1, padding=1)  # 3 -> 8 channels, 3x3 kernel
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))  # global average pool
        self.fc = nn.Linear(8, 4)  # 4-class toy output

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

I ran this for 10K iterations with float32 data on an NVIDIA RTX 4060 Laptop GPU (8 GB), on both Windows and Ubuntu under WSL.
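Roughly, the timing loop for the plain-torch case looks like this (a minimal sketch; the batch size and input resolution here are placeholders, not the exact values behind the numbers below):

import time
import torch

def benchmark(model, iters=10_000, batch=1, size=64, device="cuda"):
    # Assumed input shape; the run above doesn't fix batch size or resolution.
    x = torch.randn(batch, 3, size, size, device=device, dtype=torch.float32)
    model = model.to(device).eval()

    with torch.no_grad():
        # Warm-up so one-time setup (kernel loading, autotuning) is not timed.
        for _ in range(50):
            model(x)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
        return time.perf_counter() - start

print(f"plain torch: {benchmark(TinyCNN()):.2f}s")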

I ported this model to plain torch, torch.compile, TensorRT, TensorRT-RTX, a fused CUDA kernel, a fused Vulkan compute shader, ggml + CUDA, and ggml + Vulkan.
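The torch.compile port is a one-liner, and the TensorRT / TensorRT-RTX ports typically start from an ONNX export of the same model. A sketch of that starting point (file name, input shape, and opset are illustrative, not necessarily what the builds below used):

import torch

model = TinyCNN().eval().cuda()
example = torch.randn(1, 3, 64, 64, device="cuda")  # assumed input shape

# torch.compile variant (Triton backend on Linux)
compiled = torch.compile(model)
with torch.no_grad():
    compiled(example)  # first call triggers compilation

# ONNX export, the usual input format for the TensorRT and TensorRT-RTX builders
torch.onnx.export(
    model,
    example,
    "tinycnn.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=17,
)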

I’ve included the performance numbers below, but they shouldn’t be taken too seriously: the model is too small (in computational complexity and data size) to paint a true picture. The intent is just to verify that the different test setups are working sanely.
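Part of that sanity check is simply that all the ports produce (near-)identical logits for the same input. A minimal sketch of such a comparison, assuming each backend's output has been copied back to the host as a float32 numpy array (the helper and tolerances are my own, not part of the harness above):

import numpy as np

def check_outputs(reference, candidate, name, atol=1e-4, rtol=1e-4):
    # reference/candidate: float32 numpy arrays of logits from two backends
    max_diff = np.max(np.abs(reference - candidate))
    if np.allclose(reference, candidate, atol=atol, rtol=rtol):
        print(f"{name}: OK (max abs diff {max_diff:.2e})")
    else:
        print(f"{name}: MISMATCH (max abs diff {max_diff:.2e})")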

For 10k iterations:

Time   Framework                     Environment
1.6s   plain torch                   Ubuntu Linux (WSL)
1.9s   plain torch                   Windows
2.6s   torch.compile() with Triton   Ubuntu Linux (WSL)
1.7s   TensorRT RTX                  Windows
1.6s   TensorRT                      Windows
1.6s   fused CUDA kernel             Windows
5.1s   fused Vulkan shader           Windows
2.3s   ggml + CUDA                   Windows
5.3s   ggml + Vulkan                 Windows

It’s interesting that torch.compile() (2.6s on Ubuntu Linux/WSL) was slower than plain torch on both Windows (1.9s) and Ubuntu Linux (1.6s). And plain torch on Windows was pretty close to TensorRT and TensorRT RTX.

Maybe the model (and data) is too small? I’ll pick a more representative model next: the UNet from Stable Diffusion 1.5.