Oct 27, 2025
A possible intuition for understanding the GPU memory hierarchy (and the performance penalty for moving data between its layers) is to think of it as a manufacturing logistics problem:
- Moving data from the CPU (host) to the GPU (device) is like travelling overnight between two cities. The CPU city is the “headquarters”, and contains a mega-sized warehouse of parts (think football-field size), also known as ‘Host Memory’.
- Each GPU is a different city, with its own warehouse on the outskirts, also known as ‘Global Memory’. This warehouse stockpiles whatever it needs from the headquarters city (the CPU).
- Each SM/Core/Tile is a factory located in a different area of the city. Each factory contains a small warehouse (a shed) for stockpiling whatever inventory it needs, also known as ‘Shared Memory’.
- Each warp is a bulk stamping machine inside the factory, producing 32 items in one shot. Next to each machine sits a tray, also known as ‘Registers’, which holds material temporarily during each stamping run.
This analogy can help convey the relative scale of each level and the performance penalty of moving data between them.
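To make the “travel between cities” leg concrete, here's a minimal PyTorch sketch (assuming a CUDA-capable GPU; the matrix size and timing method are just illustrative). It times a host-to-device copy against a matrix multiply on data already resident in Global Memory:

```python
import time
import torch

torch.ones(1, device="cuda")  # warm up: initialize the CUDA context before timing

# Host Memory ("headquarters warehouse"): the tensor starts on the CPU.
x_cpu = torch.randn(4096, 4096)

# Host -> device copy: the "overnight trip" to the GPU city's warehouse (Global Memory).
torch.cuda.synchronize()
t0 = time.time()
x_gpu = x_cpu.to("cuda")
torch.cuda.synchronize()
print(f"host -> device copy: {time.time() - t0:.4f}s")

# Compute on data already in Global Memory; the matmul kernel internally stages
# tiles through Shared Memory (the factory shed) and Registers (the tray).
t0 = time.time()
y = x_gpu @ x_gpu
torch.cuda.synchronize()
print(f"on-device matmul:    {time.time() - t0:.4f}s")
```

The copy crosses the PCIe link between the two “cities”, while the matmul stays entirely within the GPU's own hierarchy, which is why keeping data resident on the device (and reusing it) matters so much.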
Sep 4, 2024
Tags: easydiffusion, ai, lab, performance, featured
tl;dr: Explored a possible optimization for Flux with diffusers when using enable_sequential_cpu_offload(). It did not work.
While trying to use Flux (nearly 22 GB of weights) with diffusers on a 12 GB graphics card, I noticed that it barely used any GPU memory when using enable_sequential_cpu_offload(), and it was super slow. It turns out that the largest module in Flux’s transformer model is around 108 MB, and because diffusers streams modules to the GPU one at a time, peak VRAM usage never rose above a few hundred MB.
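For reference, here's a hedged sketch of the setup described above (the model ID, dtype, prompt, and the way module sizes are measured below are assumptions for illustration, not an exact record of the run):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Streams one module at a time from host memory to the GPU during the forward
# pass, so peak VRAM stays tiny, but every step pays the transfer cost repeatedly.
pipe.enable_sequential_cpu_offload()

# Low step count just to exercise the pipeline; real generations need more steps.
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=4).images[0]

# Rough check of the observations above: peak VRAM actually allocated, and the
# size of the largest individual module in the transformer (an approximation of
# the offloading granularity).
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")
largest = max(
    sum(p.numel() * p.element_size() for p in m.parameters(recurse=False))
    for m in pipe.transformer.modules()
)
print(f"largest module: {largest / 1024**2:.0f} MB")
```

With sequential offload, every module makes the host-to-device round trip on every denoising step, which explains both the low VRAM usage and the slowdown.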