Machine setup (GPU)
floyo runs on NVIDIA H100 NVL GPUs. Each H100 NVL delivers 3.9 TB/s of HBM3 memory bandwidth, over 2x the bandwidth of RTX PRO 6000 GPUs, which helps workflows run faster, especially for higher-res generations, larger models, and longer videos.

| Spec | H100 NVL | RTX PRO 6000 | RTX 5090 | What it means (AI/ML workflows) |
| --- | --- | --- | --- | --- |
| Memory type & size | 94 GB HBM3 | 96 GB GDDR7 | 32 GB GDDR7 | HBM3 vs GDDR7: HBM3 is "on-package" stacked memory built for extreme speed; GDDR7 is fast memory on the board, but with less bandwidth than HBM setups. |
| Memory bandwidth | 3.938 TB/s | 1.792 TB/s | 1.792 TB/s | Bandwidth is how fast the GPU can read/write VRAM. Diffusion/video workflows move a ton of data every step, so when bandwidth is the limiter, higher bandwidth directly translates to faster steps and better throughput (especially big images, long video, and multi-model workflows). The H100 NVL has ~2.2x the memory bandwidth of the RTX PRO 6000 and RTX 5090 (see the back-of-the-envelope sketch below the table). |
| FP8 Tensor | 3,341 TFLOPS | 2,015 TFLOPS | 1,676 TFLOPS | Think of this as top-end AI math speed when a workflow/runtime can use FP8. If a workflow is FP8-optimized (quantized), higher FP8 throughput means more images/frames per minute. The H100 leads by a lot here. |
| FP16 Tensor | 1,671 TFLOPS | 1,008 TFLOPS | 838 TFLOPS | Most open-source image/video generation inference runs in FP16/BF16 today, so this often tracks real-world speed for diffusion steps. The H100 is materially ahead, which is why we standardize on the H100 NVL for predictable, fast workflow runs. |
| TF32 Tensor | 835 TFLOPS | 504 TFLOPS | 210 TFLOPS | Mostly relevant to training/fine-tuning and some mixed-precision training paths; less important for typical "run a workflow and generate" inference. |

Tensor TFLOPS are peak theoretical values, shown with sparsity where applicable; real-world speed varies by model, precision/quantization, and which part of the workflow is the bottleneck (often memory bandwidth for diffusion/video).
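To confirm which GPU a machine exposes at runtime, a quick check like the following can help. This is a minimal sketch assuming PyTorch is installed in the workflow environment; it is not a floyo-specific API.

```python
# Minimal sketch: confirm which GPU this machine exposes.
# Assumes PyTorch is installed (an assumption, not guaranteed by the docs above).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:          {props.name}")                        # e.g. "NVIDIA H100 NVL"
    print(f"VRAM:         {props.total_memory / 1024**3:.0f} GB")
    print(f"Compute cap.: {props.major}.{props.minor}")         # H100 (Hopper) reports 9.0
else:
    print("No CUDA GPU visible")
```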
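The "bandwidth as the limiter" point can be made concrete with rough arithmetic: a denoising step has to read every model weight from VRAM at least once, so bandwidth puts a hard floor on per-step time. The sketch below uses the bandwidth figures from the table plus a hypothetical 12B-parameter FP16 model; the model size is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope sketch of why memory bandwidth bounds diffusion step time.
GPU_BANDWIDTH_TBPS = {          # figures from the table above
    "H100 NVL": 3.938,
    "RTX PRO 6000": 1.792,
    "RTX 5090": 1.792,
}

MODEL_PARAMS_B = 12             # hypothetical 12B-parameter diffusion model
BYTES_PER_PARAM = 2             # FP16/BF16 weights are 2 bytes each

weight_bytes = MODEL_PARAMS_B * 1e9 * BYTES_PER_PARAM
for gpu, tbps in GPU_BANDWIDTH_TBPS.items():
    # Lower bound: every weight must be read from VRAM at least once per step.
    floor_ms = weight_bytes / (tbps * 1e12) * 1e3
    print(f"{gpu:>13}: >= {floor_ms:.1f} ms per denoising step")
```

The ~2.2x bandwidth gap shows up directly as a ~2.2x difference in the per-step floor (roughly 6.1 ms vs 13.4 ms here), which is why bandwidth-bound diffusion workloads track the bandwidth column more closely than the TFLOPS columns.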
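Since most open-source diffusion inference runs in FP16/BF16, loading weights in half precision is typically how the FP16 Tensor column gets exercised. Here is a minimal sketch using Hugging Face diffusers, assuming the library is available in the environment; the model id is just an example, not a floyo default.

```python
# Minimal sketch: FP16 inference with Hugging Face diffusers.
# The model id below is an example checkpoint, not a floyo-specific choice.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,   # FP16 weights: the precision most workflows run in
)
pipe.to("cuda")

image = pipe("a lighthouse at dusk").images[0]
image.save("out.png")
```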