DS-R1 FP4 on B200 vs DS-R1 FP8 on H200 --- slow-down in TTFT #5929

@matkle

System Info
GPU: 8x NVIDIA B200 180 GB vs 8x NVIDIA H200 141 GB
TRT-LLM: docker image built from source for v0.21.0rc2
OS: Ubuntu 22.04

Issue
We see a consistent and significant TTFT slow-down for DS-R1 FP4 on B200 compared with DS-R1 FP8 on H200 whenever expert parallelism is enabled (ep_size=4 or ep_size=8). With ep_size=None we see the expected speed-up. ITL improves in every configuration. The following numbers were obtained by running LLMPerf against trtllm-serve:

| Model | Batch Size | Input | Output | ITL p50 (TP=8, EP=None) | ITL p50 (TP=8, EP=4) | ITL p50 (TP=8, EP=8) | TTFT p50 (TP=8, EP=None) | TTFT p50 (TP=8, EP=4) | TTFT p50 (TP=8, EP=8) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1 FP8 | 1 | 1600 | 600 | 0.01378 | 0.01081 | 0.011 | 0.16463 | 0.13436 | 0.12411 |
| DeepSeek-R1 FP4 | 1 | 1600 | 600 | 0.00678 | 0.00661 | 0.00681 | 0.16104 | 0.16018 | 0.17265 |
| Speed-up | | | | 2.03332 | 1.63469 | 1.61488 | 1.02231 | 0.83877 | 0.71885 |
| DeepSeek-R1 FP8 | 1 | 8192 | 600 | 0.01453 | 0.01159 | 0.01178 | 0.51583 | 0.37007 | 0.35281 |
| DeepSeek-R1 FP4 | 1 | 8192 | 600 | 0.00714 | 0.00697 | 0.00717 | 0.39411 | 0.39012 | 0.44809 |
| Speed-up | | | | 2.0356 | 1.66225 | 1.64388 | 1.30884 | 0.9486 | 0.78736 |
| DeepSeek-R1 FP8 | 8 | 1600 | 600 | 0.01818 | 0.01566 | 0.01642 | 0.69481 | 0.49538 | 0.47733 |
| DeepSeek-R1 FP4 | 8 | 1600 | 600 | 0.00885 | 0.00904 | 0.00926 | 0.62297 | 0.64562 | 0.73009 |
| Speed-up | | | | 2.05344 | 1.73273 | 1.77381 | 1.11532 | 0.7673 | 0.65379 |
| DeepSeek-R1 FP8 | 8 | 8192 | 600 | 0.02124 | 0.01802 | 0.01903 | 2.26644 | 1.44035 | 1.37933 |
| DeepSeek-R1 FP4 | 8 | 8192 | 600 | 0.01131 | 0.01135 | 0.01176 | 1.77425 | 1.70643 | 2.04391 |
| Speed-up | | | | 1.87824 | 1.58823 | 1.61873 | 1.27741 | 0.84407 | 0.67485 |
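
For anyone re-checking the numbers: the Speed-up rows appear to be the FP8 (H200) value divided by the FP4 (B200) value, so anything below 1 means FP4 on B200 is slower. This definition is inferred from the reported values rather than stated explicitly. A minimal Python sketch that recomputes the TTFT ratios for the batch size 1, input 1600 rows (the dictionary and function names are illustrative only):

```python
# TTFT p50 values copied from the batch size 1, input 1600, output 600 rows.
ttft_fp8_h200 = {"EP=None": 0.16463, "EP=4": 0.13436, "EP=8": 0.12411}
ttft_fp4_b200 = {"EP=None": 0.16104, "EP=4": 0.16018, "EP=8": 0.17265}


def speed_up(fp8, fp4):
    """Assumed definition: FP8-on-H200 latency divided by FP4-on-B200 latency."""
    return {ep: fp8[ep] / fp4[ep] for ep in fp8}


ratios = speed_up(ttft_fp8_h200, ttft_fp4_b200)
for ep, ratio in ratios.items():
    # A ratio above 1 means FP4 on B200 has the lower (better) TTFT.
    verdict = "FP4 faster" if ratio > 1 else "FP4 slower"
    print(f"{ep}: {ratio:.3f} ({verdict})")
```

The recomputed ratios match the reported Speed-up row to within rounding of the displayed values: only EP=None comes out above 1.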
