Closed
Description
System Info
GPU: 8x NVIDIA B200 180 GB vs 8x NVIDIA H200 141 GB
TRT-LLM: docker image built from source for v0.21.0rc2
OS: Ubuntu 22.04
Issue
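We see a consistent and significant slowdown in TTFT for DeepSeek-R1 (DS-R1) FP4 on B200 compared with DS-R1 FP8 on H200 whenever ep_size is set (EP=4 or EP=8). With ep_size=None we see the expected speed-up. ITL improves in every configuration. In the table below, each Speed-up row is the FP8 (H200) value divided by the FP4 (B200) value, so a value below 1 means FP4 on B200 is slower. The following numbers were obtained by running LLMPerf against trtllm-serve: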
Model | Batch Size | Input | Output | ITL p50 TP=8, EP=None | ITL p50 TP=8, EP=4 | ITL p50 TP=8, EP=8 | TTFT p50 TP=8, EP=None | TTFT p50 TP=8, EP=4 | TTFT p50 TP=8, EP=8 |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-R1 FP8 | 1 | 1600 | 600 | 0.01378 | 0.01081 | 0.011 | 0.16463 | 0.13436 | 0.12411 |
DeepSeek-R1 FP4 | 1 | 1600 | 600 | 0.00678 | 0.00661 | 0.00681 | 0.16104 | 0.16018 | 0.17265 |
Speed-up | | | | 2.03332 | 1.63469 | 1.61488 | 1.02231 | 0.83877 | 0.71885 |
DeepSeek-R1 FP8 | 1 | 8192 | 600 | 0.01453 | 0.01159 | 0.01178 | 0.51583 | 0.37007 | 0.35281 |
DeepSeek-R1 FP4 | 1 | 8192 | 600 | 0.00714 | 0.00697 | 0.00717 | 0.39411 | 0.39012 | 0.44809 |
Speed-up | | | | 2.0356 | 1.66225 | 1.64388 | 1.30884 | 0.9486 | 0.78736 |
DeepSeek-R1 FP8 | 8 | 1600 | 600 | 0.01818 | 0.01566 | 0.01642 | 0.69481 | 0.49538 | 0.47733 |
DeepSeek-R1 FP4 | 8 | 1600 | 600 | 0.00885 | 0.00904 | 0.00926 | 0.62297 | 0.64562 | 0.73009 |
Speed-up | | | | 2.05344 | 1.73273 | 1.77381 | 1.11532 | 0.7673 | 0.65379 |
DeepSeek-R1 FP8 | 8 | 8192 | 600 | 0.02124 | 0.01802 | 0.01903 | 2.26644 | 1.44035 | 1.37933 |
DeepSeek-R1 FP4 | 8 | 8192 | 600 | 0.01131 | 0.01135 | 0.01176 | 1.77425 | 1.70643 | 2.04391 |
Speed-up | | | | 1.87824 | 1.58823 | 1.61873 | 1.27741 | 0.84407 | 0.67485 |
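
For anyone re-checking the numbers: the Speed-up rows appear to be the FP8-on-H200 latency divided by the FP4-on-B200 latency. Below is a minimal sketch that recomputes them for the first block (batch size 1, input 1600, output 600). The dictionary layout and the `speedup` helper are illustrative only and are not part of LLMPerf or TRT-LLM; the values are copied from the table and are assumed to be in seconds.

```python
# p50 latencies for batch size 1, input 1600, output 600, copied from the table.
# Assumption: values are in seconds.
FP8_H200 = {
    "ITL": {"EP=None": 0.01378, "EP=4": 0.01081, "EP=8": 0.011},
    "TTFT": {"EP=None": 0.16463, "EP=4": 0.13436, "EP=8": 0.12411},
}
FP4_B200 = {
    "ITL": {"EP=None": 0.00678, "EP=4": 0.00661, "EP=8": 0.00681},
    "TTFT": {"EP=None": 0.16104, "EP=4": 0.16018, "EP=8": 0.17265},
}


def speedup(baseline, candidate):
    """Return baseline / candidate per metric and config.

    A ratio below 1 means the candidate is slower than the baseline.
    """
    return {
        metric: {
            cfg: baseline[metric][cfg] / candidate[metric][cfg]
            for cfg in baseline[metric]
        }
        for metric in baseline
    }


ratios = speedup(FP8_H200, FP4_B200)

for metric, by_cfg in ratios.items():
    for cfg, ratio in by_cfg.items():
        verdict = "faster" if ratio > 1 else "slower"
        print(f"{metric} p50, TP=8, {cfg}: {ratio:.3f}x (FP4 on B200 is {verdict})")
```

Because the inputs here are the rounded table values, the recomputed ratios can differ from the reported Speed-up rows in the third or fourth decimal place.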