
Commit 0223de0

[None][doc] Add deployment guide section for VDR task (#6669)
Signed-off-by: nv-guomingz <[email protected]>
1 parent 46357e7 commit 0223de0

File tree

4 files changed: +378 -4 lines changed


examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md renamed to docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ docker run --rm -it \
   -p 8000:8000 \
   -v ~/.cache:/root/.cache:rw \
   --name tensorrt_llm \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc5 \
+  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
   /bin/bash
 ```

Lines changed: 364 additions & 0 deletions
@@ -0,0 +1,364 @@
# Quick Start Recipe for Llama3.3 70B on TensorRT-LLM - Blackwell & Hopper Hardware

## Introduction

This deployment guide provides step-by-step instructions for running the Llama 3.3-70B Instruct model using TensorRT-LLM with FP8 and NVFP4 quantization, optimized for NVIDIA GPUs. It covers the complete setup required: from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.

The guide is intended for developers and practitioners seeking high-throughput or low-latency inference with NVIDIA's accelerated stack: starting with the PyTorch container from NGC, then installing TensorRT-LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.

## Access & Licensing

To use Llama 3.3-70B, you must first agree to Meta's Llama 3 Community License ([https://ai.meta.com/resources/models-and-libraries/llama-downloads/](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)). NVIDIA's quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
## Prerequisites

* GPU: NVIDIA Blackwell or Hopper Architecture
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)

## Models

* FP8 model: [Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8)
* NVFP4 model: [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4)

Note that NVFP4 is only supported on NVIDIA Blackwell.
## Deployment Steps

### Run Docker Container

Run the docker container using the TensorRT-LLM NVIDIA NGC image.

```shell
docker run --rm -it \
    --ipc=host \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
    /bin/bash
```

Note:

* You can mount additional directories and paths using the `-v <local_path>:<container_path>` flag if needed, such as the paths to downloaded weights; see the sketch after this list.
* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are stored in `~/.cache/huggingface/hub/` by default. This prevents having to re-download the weights each time you rerun the container. If the `~/.cache` directory doesn't exist, create it with `mkdir ~/.cache`.
* The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host.
* See the [NGC container catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all available containers. Containers published weekly from the main branch carry an "rcN" suffix, while the monthly release that goes through QA tests has no "rcN" suffix. Use an rc release to get the latest model and feature support.
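For example, if you have already downloaded the checkpoint to a directory on the host, you can mount that directory and later point `trtllm-serve` at the mounted path instead of the Hugging Face model ID. The host path below is an illustrative placeholder, not part of the original guide:

```shell
# Illustrative only: mount a host directory containing pre-downloaded checkpoints
docker run --rm -it \
    --ipc=host \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    -v /data/models/Llama-3.3-70B-Instruct-FP8:/workspace/Llama-3.3-70B-Instruct-FP8:ro \
    --name tensorrt_llm \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
    /bin/bash

# Inside the container, the mounted path can then be passed to trtllm-serve
# in place of the Hugging Face model ID, e.g.:
#   trtllm-serve /workspace/Llama-3.3-70B-Instruct-FP8 ...
```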
If you want to use the latest main branch, you can instead build TensorRT-LLM from source; the steps are described at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html).
### Creating the TRT-LLM Server config

We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF
```
### Launch the TRT-LLM Server

Below is an example command to launch the TRT-LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

```shell
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
After the server is set up, the client can now send prompt requests to the server and receive results.

### Configs and Parameters

These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--tp_size`

&emsp;**Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `--ep_size`

&emsp;**Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

#### `--kv_cache_free_gpu_memory_fraction`

&emsp;**Description:** A value between 0.0 and 1.0 that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.

&emsp;**Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower.

#### `--backend pytorch`

&emsp;**Description:** Tells TensorRT-LLM to use the **pytorch** backend.

#### `--max_batch_size`

&emsp;**Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

&emsp;**Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

&emsp;**Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

#### `--trust_remote_code`

&emsp;**Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
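To make these flags concrete, here is a lightly modified variant of the earlier launch command, this time assuming a 2-GPU node and following the OOM recommendation above by lowering the KV-cache fraction to 0.8. The values are illustrative, not tuned settings from this guide:

```shell
# Illustrative variant: two GPUs, more conservative KV-cache memory fraction
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.8 \
    --tp_size 2 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```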
#### Extra LLM API Options (YAML Configuration)

These options provide finer control over performance and are set within a YAML file passed to the trtllm-serve command via the `--extra_llm_api_options` argument.

#### `kv_cache_config`

&emsp;**Description**: A section for configuring the Key-Value (KV) cache.

&emsp;**Options**:

&emsp;&emsp;`dtype`: Sets the data type for the KV cache.

&emsp;&emsp;**Default**: auto (uses the data type specified in the model checkpoint).

#### `cuda_graph_config`

&emsp;**Description**: A section for configuring CUDA graphs to optimize performance.

&emsp;**Options**:

&emsp;&emsp;`enable_padding`: If true, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

&emsp;&emsp;**Default**: false

&emsp;&emsp;`max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.

&emsp;&emsp;**Default**: 0

&emsp;&emsp;**Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.

&emsp;&emsp;`batch_sizes`: A specific list of batch sizes to create CUDA graphs for.

&emsp;&emsp;**Default**: None

#### `moe_config`

&emsp;**Description**: Configuration for Mixture-of-Experts (MoE) models.

&emsp;**Options**:

&emsp;&emsp;`backend`: The backend to use for MoE operations.

&emsp;&emsp;**Default**: CUTLASS

#### `attention_backend`

&emsp;**Description**: The backend to use for attention calculations.

&emsp;**Default**: TRTLLM

See the [TorchLlmArgs](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) class for the full list of options which can be used in the `extra_llm_api_options`.
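As a sketch of how these YAML options fit together, the config file created earlier (`/tmp/config.yml`) could be extended as follows. Only keys documented above are used; the specific values (for example the commented-out `batch_sizes` list) are illustrative rather than recommended settings:

```shell
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
  # Alternatively, list explicit capture sizes instead of max_batch_size (illustrative):
  # batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
kv_cache_config:
  dtype: fp8
# Explicitly select the default attention backend; moe_config only applies to MoE models
attention_backend: TRTLLM
EOF
```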
## Testing API Endpoint

### Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
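If you are scripting the setup, a small polling loop (a sketch, using only the `/health` endpoint shown above) can wait until the server is ready before sending traffic:

```shell
# Wait until the /health endpoint returns HTTP 200, checking every 10 seconds
until [ "$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health)" = "200" ]; do
  echo "Waiting for the TRT-LLM server to become ready..."
  sleep 10
done
echo "Server is ready."
```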
After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.
```shell
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "prompt": "Where is New York?",
    "max_tokens": 16,
    "temperature": 0
}'
```

Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.

```json
{"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
```
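The server also exposes an OpenAI-compatible chat endpoint. A minimal chat-style request, assuming the same model name as above, could look like this (prompt and parameters are illustrative):

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "messages": [
      {"role": "user", "content": "Where is New York?"}
    ],
    "max_tokens": 32,
    "temperature": 0
}'
```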
### Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`
* Ensure your model checkpoints are compatible with the expected format
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running (see the snippet after this list)
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
* For connection issues, make sure port 8000 is not being used by another application
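For the GPU-utilization check, something as simple as the following works from another terminal on the host while requests are in flight:

```shell
# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi
```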
### Running Evaluations to Verify Accuracy (Optional)

We use the lm-eval tool to test the model’s accuracy. For more information see [https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

To run the evaluation harness, exec into the running TensorRT-LLM container and install it with this command:

```shell
docker exec -it tensorrt_llm /bin/bash

pip install lm_eval
```
FP8 command for GSM8K:

* Note: The tokenizer adds a BOS (beginning-of-sequence) token before the input prompt by default, which leads to an accuracy regression on the GSM8K task for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.

```shell
MODEL_PATH=nvidia/Llama-3.3-70B-Instruct-FP8

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp8.gsm8k
```

Sample result on Blackwell:

```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9348|± |0.0068|
| | |strict-match | 5|exact_match|↑ |0.8870|± |0.0087|
```
FP4 command for GSM8K:

* Note: The tokenizer adds a BOS token before the input prompt by default, which leads to an accuracy regression on the GSM8K task for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.

```shell
MODEL_PATH=nvidia/Llama-3.3-70B-Instruct-FP4

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp4.gsm8k
```

Sample result on Blackwell:

```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9356|± |0.0068|
| | |strict-match | 5|exact_match|↑ |0.8393|± |0.0101|
```
## Benchmarking Performance

To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/llama3.3_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model nvidia/Llama-3.3-70B-Instruct-FP8 \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```
To benchmark the FP4 model, replace `--model nvidia/Llama-3.3-70B-Instruct-FP8` with `--model nvidia/Llama-3.3-70B-Instruct-FP4`.

If you want to save the results to a file, add the following options to the `benchmark_serving` invocation inside the loop (they reference the `result_dir` and `concurrency` variables defined in `bench.sh`).

```shell
    --save-result \
    --result-dir "${result_dir}" \
    --result-filename "concurrency_${concurrency}.json"
```
For more benchmarking options, see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies listed in the `bench.sh` script above.

```shell
./bench.sh
```
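If you enabled the `--save-result` options above, each concurrency level writes a JSON file under `${result_dir}`. A quick way to inspect one of them after a run (the file name below is illustrative) is:

```shell
# Pretty-print the saved metrics for the concurrency=8 run (illustrative file name)
python -m json.tool /tmp/llama3.3_output/concurrency_8.json | head -n 40
```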
Sample TensorRT-LLM serving benchmark output. Your results may vary due to ongoing software optimizations.

```
============ Serving Benchmark Result ============
Successful requests:              16
Benchmark duration (s):           17.66
Total input tokens:               16384
Total generated tokens:           16384
Request throughput (req/s):       [result]
Output token throughput (tok/s):  [result]
Total Token throughput (tok/s):   [result]
User throughput (tok/s):          [result]
---------------Time to First Token----------------
Mean TTFT (ms):                   [result]
Median TTFT (ms):                 [result]
P99 TTFT (ms):                    [result]
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                   [result]
Median TPOT (ms):                 [result]
P99 TPOT (ms):                    [result]
---------------Inter-token Latency----------------
Mean ITL (ms):                    [result]
Median ITL (ms):                  [result]
P99 ITL (ms):                     [result]
----------------End-to-end Latency----------------
Mean E2EL (ms):                   [result]
Median E2EL (ms):                 [result]
P99 E2EL (ms):                    [result]
==================================================
```
### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.

examples/models/core/llama4/Quick Start Recipe for TRT-LLM + Llama4 Scout.md renamed to docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Quick Start Recipe for Llama4 Scout 17B FP8 and NVFP4
+# Quick Start Recipe for Llama4 Scout 17B on TensorRT-LLM - Blackwell & Hopper Hardware
 
 ## Introduction
 
@@ -38,7 +38,7 @@ docker run --rm -it \
   -p 8000:8000 \
   -v ~/.cache:/root/.cache:rw \
   --name tensorrt_llm \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
+  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
   /bin/bash
 ```

@@ -177,7 +177,7 @@ These options provide finer control over performance and are set within a YAML f
 
 &emsp;**Default**: TRTLLM
 
-See the [TorchLlmArgs](https://github.com/nvidia/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L1980) class for the full list of options which can be used in the `extra_llm_api_options`.
+See the [TorchLlmArgs](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) class for the full list of options which can be used in the `extra_llm_api_options`.
 
 ## Testing API Endpoint
