Commit 7a1af1c

Cherry-pick #5947 (#5989)

Signed-off-by: Fanrong Li <[email protected]>

1 parent 0523f77, commit 7a1af1c

File tree

2 files changed: +11 −0 lines changed

tensorrt_llm/_torch/pyexecutor/model_engine.py

Lines changed: 10 additions & 0 deletions

@@ -1463,6 +1463,16 @@ def previous_seq_slots_device():
                               previous_batch_len * self.max_beam_width].copy_(
                                   new_tokens.flatten(), non_blocking=True)
 
+            if (not self._disable_overlap_scheduler
+                    and next_draft_tokens_device is None
+                    and len(extend_requests) > 0):
+                # During warmup, those generation requests have no previous tensors,
+                # so set previous_pos_id_offsets and previous_kv_lens_offsets to zeros
+                # to skip the value changes in _preprocess_inputs. Otherwise, there
+                # would be an illegal memory access when writing key/values to the KV cache.
+                self.previous_pos_id_offsets_cuda *= 0
+                self.previous_kv_lens_offsets_cuda *= 0
+
             position_ids = torch.tensor(position_ids,
                                         dtype=torch.int,
                                         pin_memory=True)
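The fix clears the stale offset buffers with an in-place multiply (`*= 0`) instead of reassigning them, so the preallocated storage keeps its address. A minimal sketch of that property, using numpy arrays as stand-ins for the torch CUDA tensors (buffer shapes here are illustrative assumptions, not taken from the source):

```python
import numpy as np

# Stand-ins for the preallocated offset buffers; the real code uses
# torch CUDA tensors, and the shapes below are made up for illustration.
previous_pos_id_offsets = np.arange(8, dtype=np.int32)
previous_kv_lens_offsets = np.arange(4, dtype=np.int32)

addr_before = previous_pos_id_offsets.ctypes.data

# In-place multiply-by-zero neutralizes stale offsets from a previous
# batch while reusing the same underlying storage (no reallocation).
previous_pos_id_offsets *= 0
previous_kv_lens_offsets *= 0

# The buffer address is unchanged and every element is now zero.
assert previous_pos_id_offsets.ctypes.data == addr_before
print(previous_pos_id_offsets.tolist())   # [0, 0, 0, 0, 0, 0, 0, 0]
print(previous_kv_lens_offsets.tolist())  # [0, 0, 0, 0]
```

Keeping the same storage matters for buffers whose addresses have been captured elsewhere (e.g. by CUDA graphs); a fresh `zeros(...)` assignment would leave the captured pointer pointing at the old, stale data.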

tests/integration/test_lists/test-db/l0_dgx_b200.yml

Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@ l0_dgx_b200:
   - accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus[pp4-fp8kv=True-attn_backend=TRTLLM-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=True]
   - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=True]
+  - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[ep4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
   - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[ep4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=True]
   - accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp2pp2-mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=True]
