[https://nvbugs/5412885][doc] Add the workaround doc for H200 OOM #6853
Conversation
Signed-off-by: Zhenhua Wang <[email protected]>
📝 Walkthrough

Updated the DeepSeek R1 deployment quick-start guide: refined troubleshooting tips with CUDA OOM guidance and an environment variable, and added an optional lm-eval-based evaluation workflow with commands, tokenizer notes, and sample results for FP8/FP4 on GSM8K.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
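The walkthrough mentions CUDA OOM guidance and an environment variable, but the variable itself is not reproduced in this thread. As a hedged illustration only (not necessarily the variable the guide recommends), a common OOM mitigation for PyTorch-based runtimes is to relax the CUDA caching allocator before the framework is imported:

```python
import os

# Hypothetical sketch: the exact variable recommended in the guide is not
# shown in this review thread. PYTORCH_CUDA_ALLOC_CONF is a standard PyTorch
# knob that reduces allocator fragmentation; it must be set before the first
# CUDA allocation (i.e., before importing the serving framework).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Refer to the updated guide for the variable the documentation change actually introduces.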
Actionable comments posted: 1
🔭 Outside diff range comments (1)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
Lines 249-255: Use the correct pip package name and pin a version for reproducibility.

The PyPI package is "lm-eval" (hyphen), not "lm_eval". Pinning helps avoid accidental regressions.

```diff
-pip install lm_eval
+pip install 'lm-eval==0.4.5'
```
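One nuance on the package-name point: pip normalizes project names per PEP 503, so `lm_eval` and `lm-eval` actually resolve to the same PyPI project; the version pin is the part that matters for reproducibility. A minimal sketch of the normalization rule (helper name is my own, not from pip's API):

```python
import re

def normalize(name: str) -> str:
    # PEP 503: runs of '-', '_', '.' collapse to a single '-', lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("lm_eval"))  # -> lm-eval
print(normalize("lm-eval"))  # -> lm-eval
```

Using the canonical hyphenated name in docs is still clearer, since it matches what PyPI displays.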
🧹 Nitpick comments (2)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (2)
Lines 261-265: Optional: cap generation length to stabilize results across harness versions.

Explicitly setting a max generation length avoids accidental defaults changing across lm-eval versions.

```diff
-lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp8.gsm8k
+lm_eval --model local-completions --tasks gsm8k --batch_size 256 \
+  --gen_kwargs temperature=0.0,add_special_tokens=False,max_gen_toks=512 \
+  --num_fewshot 5 \
+  --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False \
+  --log_samples --output_path trtllm.fp8.gsm8k
```
Lines 278-285: Mirror the lm-eval package-name/version fix in the FP4 section and consider a brief note on tuning concurrency.

The FP4 command mirrors FP8; ensure the prior pip fix is applied, and consider noting that very high concurrency may trigger rate limits/timeouts depending on server scheduling and max_batch_size.

Optionally add:

```diff
-lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp4.gsm8k
+lm_eval --model local-completions --tasks gsm8k --batch_size 256 \
+  --gen_kwargs temperature=0.0,add_special_tokens=False,max_gen_toks=512 \
+  --num_fewshot 5 \
+  --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False \
+  --log_samples --output_path trtllm.fp4.gsm8k
```

If you notice 429s/timeouts, reduce num_concurrent (e.g., 16) to match your server's effective max_batch_size and scheduling configuration.
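The concurrency advice above can be captured in a tiny helper. This is a hypothetical sketch (the function name and the idea of clamping are mine, not part of lm-eval or TensorRT-LLM): keep client-side `num_concurrent` at or below what the server can schedule in one batch, so excess requests queue at the client instead of timing out.

```python
def pick_num_concurrent(requested: int, server_max_batch_size: int) -> int:
    """Clamp lm-eval's num_concurrent to the server's effective batch capacity.

    Hypothetical helper: if the client opens more concurrent requests than the
    server's max_batch_size, the overflow sits in the server queue and can hit
    retry/timeout limits; clamping keeps backpressure on the client side.
    """
    return max(1, min(requested, server_max_batch_size))

print(pick_num_concurrent(32, 16))  # -> 16 (server is the bottleneck)
print(pick_num_concurrent(8, 64))   # -> 8  (client request honored)
```

In practice the right value also depends on scheduling policy and sequence lengths, so treat this as a starting point, not a formula.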
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1 hunks)
🔇 Additional comments (1)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
Lines 259-265: Verified local-completions args and add_special_tokens handling.

- The lm-evaluation-harness `local-completions` provider accepts `base_url`, `model`/`pretrained`, `num_concurrent`, `tokenized_requests`, `max_retries`, and other flags exactly as shown in your snippet.
- The TensorRT-LLM `/v1/completions` OpenAI-compatible server exposes an `add_special_tokens` field (default True) and forwards it through `tensorrt_llm/serve/openai_protocol.py` and the preprocessing model, so setting `add_special_tokens=False` is honored.

No changes needed; the documentation snippet is accurate.
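To make the `add_special_tokens` flow concrete, here is a sketch of the request body the lm-eval command ends up sending to the OpenAI-compatible `/v1/completions` endpoint. The `add_special_tokens` field is the TensorRT-LLM extension discussed above; the model name and prompt are illustrative placeholders, not values from the guide.

```python
import json

# Illustrative payload for the OpenAI-compatible /v1/completions endpoint.
# "add_special_tokens" is the TensorRT-LLM-specific extension field the review
# verified is forwarded via tensorrt_llm/serve/openai_protocol.py; the other
# keys are standard completions-API parameters. Values are placeholders.
payload = {
    "model": "deepseek-ai/DeepSeek-R1",        # assumption: served model name
    "prompt": "Question: 2 + 2 = ?\nAnswer:",  # placeholder few-shot prompt
    "temperature": 0.0,
    "max_tokens": 512,
    "add_special_tokens": False,               # prevents double BOS/EOS insertion
}

print(json.dumps(payload, indent=2))
```

Setting `add_special_tokens=False` matters here because lm-eval's few-shot prompt already carries the formatting the tokenizer would otherwise add.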
/bot run

PR_Github #15086 [ run ] triggered by Bot
PR_Github #15086 [ run ] completed with state
This is to address Kaiyu's offline suggestion to NVIDIA#6853. Keeping this separate from the original PR for cleanliness. Signed-off-by: Zhenhua Wang <[email protected]>