[Misc] add collect_env feat #218
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
Pull request overview
Adds a standalone collect_env.py script to collect and print Kunlun XPU environment diagnostics (system/toolchain, PyTorch, XPU driver/runtime, package versions, and selected env vars) to support debugging and issue reporting.
Changes:
- Introduces a new environment collection script with a formatted report output.
- Parses xpu-smi output to extract device model names, driver version, and XPU-RT (XRE) version.
- Collects relevant Python package versions and Kunlun-related environment variables.
| def parse_xpu_smi_output(run_lambda): | ||
| """ | ||
| Parse the complete output of xpu-smi command | ||
| [Principle Explanation] | ||
| The xpu-smi output format is similar to nvidia-smi, we need to parse it with regex. |
The regex-based parsing in parse_xpu_smi_output() is core to this feature and is easy to break if xpu-smi output format changes. Add unit tests that feed representative xpu-smi outputs into this function via a mocked run_lambda and assert driver/XRE/device parsing results.
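A minimal sketch of what such a test could look like. Everything here is an assumption made for illustration: the sample text is an invented placeholder and not real xpu-smi output, `run_lambda` is assumed to return an `(rc, stdout, stderr)` tuple, and `parse_stub` stands in for `parse_xpu_smi_output` (only the `xre_version` key is confirmed by the diff; `driver_version` is a hypothetical name).

```python
import re

# Invented placeholder text; replace with captured output from a real xpu-smi run.
SAMPLE = "Driver Version: 1.2.3    XPU-RT Version: 4.5.6\n"


def fake_run_lambda(output, rc=0):
    """Build a stand-in for run_lambda that returns a canned (rc, stdout, stderr)."""
    def _run(command):
        return rc, output, ""
    return _run


def parse_stub(run_lambda):
    """Stand-in parser; a real test would import parse_xpu_smi_output instead."""
    rc, out, _ = run_lambda("xpu-smi")
    if rc != 0:
        return None
    driver = re.search(r"Driver Version:\s*(\S+)", out)
    xre = re.search(r"XPU-RT Version:\s*(\S+)", out)
    return {
        "driver_version": driver.group(1) if driver else None,
        "xre_version": xre.group(1) if xre else None,
    }


def test_parses_versions():
    parsed = parse_stub(fake_run_lambda(SAMPLE))
    assert parsed == {"driver_version": "1.2.3", "xre_version": "4.5.6"}


def test_command_failure_returns_none():
    assert parse_stub(fake_run_lambda("", rc=1)) is None
```

Injecting the command runner keeps the test independent of whether xpu-smi is installed on the CI machine.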
| Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version, | ||
| but actual pip installed version is "0.1.0". | ||
| Now prioritize using importlib.metadata to get real installed version. |
The get_vllm_kunlun_version() docstring hardcodes an "actual pip installed version" value ("0.1.0"), which is environment-dependent and can become stale/misleading as releases change. Consider rewording this to describe the mismatch generically without pinning a specific version number.
| Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version, | |
| but actual pip installed version is "0.1.0". | |
| Now prioritize using importlib.metadata to get real installed version. | |
| Previously retrieved the version from vllm_kunlun.platforms.version, which could | |
| differ from the version of the actually installed pip package. | |
| Now this function prioritizes using importlib.metadata to report the installed version. |
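For reference, a small sketch of reading the installed version through `importlib.metadata`. The helper name `installed_version` and the distribution name in the usage comment are assumptions, not taken from the PR.

```python
from importlib import metadata


def installed_version(dist_name, fallback=None):
    """Return the version recorded by the installer, or fallback if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return fallback


# Usage (the distribution name "vllm-kunlun" is an assumption):
# print(installed_version("vllm-kunlun", fallback="Could not collect"))
```

This reports what pip actually installed, so it cannot drift from a constant kept in the source tree.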
| # Decode byte stream to string | ||
| output = raw_output.decode("utf-8").strip() | ||
| err = raw_err.decode("utf-8").strip() |
run() decodes subprocess output with decode('utf-8') without an error handler, which can raise UnicodeDecodeError on systems that emit non‑UTF8 text and abort env collection. Consider using text=True with encoding='utf-8' and errors='replace' (or passing errors='replace' to decode).
| # Decode byte stream to string | |
| output = raw_output.decode("utf-8").strip() | |
| err = raw_err.decode("utf-8").strip() | |
| # Decode byte stream to string; be tolerant of non-UTF8 bytes | |
| output = raw_output.decode("utf-8", errors="replace").strip() | |
| err = raw_err.decode("utf-8", errors="replace").strip() |
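A short illustration of the difference between strict and tolerant decoding. The helper name `decode_output` and the sample bytes are made up for the example.

```python
def decode_output(raw):
    """Decode subprocess bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace").strip()


raw = b"driver \xff ok\n"  # 0xFF is never valid in UTF-8

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

tolerant = decode_output(raw)
```

Strict decoding raises on the invalid byte, while the tolerant call keeps the rest of the text readable, which is usually the better trade-off for a diagnostics report.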
| try: | ||
| # Try getting from torch_xmlir | ||
| import torch_xmlir | ||
|
|
get_bkcl_version() only catches ImportError around import torch_xmlir, but importing that module can also fail with OSError/RuntimeError (e.g., missing shared libraries) and would crash the entire script. Catch a broader Exception here and continue to the other BKCL detection methods.
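One way to sketch the broader guard, assuming a hypothetical helper `try_import` that is not part of the PR:

```python
import importlib


def try_import(module_name):
    """Import a module, returning None on any failure instead of raising."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Covers ImportError as well as OSError/RuntimeError raised while
        # the module loads native libraries at import time.
        return None


# Usage: fall through to the next detection method when the import fails.
# torch_xmlir = try_import("torch_xmlir")
# if torch_xmlir is None:
#     ...try the other BKCL detection methods...
```

Catching `Exception` is normally too broad, but for a best-effort diagnostics script a failed probe should degrade to "not detected" and not abort the whole report.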
| kunlun_xpu_info=get_kunlun_gpu_info(run_lambda), | ||
| kunlun_driver_version=get_kunlun_driver_version(run_lambda), | ||
| kunlun_xre_version=get_kunlun_xre_version(run_lambda), | ||
| bkcl_version=get_bkcl_version(run_lambda), | ||
| kunlun_topo=get_kunlun_topo(run_lambda), |
parse_xpu_smi_output() is invoked separately by get_kunlun_gpu_info(), get_kunlun_driver_version(), and get_kunlun_xre_version(), which results in running xpu-smi multiple times during a single collection. Consider calling parse_xpu_smi_output() once in get_kunlun_env_info() and passing the parsed result (or caching it) to avoid redundant command executions and improve reliability.
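A rough sketch of the "parse once, pass the result along" approach. The function `collect_xpu_fields` is hypothetical, the parser is injected as an argument so the snippet stays self-contained, and only the `xre_version` key is confirmed by the diff (`devices` and `driver_version` are assumed names).

```python
def collect_xpu_fields(run_lambda, parse):
    """Run the parser a single time and derive every xpu-smi field from it."""
    parsed = parse(run_lambda) or {}
    return {
        "kunlun_xpu_info": parsed.get("devices"),
        "kunlun_driver_version": parsed.get("driver_version"),
        "kunlun_xre_version": parsed.get("xre_version"),
    }


# Demonstration with a counting fake parser.
calls = []


def fake_parse(run_lambda):
    calls.append(1)
    return {"devices": ["dev0"], "driver_version": "1.0", "xre_version": "2.0"}


fields = collect_xpu_fields(lambda cmd: (0, "", ""), fake_parse)
```

Besides saving two command executions, a single call guarantees that all three fields come from the same snapshot of the xpu-smi output.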
| parsed = parse_xpu_smi_output(run_lambda) | ||
| if parsed and parsed["xre_version"]: | ||
| return parsed["xre_version"] | ||
| return "N/A (not installed or not detected)" |
get_kunlun_xre_version() returns "N/A (not installed or not detected)" whenever parsing fails, including when xpu-smi can't be executed or returns an error. This can misreport "not installed" when the real issue is "could not collect"; consider returning None (so the caller prints "Could not collect") or including the actual command failure reason.
| return "N/A (not installed or not detected)" | |
| return None |
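If distinguishing the two cases is useful, one possible shape is sketched below. The helper names and the convention that a failed command yields `parsed is None` are assumptions for illustration only.

```python
def xre_version_from(parsed):
    """Map the parsed xpu-smi result to a reportable XRE version."""
    if parsed is None:
        # The command could not be run or parsed: signal "unknown".
        return None
    # The command worked but reported no XRE version.
    return parsed.get("xre_version") or "N/A (not detected)"


def render(value):
    """Format a field the way the report prints missing data."""
    return "Could not collect" if value is None else value
```

With this split, `render(xre_version_from(None))` yields "Could not collect", while an empty but successful parse still reports "N/A (not detected)".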
| secret_terms = ("secret", "token", "api", "access", "password") | ||
|
|
||
| for k, v in sorted(os.environ.items()): | ||
| if any(term in k.lower() for term in secret_terms): | ||
| continue |
The script prints environment variable values (e.g., TORCH_*, VLLM_*) into the report. Even with the current substring filter, this can still leak sensitive data in bug reports if a secret variable doesn’t match the filtered terms. Consider masking values by default (e.g., show only keys or redact values) and/or requiring explicit opt-in to print env var values.
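A minimal sketch of redaction by default, assuming a hypothetical helper `redact_env` and an opt-in flag `show_values`; the prefix tuple is illustrative and not the script's actual filter.

```python
def redact_env(environ, prefixes=("TORCH", "VLLM"), show_values=False):
    """Return matching env vars with values hidden unless explicitly requested."""
    report = {}
    for key in sorted(environ):
        if not key.startswith(prefixes):
            continue
        report[key] = environ[key] if show_values else "<redacted>"
    return report


sample_env = {"VLLM_API_KEY": "s3cr3t", "TORCH_HOME": "/cache", "HOME": "/root"}
masked = redact_env(sample_env)
shown = redact_env(sample_env, show_values=True)
```

Keys alone are often enough to spot a misconfiguration, and users who need the values can opt in before pasting the report into a public issue.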
I think that is a good job! @Lidang-Jiang cc @liwei109 @baoqian426
PR Description
This PR introduces a collect_env feature to help users quickly gather and export runtime/environment information for vLLM-Kunlun.
It is intended to improve debugging efficiency and make issue/bug reports more actionable by providing a standardized environment summary.
What’s included
collect_env)
Why this change
When users report failures or performance issues, environment differences are often the root cause. A built-in collect_env makes it easier to reproduce and diagnose problems.
How to use (example)
collect_env entry (CLI/script) to print/save environment info
Example
vllm-project/vllm#32915
Checklist (Required)
Before submitting this PR, please ensure that all the following items are completed:
- pre-commit checks.
- git commit -s.
PR Type
Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:
- [Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
- [Bugfix] – Bug fixes
- [CI/Build] – CI, build system, or infrastructure improvements
- [Doc] – Documentation updates or fixes
- [Misc] – Other changes that do not fit the above categories (use sparingly)
Detailed Checklist
Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.
1. Code Quality
- pre-commit).
2. Testing
3. DCO Compliance
This project follows the Developer Certificate of Origin (DCO).
- Signed-off-by: line.
- git commit -s to automatically add the sign-off.
4. Review Expectations
During the review process, maintainers may:
We appreciate your patience and collaboration throughout the review process!