[Misc] add collect_env feat by Lidang-Jiang · Pull Request #218 · baidu/vLLM-Kunlun

Lidang-Jiang · 2026-02-25T09:44:01Z

PR Description

This PR introduces a collect_env feature to help users quickly gather and export runtime/environment information for vLLM-Kunlun.
It is intended to improve debugging efficiency and make issue/bug reports more actionable by providing a standardized environment summary.

What’s included

Add environment collection capability (collect_env)
Collect key software/hardware/runtime metadata (e.g., Python/package versions, OS info, accelerator/driver/runtime details) and format it for easy sharing
Provide a unified output to paste into issues / logs for faster triage

Why this change

When users report failures or performance issues, environment differences are often the root cause. A built-in collect_env makes it easier to reproduce and diagnose problems.

How to use (example)

Run the collect_env entry (CLI/script) to print/save environment info
Attach the output when filing an issue or asking for support

Example

Collecting Kunlun XPU environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
Clang version                : 10.0.0-4ubuntu1 
CMake version                : version 4.0.3
Libc version                 : glibc-2.31

==============================
       PyTorch Info
==============================
PyTorch version              : 2.5.1+cu118
Is debug build               : False

==============================
      Python Environment
==============================
Python version               : 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.10.0-1.0.0.39-x86_64-with-glibc2.31

==============================
    Kunlun / XPU Info
==============================
XPU models and configuration : 
XPU 0: P800 OAM (96.0GB)
XPU 1: P800 OAM (96.0GB)
XPU 2: P800 OAM (96.0GB)
XPU 3: P800 OAM (96.0GB)
XPU 4: P800 OAM (96.0GB)
XPU 5: P800 OAM (96.0GB)
XPU 6: P800 OAM (96.0GB)
XPU 7: P800 OAM (96.0GB)
Kunlun driver version        : 515.58
XRE (Runtime) version        : N/A (not installed or not detected)
BKCL version                 : Found at: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/libbkcl.so
...
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] torch==2.5.1+cu118
[pip3] torch_plugin==0.1.0
[pip3] torch_xray==2.0.2
[pip3] torchaudio==2.5.1+cu118
[pip3] torchvision==0.20.1+cu118
[pip3] transformers==4.57.0
[pip3] triton==3.0.0+b2cde523
[pip3] vllm==0.11.0
[pip3] vllm-kunlun==0.11.0
[pip3] xmlir==1.0.0.1
[pip3] xtorch_ops==0.1.2109+523cb26d
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] torch                     2.5.1+cu118              pypi_0    pypi
[conda] torch-plugin              0.1.0                    pypi_0    pypi
[conda] torch-xray                2.0.2                    pypi_0    pypi
[conda] torchaudio                2.5.1+cu118              pypi_0    pypi
[conda] torchvision               0.20.1+cu118             pypi_0    pypi
[conda] transformers              4.57.0                   pypi_0    pypi
[conda] triton                    3.0.0+b2cde523           pypi_0    pypi
[conda] vllm-kunlun               0.11.0                   pypi_0    pypi
[conda] xmlir                     1.0.0.1                  pypi_0    pypi
[conda] xtorch-ops                0.1.2109+523cb26d          pypi_0    pypi

==============================
      vLLM-Kunlun Info
==============================
vLLM Version                 : 0.11.0
vLLM-Kunlun Version          : 0.11.0

==============================
     Environment Variables
==============================
XPU_FORCE_SHARED_DEVICE_CONTEXT=1

...

The report is meant to be pasted into bug reports, as in the vLLM issue template:

Your current environment
The output of python collect_env.py
🐛 Describe the bug

vllm-project/vllm#32915

Checklist (Required)

Before submitting this PR, please ensure that all the following items are completed:

All code changes pass the pre-commit checks.
Commits are signed off using git commit -s.
The PR title is properly classified (see below).

PR Type

Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:

[Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
[Bugfix] – Bug fixes
[CI/Build] – CI, build system, or infrastructure improvements
[Doc] – Documentation updates or fixes
[Misc] – Other changes that do not fit the above categories (use sparingly)

Note: If the PR spans multiple categories, include all relevant prefixes.

Detailed Checklist

Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.

1. Code Quality

All linting and formatting checks pass (pre-commit).
The code is well-structured and sufficiently documented.
The change is designed with maintainability and readability in mind.

2. Testing

Relevant unit tests are added or updated.
Integration tests are included when applicable.
Existing tests continue to pass.

3. DCO Compliance

This project follows the Developer Certificate of Origin (DCO).

All commits include a Signed-off-by: line.
Use git commit -s to automatically add the sign-off.

4. Review Expectations

During the review process, maintainers may:

Request code refactoring or additional tests.
Ask for clarifications on design decisions.
Suggest performance, stability, or maintainability improvements.

We appreciate your patience and collaboration throughout the review process!

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

Copilot

Pull request overview

Adds a standalone collect_env.py script to collect and print Kunlun XPU environment diagnostics (system/toolchain, PyTorch, XPU driver/runtime, package versions, and selected env vars) to support debugging and issue reporting.

Changes:

Introduces a new environment collection script with a formatted report output.
Parses xpu-smi output to extract device model names, driver version, and XPU-RT (XRE) version.
Collects relevant Python package versions and Kunlun-related environment variables.


Copilot · 2026-02-25T10:15:33Z

+def parse_xpu_smi_output(run_lambda):
+    """
+    Parse the complete output of xpu-smi command
+    [Principle Explanation]
+    The xpu-smi output format is similar to nvidia-smi, we need to parse it with regex.


The regex-based parsing in parse_xpu_smi_output() is core to this feature and is easy to break if xpu-smi output format changes. Add unit tests that feed representative xpu-smi outputs into this function via a mocked run_lambda and assert driver/XRE/device parsing results.
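
The kind of test this comment asks for could look roughly like the sketch below. Everything in it is illustrative: `toy_parse_smi` is a hypothetical stand-in for `parse_xpu_smi_output()`, the sample text is invented and is not real xpu-smi output, and `run_lambda` is assumed to take a command and return a `(return_code, stdout, stderr)` tuple.

```python
import re

# Invented placeholder text. This is NOT real xpu-smi output; substitute
# captured output from real hardware when writing the actual tests.
FAKE_SMI_TEXT = """\
Driver Version: 1.2.3
Runtime Version: 4.5.6
"""


def toy_parse_smi(run_lambda):
    """Hypothetical stand-in for parse_xpu_smi_output()."""
    return_code, out, _err = run_lambda("xpu-smi")
    if return_code != 0:
        return None
    driver = re.search(r"Driver Version:\s*(\S+)", out)
    runtime = re.search(r"Runtime Version:\s*(\S+)", out)
    return {
        "driver_version": driver.group(1) if driver else None,
        "xre_version": runtime.group(1) if runtime else None,
    }


def fake_run_ok(command):
    # Mocked run_lambda: the command "succeeds" with the placeholder text.
    return 0, FAKE_SMI_TEXT, ""


def fake_run_missing(command):
    # Mocked run_lambda: the command is not available.
    return 127, "", "xpu-smi: command not found"


def test_parses_versions():
    parsed = toy_parse_smi(fake_run_ok)
    assert parsed == {"driver_version": "1.2.3", "xre_version": "4.5.6"}


def test_command_failure_returns_none():
    assert toy_parse_smi(fake_run_missing) is None


def test_unrecognized_format_yields_none_fields():
    parsed = toy_parse_smi(lambda command: (0, "unexpected layout", ""))
    assert parsed == {"driver_version": None, "xre_version": None}
```

Because the command runner is injected, the tests need no Kunlun hardware; a format change in the tool would show up as a failing fixture rather than a silent "N/A" in user reports.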

Copilot · 2026-02-25T10:15:34Z

+    Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version,
+    but actual pip installed version is "0.1.0".
+    Now prioritize using importlib.metadata to get real installed version.


The get_vllm_kunlun_version() docstring hardcodes an "actual pip installed version" value ("0.1.0"), which is environment-dependent and can become stale/misleading as releases change. Consider rewording this to describe the mismatch generically without pinning a specific version number.

Suggested change

-    Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version,
-    but actual pip installed version is "0.1.0".
-    Now prioritize using importlib.metadata to get real installed version.
+    Previously retrieved the version from vllm_kunlun.platforms.version, which could
+    differ from the version of the actually installed pip package.
+    Now this function prioritizes using importlib.metadata to report the installed version.
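
For context, looking up an installed distribution's version through `importlib.metadata` can be sketched as follows. The helper name `installed_version` and its `fallback` parameter are assumptions made for illustration; they are not taken from the PR.

```python
from importlib import metadata


def installed_version(dist_name, fallback=None):
    """Return the installed version of a distribution, or `fallback`.

    `installed_version` is a hypothetical helper name, not the PR's function.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return fallback


# Example call; the result depends on what is installed locally.
print(installed_version("vllm-kunlun", fallback="not installed"))
```

A caller could pass a module-level version constant as `fallback`, which keeps the pip metadata as the primary source and the hardcoded value as a last resort.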

Copilot · 2026-02-25T10:15:34Z

+        # Decode byte stream to string
+        output = raw_output.decode("utf-8").strip()
+        err = raw_err.decode("utf-8").strip()


run() decodes subprocess output with decode('utf-8') without an error handler, which can raise UnicodeDecodeError on systems that emit non‑UTF8 text and abort env collection. Consider using text=True with encoding='utf-8' and errors='replace' (or passing errors='replace' to decode).

Suggested change

-        # Decode byte stream to string
-        output = raw_output.decode("utf-8").strip()
-        err = raw_err.decode("utf-8").strip()
+        # Decode byte stream to string; be tolerant of non-UTF8 bytes
+        output = raw_output.decode("utf-8", errors="replace").strip()
+        err = raw_err.decode("utf-8", errors="replace").strip()
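
A self-contained sketch of a command runner with tolerant decoding is shown below. It is not the PR's `run()`; the `(return_code, stdout, stderr)` return shape and the `decode_lossy` helper are assumptions for illustration.

```python
import subprocess


def decode_lossy(raw):
    """Decode bytes as UTF-8, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace").strip()


def run(command):
    """Run a command and return (return_code, stdout, stderr) as text."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        # A string is passed to the shell; a list is executed directly.
        shell=isinstance(command, str),
    )
    return proc.returncode, decode_lossy(proc.stdout), decode_lossy(proc.stderr)
```

With `errors="replace"`, a stray non-UTF-8 byte becomes a replacement character in the report instead of raising `UnicodeDecodeError` and aborting the whole collection.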

Copilot · 2026-02-25T10:15:35Z

+    try:
+        # Try getting from torch_xmlir
+        import torch_xmlir
+


get_bkcl_version() only catches ImportError around import torch_xmlir, but importing that module can also fail with OSError/RuntimeError (e.g., missing shared libraries) and would crash the entire script. Catch a broader Exception here and continue to the other BKCL detection methods.
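
One way to make an optional import non-fatal is sketched below. The helper name `try_import` is hypothetical; the point is only that the `except` clause is wider than `ImportError`.

```python
import importlib


def try_import(module_name):
    """Import a module by name, returning None if the import fails for any reason."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Covers ImportError as well as OSError/RuntimeError raised while the
        # module loads native libraries at import time.
        return None


torch_xmlir = try_import("torch_xmlir")
if torch_xmlir is None:
    # Continue with the other detection methods instead of crashing.
    pass
```

Catching `Exception` this broadly is usually discouraged, but in a diagnostics script the goal is to always produce a report, so a failed probe is better recorded as missing than allowed to abort the run.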

Copilot · 2026-02-25T10:15:35Z

+        kunlun_xpu_info=get_kunlun_gpu_info(run_lambda),
+        kunlun_driver_version=get_kunlun_driver_version(run_lambda),
+        kunlun_xre_version=get_kunlun_xre_version(run_lambda),
+        bkcl_version=get_bkcl_version(run_lambda),
+        kunlun_topo=get_kunlun_topo(run_lambda),


parse_xpu_smi_output() is invoked separately by get_kunlun_gpu_info(), get_kunlun_driver_version(), and get_kunlun_xre_version(), which results in running xpu-smi multiple times during a single collection. Consider calling parse_xpu_smi_output() once in get_kunlun_env_info() and passing the parsed result (or caching it) to avoid redundant command executions and improve reliability.
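
The "parse once, pass the result around" idea can be sketched as follows. The function names, the dictionary keys, and the `key: value` parsing are all placeholders chosen for illustration and do not reflect the real xpu-smi format or the PR's code.

```python
def parse_once(run_lambda):
    """Run the tool a single time and return parsed fields, or None on failure."""
    return_code, out, _err = run_lambda("xpu-smi")
    if return_code != 0:
        return None
    fields = {}
    # Placeholder parsing of "key: value" lines, not the real output format.
    for line in out.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def collect_env_info(run_lambda):
    """Derive every field from one shared parse result."""
    parsed = parse_once(run_lambda)
    return {
        "driver_version": parsed.get("driver") if parsed else None,
        "xre_version": parsed.get("xre") if parsed else None,
    }


class CountingRunner:
    """Fake run_lambda that records how many times it is invoked."""

    def __init__(self, output):
        self.output = output
        self.calls = 0

    def __call__(self, command):
        self.calls += 1
        return 0, self.output, ""


runner = CountingRunner("driver: 1.0\nxre: 2.0")
info = collect_env_info(runner)
print(runner.calls, info)
```

Besides saving subprocess launches, a single invocation ensures that all reported fields come from the same snapshot of the tool's output.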

Copilot · 2026-02-25T10:15:35Z

+    parsed = parse_xpu_smi_output(run_lambda)
+    if parsed and parsed["xre_version"]:
+        return parsed["xre_version"]
+    return "N/A (not installed or not detected)"


get_kunlun_xre_version() returns "N/A (not installed or not detected)" whenever parsing fails, including when xpu-smi can't be executed or returns an error. This can misreport "not installed" when the real issue is "could not collect"; consider returning None (so the caller prints "Could not collect") or including the actual command failure reason.

Suggested change

-    return "N/A (not installed or not detected)"
+    return None
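
A variation on this suggestion keeps the two failure modes apart: `None` when the command itself failed, and an explicit "not detected" string when the command ran but reported no version. The sketch below assumes a parsed result that is either `None` or a dictionary; the function names and messages are illustrative, not the PR's.

```python
def xre_version_from(parsed):
    """Return the XRE version, a 'not detected' marker, or None if collection failed."""
    if parsed is None:
        # The tool could not be run, so nothing is known about the runtime.
        return None
    return parsed.get("xre_version") or "N/A (not detected)"


def display(value):
    """Render a collected value the way a report line might."""
    return "Could not collect" if value is None else value


print(display(xre_version_from(None)))
print(display(xre_version_from({"xre_version": None})))
print(display(xre_version_from({"xre_version": "4.5.6"})))
```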

Copilot · 2026-02-25T10:15:35Z

+    secret_terms = ("secret", "token", "api", "access", "password")
+
+    for k, v in sorted(os.environ.items()):
+        if any(term in k.lower() for term in secret_terms):
+            continue


The script prints environment variable values (e.g., TORCH_*, VLLM_*) into the report. Even with the current substring filter, this can still leak sensitive data in bug reports if a secret variable doesn’t match the filtered terms. Consider masking values by default (e.g., show only keys or redact values) and/or requiring explicit opt-in to print env var values.
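
Masking values by default, with an explicit opt-in to show them, might look like the sketch below. The function name, the `show_values` flag, the prefix list, and the `<redacted>` marker are assumptions for illustration only.

```python
import os


def report_env_vars(environ, prefixes=("TORCH", "VLLM", "XPU"), show_values=False):
    """List matching environment variables, hiding values unless opted in."""
    lines = []
    for key in sorted(environ):
        if not key.startswith(prefixes):
            continue
        # Values are hidden by default so secrets cannot leak into bug reports.
        value = environ[key] if show_values else "<redacted>"
        lines.append(f"{key}={value}")
    return lines


for line in report_env_vars(os.environ):
    print(line)
```

Showing only the keys still tells a maintainer which settings are in play, and a user who knows the values are harmless can rerun with `show_values=True`.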

xyDong0223 · 2026-02-26T07:18:32Z

I think that is a good job! @Lidang-Jiang cc @liwei109 @baoqian426

[Misc] add collect_env feat

d1c9292

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

xyDong0223 requested a review from Copilot February 25, 2026 10:09

Copilot started reviewing on behalf of xyDong0223 February 25, 2026 10:09

Copilot AI reviewed Feb 25, 2026


xyDong0223 assigned baoqian426 and liwei109 Feb 26, 2026

xyDong0223 requested review from baoqian426 and liwei109 February 26, 2026 07:18

xyDong0223 added the documentation label (Improvements or additions to documentation) Feb 26, 2026

xyDong0223 approved these changes Feb 27, 2026


xyDong0223 merged commit 153093d into baidu:main Feb 27, 2026
7 checks passed

xyDong0223 merged 1 commit into baidu:main from Lidang-Jiang:main
