
[Misc] add collect_env feat #218

Merged
xyDong0223 merged 1 commit into baidu:main from Lidang-Jiang:main on Feb 27, 2026

Lidang-Jiang (Contributor) commented on Feb 25, 2026

PR Description

This PR introduces a collect_env feature to help users quickly gather and export runtime/environment information for vLLM-Kunlun.
It is intended to improve debugging efficiency and make issue/bug reports more actionable by providing a standardized environment summary.

What’s included

  • Add environment collection capability (collect_env)
  • Collect key software/hardware/runtime metadata (e.g., Python/package versions, OS info, accelerator/driver/runtime details) and format it for easy sharing
  • Provide a unified output to paste into issues / logs for faster triage

Why this change

When users report failures or performance issues, environment differences are often the root cause. A built-in collect_env makes it easier to reproduce and diagnose problems.

How to use (example)

  • Run the collect_env entry (CLI/script) to print/save environment info
  • Attach the output when filing an issue or asking for support

Example

Collecting Kunlun XPU environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
Clang version                : 10.0.0-4ubuntu1 
CMake version                : version 4.0.3
Libc version                 : glibc-2.31

==============================
       PyTorch Info
==============================
PyTorch version              : 2.5.1+cu118
Is debug build               : False

==============================
      Python Environment
==============================
Python version               : 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.10.0-1.0.0.39-x86_64-with-glibc2.31

==============================
    Kunlun / XPU Info
==============================
XPU models and configuration : 
XPU 0: P800 OAM (96.0GB)
XPU 1: P800 OAM (96.0GB)
XPU 2: P800 OAM (96.0GB)
XPU 3: P800 OAM (96.0GB)
XPU 4: P800 OAM (96.0GB)
XPU 5: P800 OAM (96.0GB)
XPU 6: P800 OAM (96.0GB)
XPU 7: P800 OAM (96.0GB)
Kunlun driver version        : 515.58
XRE (Runtime) version        : N/A (not installed or not detected)
BKCL version                 : Found at: /root/miniconda/envs/python310_torch25_cuda/lib/python3.10/site-packages/torch_xmlir/libbkcl.so
...
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] torch==2.5.1+cu118
[pip3] torch_plugin==0.1.0
[pip3] torch_xray==2.0.2
[pip3] torchaudio==2.5.1+cu118
[pip3] torchvision==0.20.1+cu118
[pip3] transformers==4.57.0
[pip3] triton==3.0.0+b2cde523
[pip3] vllm==0.11.0
[pip3] vllm-kunlun==0.11.0
[pip3] xmlir==1.0.0.1
[pip3] xtorch_ops==0.1.2109+523cb26d
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] torch                     2.5.1+cu118              pypi_0    pypi
[conda] torch-plugin              0.1.0                    pypi_0    pypi
[conda] torch-xray                2.0.2                    pypi_0    pypi
[conda] torchaudio                2.5.1+cu118              pypi_0    pypi
[conda] torchvision               0.20.1+cu118             pypi_0    pypi
[conda] transformers              4.57.0                   pypi_0    pypi
[conda] triton                    3.0.0+b2cde523           pypi_0    pypi
[conda] vllm-kunlun               0.11.0                   pypi_0    pypi
[conda] xmlir                     1.0.0.1                  pypi_0    pypi
[conda] xtorch-ops                0.1.2109+523cb26d          pypi_0    pypi

==============================
      vLLM-Kunlun Info
==============================
vLLM Version                 : 0.11.0
vLLM-Kunlun Version          : 0.11.0

==============================
     Environment Variables
==============================
XPU_FORCE_SHARED_DEVICE_CONTEXT=1

...

The output can be pasted into an issue in the same way as in vLLM issues, for example:

Your current environment
The output of python collect_env.py
🐛 Describe the bug

vllm-project/vllm#32915

Checklist (Required)

Before submitting this PR, please ensure that all the following items are completed:

  • All code changes pass the pre-commit checks.
  • Commits are signed off using git commit -s.
  • The PR title is properly classified (see below).

PR Type

Please prefix the PR title with one or more of the following labels to help reviewers quickly understand the nature of the change:

  • [Feature] – New features or enhancements (e.g. Attention, Communicator, Kernel, Worker, etc.)
  • [Bugfix] – Bug fixes
  • [CI/Build] – CI, build system, or infrastructure improvements
  • [Doc] – Documentation updates or fixes
  • [Misc] – Other changes that do not fit the above categories (use sparingly)

Note: If the PR spans multiple categories, include all relevant prefixes.


Detailed Checklist (Click to Expand)

Thank you for contributing to vLLM Kunlun! To help us maintain high code quality and streamline the review process, please ensure your PR meets the following requirements.

1. Code Quality

  • All linting and formatting checks pass (pre-commit).
  • The code is well-structured and sufficiently documented.
  • The change is designed with maintainability and readability in mind.

2. Testing

  • Relevant unit tests are added or updated.
  • Integration tests are included when applicable.
  • Existing tests continue to pass.

3. DCO Compliance

This project follows the Developer Certificate of Origin (DCO).

  • All commits include a Signed-off-by: line.
  • Use git commit -s to automatically add the sign-off.

4. Review Expectations

During the review process, maintainers may:

  • Request code refactoring or additional tests.
  • Ask for clarifications on design decisions.
  • Suggest performance, stability, or maintainability improvements.

We appreciate your patience and collaboration throughout the review process!

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

Copilot AI left a comment

Pull request overview

Adds a standalone collect_env.py script to collect and print Kunlun XPU environment diagnostics (system/toolchain, PyTorch, XPU driver/runtime, package versions, and selected env vars) to support debugging and issue reporting.

Changes:

  • Introduces a new environment collection script with a formatted report output.
  • Parses xpu-smi output to extract device model names, driver version, and XPU-RT (XRE) version.
  • Collects relevant Python package versions and Kunlun-related environment variables.

Comment thread collect_env.py
Comment on lines +220 to +224
def parse_xpu_smi_output(run_lambda):
    """
    Parse the complete output of xpu-smi command
    [Principle Explanation]
    The xpu-smi output format is similar to nvidia-smi, we need to parse it with regex.

Copilot AI Feb 25, 2026

The regex-based parsing in parse_xpu_smi_output() is core to this feature and is easy to break if xpu-smi output format changes. Add unit tests that feed representative xpu-smi outputs into this function via a mocked run_lambda and assert driver/XRE/device parsing results.
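
A test along these lines could look like the sketch below. The sample text, the regexes, and the stand-in parser are hypothetical, because neither the real xpu-smi layout nor the body of parse_xpu_smi_output() is shown in this thread. The run_lambda return shape (return code, stdout, stderr) and the driver_version and devices keys are also assumptions; only the xre_version key appears in the excerpts.

```python
import re

# Hypothetical xpu-smi excerpt; the real layout and version strings may differ.
SAMPLE_OUTPUT = (
    "Driver Version: 515.58    XPU-RT Version: 5.0.21\n"
    "|   0  P800 OAM    |\n"
    "|   1  P800 OAM    |\n"
)


def parse_xpu_smi_output(run_lambda):
    """Stand-in parser, present only so the tests below are runnable."""
    rc, out, _err = run_lambda("xpu-smi")
    if rc != 0 or not out:
        return None
    driver = re.search(r"Driver Version:\s*(\S+)", out)
    xre = re.search(r"XPU-RT Version:\s*(\S+)", out)
    devices = re.findall(r"^\|\s*\d+\s+(\S+ \S+)", out, flags=re.MULTILINE)
    return {
        "driver_version": driver.group(1) if driver else None,
        "xre_version": xre.group(1) if xre else None,
        "devices": devices,
    }


def fake_run(rc, out, err=""):
    """Build a mocked run_lambda that ignores the command and returns canned output."""
    return lambda command: (rc, out, err)


def test_parses_versions_and_devices():
    parsed = parse_xpu_smi_output(fake_run(0, SAMPLE_OUTPUT))
    assert parsed["driver_version"] == "515.58"
    assert parsed["xre_version"] == "5.0.21"
    assert parsed["devices"] == ["P800 OAM", "P800 OAM"]


def test_command_failure_returns_none():
    assert parse_xpu_smi_output(fake_run(127, "", "xpu-smi: not found")) is None


def test_missing_xre_is_reported_as_none():
    parsed = parse_xpu_smi_output(fake_run(0, "Driver Version: 515.58\n"))
    assert parsed["xre_version"] is None
```

In the real test file, SAMPLE_OUTPUT would be replaced with captured xpu-smi output from a supported driver release, so that a format change shows up as a failing assertion.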

Comment thread collect_env.py
Comment on lines +439 to +441
Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version,
but actual pip installed version is "0.1.0".
Now prioritize using importlib.metadata to get real installed version.

Copilot AI Feb 25, 2026

The get_vllm_kunlun_version() docstring hardcodes an "actual pip installed version" value ("0.1.0"), which is environment-dependent and can become stale/misleading as releases change. Consider rewording this to describe the mismatch generically without pinning a specific version number.

Suggested change
- Previously got hardcoded version "0.9.2" from vllm_kunlun.platforms.version,
- but actual pip installed version is "0.1.0".
- Now prioritize using importlib.metadata to get real installed version.
+ Previously retrieved the version from vllm_kunlun.platforms.version, which could
+ differ from the version of the actually installed pip package.
+ Now this function prioritizes using importlib.metadata to report the installed version.
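
For context, the lookup order described in that docstring can be sketched as follows. The helper name get_installed_version and its fallback to a module-level __version__ attribute are illustrative assumptions, not the code in this PR.

```python
import importlib
from importlib import metadata


def get_installed_version(dist_name, fallback_module=None):
    """Return the installed distribution version, or None if it cannot be found."""
    try:
        # Reads the version recorded by pip for the installed distribution.
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass
    if fallback_module is None:
        return None
    try:
        # Assumed fallback: a __version__ attribute on the importable module.
        module = importlib.import_module(fallback_module)
    except Exception:
        return None
    return getattr(module, "__version__", None)
```

A call such as get_installed_version("vllm-kunlun", "vllm_kunlun") would then report whatever pip recorded, and only fall back to the module attribute when no distribution metadata exists.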

Comment thread collect_env.py
Comment on lines +44 to +46
# Decode byte stream to string
output = raw_output.decode("utf-8").strip()
err = raw_err.decode("utf-8").strip()

Copilot AI Feb 25, 2026

run() decodes subprocess output with decode('utf-8') without an error handler, which can raise UnicodeDecodeError on systems that emit non‑UTF8 text and abort env collection. Consider using text=True with encoding='utf-8' and errors='replace' (or passing errors='replace' to decode).

Suggested change
- # Decode byte stream to string
- output = raw_output.decode("utf-8").strip()
- err = raw_err.decode("utf-8").strip()
+ # Decode byte stream to string; be tolerant of non-UTF8 bytes
+ output = raw_output.decode("utf-8", errors="replace").strip()
+ err = raw_err.decode("utf-8", errors="replace").strip()
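
The effect of errors="replace" can be seen in a small self-contained sketch. The run() and decode_output() shown here are simplified stand-ins written for illustration; the signature and return shape of the script's real run() are assumptions.

```python
import subprocess
import sys


def decode_output(raw):
    """Decode bytes as UTF-8, replacing undecodable bytes with U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace").strip()


def run(command):
    """Run a command (argument list) and return (returncode, stdout, stderr) as text."""
    proc = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )
    return proc.returncode, decode_output(proc.stdout), decode_output(proc.stderr)


def demo():
    """Run a child process that writes an invalid UTF-8 byte (0xFF) to stdout."""
    code = "import sys; sys.stdout.buffer.write(b'ok \\xff')"
    return run([sys.executable, "-c", code])
```

With a strict decode the 0xFF byte would raise UnicodeDecodeError and abort the whole report; with the lenient decode the collection continues and the bad byte is visible as a replacement character.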

Comment thread collect_env.py
Comment on lines +409 to +412
try:
    # Try getting from torch_xmlir
    import torch_xmlir

Copilot AI Feb 25, 2026

get_bkcl_version() only catches ImportError around import torch_xmlir, but importing that module can also fail with OSError/RuntimeError (e.g., missing shared libraries) and would crash the entire script. Catch a broader Exception here and continue to the other BKCL detection methods.
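
One way to express the broader guard is a small import helper, sketched below. The name try_import and its (module, error) return shape are hypothetical and not part of the PR.

```python
import importlib


def try_import(module_name):
    """Import a module by name; return (module, None) or (None, reason) on any failure."""
    try:
        return importlib.import_module(module_name), None
    except Exception as exc:
        # Covers ImportError as well as OSError/RuntimeError raised while the
        # module loads native libraries at import time.
        return None, f"{type(exc).__name__}: {exc}"
```

A caller could then write `module, reason = try_import("torch_xmlir")` and move on to the next BKCL detection method when module is None, optionally including the reason in the report.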

Comment thread collect_env.py
Comment on lines +587 to +591
kunlun_xpu_info=get_kunlun_gpu_info(run_lambda),
kunlun_driver_version=get_kunlun_driver_version(run_lambda),
kunlun_xre_version=get_kunlun_xre_version(run_lambda),
bkcl_version=get_bkcl_version(run_lambda),
kunlun_topo=get_kunlun_topo(run_lambda),

Copilot AI Feb 25, 2026

parse_xpu_smi_output() is invoked separately by get_kunlun_gpu_info(), get_kunlun_driver_version(), and get_kunlun_xre_version(), which results in running xpu-smi multiple times during a single collection. Consider calling parse_xpu_smi_output() once in get_kunlun_env_info() and passing the parsed result (or caching it) to avoid redundant command executions and improve reliability.
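
A minimal sketch of the parse-once approach, assuming the parser returns a dict or None. The helper name collect_xpu_fields and the devices and driver_version keys are illustrative assumptions; the output field names are taken from the excerpt above.

```python
def collect_xpu_fields(run_lambda, parse):
    """Invoke the xpu-smi parser once and derive all three report fields from it."""
    parsed = parse(run_lambda)  # the only place the command is executed
    if not parsed:
        return {
            "kunlun_xpu_info": None,
            "kunlun_driver_version": None,
            "kunlun_xre_version": None,
        }
    return {
        "kunlun_xpu_info": parsed.get("devices"),
        "kunlun_driver_version": parsed.get("driver_version"),
        "kunlun_xre_version": parsed.get("xre_version"),
    }
```

Besides saving two process launches, deriving every field from one snapshot avoids a report in which the driver version and the device list come from different xpu-smi runs.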

Comment thread collect_env.py
parsed = parse_xpu_smi_output(run_lambda)
if parsed and parsed["xre_version"]:
    return parsed["xre_version"]
return "N/A (not installed or not detected)"

Copilot AI Feb 25, 2026

get_kunlun_xre_version() returns "N/A (not installed or not detected)" whenever parsing fails, including when xpu-smi can't be executed or returns an error. This can misreport "not installed" when the real issue is "could not collect"; consider returning None (so the caller prints "Could not collect") or including the actual command failure reason.

Suggested change
- return "N/A (not installed or not detected)"
+ return None
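
The distinction between "could not collect" and "not detected" can be sketched as below. This variant takes an already parsed result rather than run_lambda, which is an assumption made to keep the example self-contained; format_field and the "N/A (not detected)" wording are likewise hypothetical.

```python
def get_kunlun_xre_version(parsed):
    """Return the XRE version, a 'not detected' marker, or None when collection failed."""
    if parsed is None:
        # xpu-smi could not be run or its output could not be parsed at all.
        return None
    # The command worked but reported no runtime version.
    return parsed.get("xre_version") or "N/A (not detected)"


def format_field(value):
    """Render a report field, mapping None to the generic failure text."""
    return "Could not collect" if value is None else str(value)
```

With this split, a missing xpu-smi binary and a machine without the runtime installed produce different lines in the report, which is the information a maintainer needs when triaging.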

Comment thread collect_env.py
Comment on lines +508 to +512
secret_terms = ("secret", "token", "api", "access", "password")

for k, v in sorted(os.environ.items()):
    if any(term in k.lower() for term in secret_terms):
        continue

Copilot AI Feb 25, 2026

The script prints environment variable values (e.g., TORCH_*, VLLM_*) into the report. Even with the current substring filter, this can still leak sensitive data in bug reports if a secret variable doesn’t match the filtered terms. Consider masking values by default (e.g., show only keys or redact values) and/or requiring explicit opt-in to print env var values.
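
A redact-by-default variant might look like the following sketch. The function name, the show_values flag, the "<redacted>" placeholder, and the prefix list are all assumptions chosen for illustration; the secret_terms tuple is copied from the excerpt above.

```python
def collect_env_vars(environ, prefixes=("TORCH", "VLLM", "XPU"), show_values=False):
    """List matching environment variables, hiding values unless explicitly requested."""
    secret_terms = ("secret", "token", "api", "access", "password")
    lines = []
    for key, value in sorted(environ.items()):
        if not key.startswith(prefixes):
            continue
        if any(term in key.lower() for term in secret_terms):
            continue  # never list variables whose names look sensitive
        # Values are printed only on explicit opt-in.
        lines.append(f"{key}={value}" if show_values else f"{key}=<redacted>")
    return lines
```

Called as collect_env_vars(os.environ), this lists which variables are set without exposing their contents; a user who has reviewed the values can rerun with show_values=True before pasting the report.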

xyDong0223 (Collaborator) commented:

I think that is a good job! @Lidang-Jiang cc @liwei109 @baoqian426

xyDong0223 added the documentation label (Improvements or additions to documentation) on Feb 26, 2026
xyDong0223 merged commit 153093d into baidu:main on Feb 27, 2026
7 checks passed