Description
System Info / 系統信息
xinference docker 1.6.1
A100, CUDA 12.4
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
- docker
- pip install
- installation from source
Version info / 版本信息
xinference docker 1.6.1
The command used to start Xinference / 用以启动 xinference 的命令
docker run --name xinference -d -p 9997:9997 -p 9998:9998 \
  -e XINFERENCE_HOME=/data -e VLLM_USE_V1=1 \
  -v ****:/data \
  -v ****/saves:*****/saves \
  --gpus all xprobe/xinference:v1.6.1 sh -c "xinference-local -H 0.0.0.0"
Reproduction / 复现过程
Deploy qwen3-4B with the vLLM engine. The V1 engine initializes with the following config:
Initializing a V1 LLM engine (v0.8.5) with config: model='xxxxx', speculative_config=None, tokenizer='xxxx', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=xxxx, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
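For reference, the steps can be reproduced from the Python client roughly as follows. This is a hypothetical sketch: the endpoint comes from the docker command above, and the model name and launch parameters are assumptions, not values taken from the original logs.

```python
from xinference.client import Client

# Endpoint assumed from the docker command above (-p 9997:9997).
client = Client("http://localhost:9997")

# Launch qwen3 on the vLLM engine; the V1 engine is selected via
# VLLM_USE_V1=1 in the container environment. Model name and
# parameters here are illustrative assumptions.
model_uid = client.launch_model(
    model_name="qwen3",
    model_engine="vllm",
)

# Unloading is where the failure occurs: the worker logs
# "Sub pool can't be killed" and the model process keeps running.
client.terminate_model(model_uid)
```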
Unloading the model fails:
2025-06-09 21:24:31,320 xinference.core.worker 147 INFO [request ce205be4-45b2-11f0-aa3e-0242ac110002] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7fee2227b6a0>, kwargs: model_uid=qwen3_v1.5.19-0
2025-06-09 21:24:31,324 xinference.model.llm.vllm.core 158156 INFO Stopping vLLM engine
[2025-06-09 21:24:31] INFO pool.py:430: Sub pool can't be killed: psutil.Process(pid=158156, name='Model: qwen3_', status='running', started='20:40:07')
2025-06-09 21:24:31,429 xinference.core.worker 147 INFO [request ce205be4-45b2-11f0-aa3e-0242ac110002] Leave terminate_model, elapsed time: 0 s
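The "Sub pool can't be killed" message indicates the model subprocess (pid 158156) survives the kill attempt and stays in the 'running' state, so terminate_model returns without actually freeing the model. The general kill-and-verify pattern looks roughly like this (an illustrative psutil sketch under my own assumptions, not the actual xoscar pool.py code):

```python
import psutil

def try_kill_subprocess(pid: int, timeout: float = 5.0) -> bool:
    """Illustrative sketch: terminate a model subprocess and verify it
    actually exited. Not the real xoscar/Xinference implementation."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return True  # already gone

    proc.terminate()  # SIGTERM first
    try:
        proc.wait(timeout=timeout)
        return True
    except psutil.TimeoutExpired:
        pass

    proc.kill()  # escalate to SIGKILL
    try:
        proc.wait(timeout=timeout)
        return True
    except psutil.TimeoutExpired:
        # Process is still 'running' -- the situation the
        # "Sub pool can't be killed" log line reports.
        return False
```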
Expected behavior / 期待表现
Models served with the vLLM V1 engine can be unloaded normally.