Description
System Info / 系統信息
xinference docker 1.6.1
A100, CUDA 12.4
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
- docker
- pip install
- installation from source
Version info / 版本信息
xinference docker 1.6.1
The command used to start Xinference / 用以启动 xinference 的命令
docker run --name xinference -d -p 9997:9997 -p 9998:9998 \
  -e XINFERENCE_HOME=/data -e VLLM_USE_V1=1 \
  -v ****:/data \
  -v ****/saves:*****/saves \
  --gpus all xprobe/xinference:v1.6.1 sh -c "xinference-local -H 0.0.0.0"
Reproduction / 复现过程
Deploy qwen3-4B with the vLLM engine. The V1 engine initializes with the following config:
Initializing a V1 LLM engine (v0.8.5) with config: model='xxxxx', speculative_config=None, tokenizer='xxxx', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=xxxx, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
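For reference, the steps can be reproduced from the Python client roughly as follows. This is a hypothetical sketch: the endpoint comes from the docker command above, and the model name and launch parameters are assumptions, not values taken from the original logs.

```python
from xinference.client import Client

# Endpoint assumed from the docker command above (-p 9997:9997).
client = Client("http://localhost:9997")

# Launch qwen3 on the vLLM engine; the V1 engine is selected via
# VLLM_USE_V1=1 in the container environment. Model name and
# parameters here are illustrative assumptions.
model_uid = client.launch_model(
    model_name="qwen3",
    model_engine="vllm",
)

# Unloading is where the failure occurs: the worker logs
# "Sub pool can't be killed" and the model process keeps running.
client.terminate_model(model_uid)
```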
Unloading the model fails:
2025-06-09 21:24:31,320 xinference.core.worker 147 INFO [request ce205be4-45b2-11f0-aa3e-0242ac110002] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7fee2227b6a0>, kwargs: model_uid=qwen3_v1.5.19-0
2025-06-09 21:24:31,324 xinference.model.llm.vllm.core 158156 INFO Stopping vLLM engine
[2025-06-09 21:24:31] INFO pool.py:430: Sub pool can't be killed: psutil.Process(pid=158156, name='Model: qwen3_', status='running', started='20:40:07')
2025-06-09 21:24:31,429 xinference.core.worker 147 INFO [request ce205be4-45b2-11f0-aa3e-0242ac110002] Leave terminate_model, elapsed time: 0 s
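The "Sub pool can't be killed" message indicates the model subprocess (pid 158156) survives the kill attempt and stays in the 'running' state, so terminate_model returns without actually freeing the model. The general kill-and-verify pattern looks roughly like this (an illustrative psutil sketch under my own assumptions, not the actual xoscar pool.py code):

```python
import psutil

def try_kill_subprocess(pid: int, timeout: float = 5.0) -> bool:
    """Illustrative sketch: terminate a model subprocess and verify it
    actually exited. Not the real xoscar/Xinference implementation."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return True  # already gone

    proc.terminate()  # SIGTERM first
    try:
        proc.wait(timeout=timeout)
        return True
    except psutil.TimeoutExpired:
        pass

    proc.kill()  # escalate to SIGKILL
    try:
        proc.wait(timeout=timeout)
        return True
    except psutil.TimeoutExpired:
        # Process is still 'running' -- the situation the
        # "Sub pool can't be killed" log line reports.
        return False
```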
Expected behavior / 期待表现
Models served with the vLLM V1 engine can be unloaded normally.