System Info
Master node: installed with pip install
Worker node: runs in Docker
- Ubuntu 22.04
- RTX 4090
- CUDA Version: 11.8
- Nvidia Driver Version: 550.90.07
- torch: 2.3.1
- python: 3.10
- xformers: 0.0.27
Running Xinference with Docker?
- docker
- pip install
- installation from source
Version info
V0.13.1
The command used to start Xinference
xinference-supervisor -H "192.168.***" --auth-config /data//auth_config.json --log-level info
docker run -d -e XINFERENCE_MODEL_SRC=modelscope -e TZ=Asia/Shanghai -e NCCL_DEBUG=INFO -e XINFERENCE_ENDPOINT=http://127.0.0.1:9998/**** -e HF_ENDPOINT=https://hf-mirror.com -v /data/***:/root/.cache -v /data/:/root/xinference/data --name xinference-work --gpus all --shm-size 64g --net=host registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.13.1 xinference-worker -e "http://192.168.******:9997" -H "192.168." --log-level info
Reproduction
- Start the master (supervisor) and the worker.
- Stop the worker.
Result:
Note: some of the data has been desensitized and replaced with asterisks (***).
2024-07-30 09:16:16,638 xinference.core.supervisor 192779 ERROR Worker timeout. address: 192.168.*******:31147, check count remaining 4...
2024-07-30 09:16:21,645 xinference.core.supervisor 192779 ERROR Worker timeout. address: 192.168.*******:31147, check count remaining 3...
2024-07-30 09:16:26,651 xinference.core.supervisor 192779 ERROR Worker timeout. address: 192.168.*******:31147, check count remaining 2...
2024-07-30 09:16:31,657 xinference.core.supervisor 192779 ERROR Worker timeout. address: 192.168.*******:31147, check count remaining 1...
2024-07-30 09:16:36,663 xinference.core.supervisor 192779 ERROR Worker dead. address: 192.168.*******:31147, influenced models: ['hn***-4-1', 'hn***-4-3']
2024-07-30 09:16:37,500 uvicorn.access 192644 INFO 117.158.******:50676 - "POST /v1/chat/completions HTTP/1.1" 200
2024-07-30 09:16:44,115 uvicorn.access 192644 INFO 117.158.******:23441 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,124 uvicorn.access 192644 INFO 117.158.******:24016 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,126 uvicorn.access 192644 INFO 117.158.******:14711 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,199 xinference.api.restful_api 192644 ERROR [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***
Traceback (most recent call last):
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 759, in describe_model
    data = await (await self._get_supervisor_ref()).describe_model(model_uid)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message) # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1102, in describe_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***
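For readers following the log, the countdown before "Worker dead" can be illustrated with a small stand-alone sketch. This is not Xinference's code: the `WorkerStatus` class, the `on_missed_heartbeat` function, the budget of 5 checks (inferred only from the countdown in the log above), and the placeholder address and model names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerStatus:
    """Hypothetical per-worker record; not Xinference's actual data structure."""
    address: str
    models: list[str] = field(default_factory=list)
    remaining_checks: int = 5  # assumed budget, inferred from the log countdown
    alive: bool = True


def on_missed_heartbeat(worker: WorkerStatus) -> str:
    """Consume one check and return a log line shaped like those in the report."""
    worker.remaining_checks -= 1
    if worker.remaining_checks > 0:
        return (f"Worker timeout. address: {worker.address}, "
                f"check count remaining {worker.remaining_checks}...")
    # Budget exhausted: mark the worker dead and list the models it hosted.
    worker.alive = False
    return (f"Worker dead. address: {worker.address}, "
            f"influenced models: {worker.models}")


# Placeholder address and replica names, chosen only for the example.
worker = WorkerStatus("192.168.0.2:31147", ["model-a-4-1", "model-a-4-3"])
lines = [on_missed_heartbeat(worker) for _ in range(5)]
for line in lines:
    print(line)
```

In this sketch, four missed checks produce "Worker timeout" lines and the fifth marks the worker dead, matching the sequence in the log. What the supervisor does with the influenced models afterwards is the part this issue is about.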
Expected behavior
- If one of several worker nodes goes offline, the cluster should continue to respond normally.
- If a worker node goes offline and then comes back online, it should rejoin the cluster and serve requests normally.
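The expected behavior could be sketched roughly as follows. This is a minimal, hypothetical model of a supervisor-side replica table, not Xinference's implementation: the `ReplicaRegistry` class, its method names, and the worker and replica identifiers are all invented for illustration. The idea is that losing a worker removes only the replicas on that worker, so a model stays available while any replica survives, and a returning worker can register its replicas again.

```python
from collections import defaultdict


class ReplicaRegistry:
    """Hypothetical table: model uid -> {replica uid: worker address}."""

    def __init__(self) -> None:
        self._replicas: dict[str, dict[str, str]] = defaultdict(dict)

    def register(self, model_uid: str, replica_uid: str, worker: str) -> None:
        """Record that a replica of the model is hosted on the given worker."""
        self._replicas[model_uid][replica_uid] = worker

    def remove_worker(self, worker: str) -> list[str]:
        """Drop only the replicas hosted on the dead worker; return their uids."""
        removed = []
        for model_uid in list(self._replicas):
            replicas = self._replicas[model_uid]
            for replica_uid in [r for r, w in replicas.items() if w == worker]:
                del replicas[replica_uid]
                removed.append(replica_uid)
            if not replicas:  # no replica left anywhere: the model is really gone
                del self._replicas[model_uid]
        return removed

    def describe(self, model_uid: str) -> list[str]:
        """Return the surviving replica uids, or raise if none remain."""
        if model_uid not in self._replicas:
            raise ValueError(f"Model not found in the model list, uid: {model_uid}")
        return sorted(self._replicas[model_uid])


registry = ReplicaRegistry()
registry.register("model-a", "model-a-2-0", "worker-1")
registry.register("model-a", "model-a-2-1", "worker-2")

lost = registry.remove_worker("worker-2")    # worker-2 goes offline
survivors = registry.describe("model-a")     # still served by worker-1

registry.register("model-a", "model-a-2-1", "worker-2")  # worker-2 comes back
restored = registry.describe("model-a")
```

Under these assumptions, `describe` raises "Model not found" only when every replica of a model has been lost, rather than as soon as one worker dies.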