When deployed in cluster mode with Docker, the service becomes abnormal if a worker node goes offline.

### System Info

**The master node is installed with pip; the worker node runs in Docker.**

- Ubuntu 22.04
- RTX 4090
- CUDA Version: 11.8
- Nvidia Driver Version: 550.90.07
- torch: 2.3.1
- Python: 3.10
- xformers: 0.0.27

### Running Xinference with Docker?

- [X] docker
- [x] pip install
- [ ] installation from source

### Version info

v0.13.1

### The command used to start Xinference
xinference-supervisor -H "192.168.******" --auth-config /data/***/auth_config.json --log-level info

docker run -d -e XINFERENCE_MODEL_SRC=modelscope -e TZ=Asia/Shanghai -e NCCL_DEBUG=INFO -e XINFERENCE_ENDPOINT=http://127.0.0.1:9998/**** -e HF_ENDPOINT=https://hf-mirror.com -v /data/****:/root/.cache -v /data/*****:/root/xinference/data --name xinference-work --gpus all --shm-size 64g --net=host registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.13.1 xinference-worker -e "http://192.168.******:9997" -H "192.168.******" --log-level info

### Reproduction

1. Start the master (supervisor) and the worker.
2. Stop the worker.

### Result

**Some data has been redacted and replaced with `*`.**

```
2024-07-30 09:16:16,638 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 4...
2024-07-30 09:16:21,645 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 3...
2024-07-30 09:16:26,651 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 2...
2024-07-30 09:16:31,657 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 1...
2024-07-30 09:16:36,663 xinference.core.supervisor 192779 ERROR    Worker dead. address: 192.168.*******:31147, influenced models: ['hn***-4-1', 'hn***-4-3']
2024-07-30 09:16:37,500 uvicorn.access 192644 INFO     117.158.******:50676 - "POST /v1/chat/completions HTTP/1.1" 200
2024-07-30 09:16:44,115 uvicorn.access 192644 INFO     117.158.******:23441 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,124 uvicorn.access 192644 INFO     117.158.******:24016 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,126 uvicorn.access 192644 INFO     117.158.******:14711 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,199 xinference.api.restful_api 192644 ERROR    [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***
Traceback (most recent call last):
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 759, in describe_model
    data = await (await self._get_supervisor_ref()).describe_model(model_uid)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1102, in describe_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***

```
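To make the sequence in the log easier to inspect, the following is a minimal sketch that pulls the `Worker timeout` and `Worker dead` events out of supervisor log text. The regular expressions are written against the exact message wording shown above; the function name `summarize_worker_events` is hypothetical and is not part of Xinference.

```python
import re

# Patterns written against the supervisor log messages shown above.
TIMEOUT_RE = re.compile(
    r"Worker timeout\. address: (?P<addr>[^,\s]+), check count remaining (?P<left>\d+)"
)
DEAD_RE = re.compile(
    r"Worker dead\. address: (?P<addr>[^,\s]+), influenced models: \[(?P<models>[^\]]*)\]"
)


def summarize_worker_events(log_text):
    """Return (timeouts, dead).

    timeouts maps a worker address to the list of remaining check counts seen.
    dead maps a worker address to the list of influenced model uids.
    """
    timeouts = {}
    dead = {}
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.setdefault(match["addr"], []).append(int(match["left"]))
            continue
        match = DEAD_RE.search(line)
        if match:
            # The model list is logged as a Python list repr: ['a', 'b'].
            models = [
                item.strip().strip("'\"")
                for item in match["models"].split(",")
                if item.strip()
            ]
            dead[match["addr"]] = models
    return timeouts, dead
```

Running this over the log above would be expected to show one worker address with a countdown from 4 to 1 and then a dead entry listing the two influenced model replicas, which are the ones that later fail with `Model not found in the model list`.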


### Expected behavior

1. If one of several worker nodes goes offline, the cluster continues to respond normally.
2. If a worker node goes offline and later comes back online, it resumes serving requests normally.
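The first expectation can be illustrated with a small, purely hypothetical sketch: requests are routed only to replicas whose worker is still alive, and a failed call falls through to the next replica. This is not how Xinference is implemented; `live_replicas`, `call_with_failover`, the placement map, and the `send` callable are all assumptions made for illustration.

```python
def live_replicas(replica_to_worker, dead_workers):
    """Return replica uids whose worker is not marked dead, in sorted order."""
    return sorted(
        uid for uid, worker in replica_to_worker.items()
        if worker not in dead_workers
    )


def call_with_failover(replicas, send):
    """Try each replica in turn and return the first successful result.

    `send` is a caller-supplied function (hypothetical) that takes a replica
    uid and raises ConnectionError when that replica cannot be reached.
    """
    last_error = None
    for uid in replicas:
        try:
            return send(uid)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next replica
    raise RuntimeError("no live replica could serve the request") from last_error
```

With a placement in which two of four replicas sit on a dead worker, `live_replicas` returns only the two remaining replicas, and `call_with_failover` raises only when every one of them fails.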

deploying in cluster mode in Docker, if the worker nodes go offline, the service will be abnormal. #1973
