When deploying in cluster mode with Docker, the service becomes abnormal if a worker node goes offline. #1973

@GabrielXie

System Info

Master node: installed with pip install
Worker node: runs in Docker

  • Ubuntu 22.04
  • RTX 4090
  • CUDA Version: 11.8
  • Nvidia Driver Version: 550.90.07
  • torch :2.3.1
  • python: 3.10
  • xformers: 0.0.27

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

V0.13.1

The command used to start Xinference

xinference-supervisor -H "192.168.***" --auth-config /data//auth_config.json --log-level info

docker run -d -e XINFERENCE_MODEL_SRC=modelscope -e TZ=Asia/Shanghai -e NCCL_DEBUG=INFO -e XINFERENCE_ENDPOINT=http://127.0.0.1:9998/**** -e HF_ENDPOINT=https://hf-mirror.com -v /data/***:/root/.cache -v /data/:/root/xinference/data --name xinference-work --gpus all --shm-size 64g --net=host registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.13.1 xinference-worker -e "http://192.168.******:9997" -H "192.168." --log-level info

Reproduction

  1. Start the master (supervisor) and the worker.
  2. Stop the worker.

Result:

Note: some of the data has been redacted and replaced with asterisks (*).

2024-07-30 09:16:16,638 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 4...
2024-07-30 09:16:21,645 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 3...
2024-07-30 09:16:26,651 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 2...
2024-07-30 09:16:31,657 xinference.core.supervisor 192779 ERROR    Worker timeout. address: 192.168.*******:31147, check count remaining 1...
2024-07-30 09:16:36,663 xinference.core.supervisor 192779 ERROR    Worker dead. address: 192.168.*******:31147, influenced models: ['hn***-4-1', 'hn***-4-3']
2024-07-30 09:16:37,500 uvicorn.access 192644 INFO     117.158.******:50676 - "POST /v1/chat/completions HTTP/1.1" 200
2024-07-30 09:16:44,115 uvicorn.access 192644 INFO     117.158.******:23441 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,124 uvicorn.access 192644 INFO     117.158.******:24016 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,126 uvicorn.access 192644 INFO     117.158.******:14711 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-30 09:16:44,199 xinference.api.restful_api 192644 ERROR    [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***
Traceback (most recent call last):
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 759, in describe_model
    data = await (await self._get_supervisor_ref()).describe_model(model_uid)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/xiegangpeng/.conda/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1102, in describe_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=192.168.******:50201, pid=192779] Model not found in the model list, uid: hn***
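The supervisor log above shows a countdown: four "Worker timeout" messages with the remaining check count going from 4 to 1, then "Worker dead" together with the list of influenced models. The following is a minimal sketch of that countdown as it appears in the log, not Xinference's actual implementation. The class `WorkerHealth`, the function `check_worker`, and the default of 5 checks are assumptions inferred from the log output only.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerHealth:
    """Hypothetical per-worker liveness state (names are illustrative)."""
    address: str
    models: list[str]
    max_checks: int = 5  # assumption: inferred from the 4, 3, 2, 1, dead sequence
    remaining: int = field(init=False)

    def __post_init__(self) -> None:
        self.remaining = self.max_checks


def check_worker(worker: WorkerHealth, heartbeat_ok: bool) -> str:
    """Run one liveness check and return a log-style message."""
    if heartbeat_ok:
        # A fresh heartbeat resets the countdown.
        worker.remaining = worker.max_checks
        return "alive"
    worker.remaining -= 1
    if worker.remaining > 0:
        return (
            f"Worker timeout. address: {worker.address}, "
            f"check count remaining {worker.remaining}..."
        )
    # Countdown exhausted: the worker and every model it hosted are dropped.
    return (
        f"Worker dead. address: {worker.address}, "
        f"influenced models: {worker.models}"
    )


if __name__ == "__main__":
    w = WorkerHealth("192.168.*******:31147", ["hn***-4-1", "hn***-4-3"])
    for _ in range(5):
        print(check_worker(w, heartbeat_ok=False))
```

Under this model, once the worker is declared dead its models no longer appear in the model list, which is consistent with the later `Model not found in the model list` error raised from `describe_model`.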

Expected behavior

  1. If one of several worker nodes goes offline, the cluster continues to respond normally.
  2. If a worker node goes offline and later comes back online, it resumes serving requests normally.
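Until the second expectation is met, an operator could approximate it from the outside by re-launching any models that disappeared when the worker was declared dead. The sketch below only illustrates that idea. `relaunch_missing`, `list_running`, and `launch` are hypothetical names, and the two callables stand in for whatever client the deployment uses to list and launch models; nothing here is taken from Xinference itself.

```python
from typing import Callable, Iterable


def relaunch_missing(
    wanted: Iterable[str],
    list_running: Callable[[], Iterable[str]],
    launch: Callable[[str], None],
) -> list[str]:
    """Launch every wanted model uid that is not currently running.

    Returns the uids that were re-launched, in the order they were requested.
    """
    running = set(list_running())  # snapshot, so launch() may mutate the source
    relaunched = []
    for uid in wanted:
        if uid not in running:
            launch(uid)
            relaunched.append(uid)
    return relaunched


if __name__ == "__main__":
    # Simulated cluster state after the worker has come back empty.
    running_models = {"model-a"}
    done = relaunch_missing(
        wanted=["model-a", "model-b"],
        list_running=lambda: running_models,
        launch=running_models.add,
    )
    print(done)
```

Such a helper would be run after the worker has re-registered with the supervisor; it does not address the first expectation, which concerns the cluster staying responsive while the worker is still offline.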
