
Distributed inference is very slow with Mac M2 Ultra #1233

@gaord

Description


Describe the bug

With two Mac Studio M2 Ultra machines (192 GB and 64 GB), I created a GPU cluster; the Resources page shows two workers ready. I deployed DeepSeek-R1-UD-IQ1_S.gguf (131 GB) locally as one big file with the following distribution configuration:

[Screenshots: distribution configuration]
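
For context, the distributed mode in the llama-box backend builds on llama.cpp's RPC mechanism (see the issue label). A minimal sketch of that mechanism is below; the worker address, port, and paths are placeholder assumptions, not values taken from this setup:

```python
# Illustration of llama.cpp's RPC-based model splitting, which llama-box builds on.
# Run each function on the machine named in its docstring; the address below is
# a placeholder for the second Mac Studio, not a value from this setup.
import subprocess

WORKER = "192.168.1.21:50052"  # assumed LAN address:port of the 64 GB worker

def serve_worker_gpu():
    """On the 64 GB worker: expose its GPU via llama.cpp's rpc-server."""
    subprocess.run(["rpc-server", "-H", "0.0.0.0", "-p", "50052"])

def run_split_inference():
    """On the 192 GB node: offload part of the model to the remote RPC worker."""
    subprocess.run([
        "llama-cli",
        "-m", "DeepSeek-R1-UD-IQ1_S.gguf",
        "--rpc", WORKER,   # comma-separated list of rpc-server endpoints
        "-ngl", "99",      # offload all layers, split across local GPU and worker
        "-p", "hello",
    ])
```

Every generated token requires activations to cross the network link between the two nodes, which is one plausible reason a split run trails a single-node run.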

Result

Inference is very slow: 0.69 tokens/s.

[Screenshot: inference speed]
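
For reproducibility, a minimal sketch of measuring the tokens/s figure against the deployed model's OpenAI-compatible chat endpoint; the base URL, model name, and API key below are placeholder assumptions, adjust them to the actual GPUStack deployment:

```python
# Rough throughput probe: stream a chat completion and count chunks per second.
# BASE_URL, MODEL, and API_KEY are placeholders, not GPUStack defaults.
import time
import requests

BASE_URL = "http://localhost/v1"   # assumed endpoint of the GPUStack server
MODEL = "deepseek-r1-ud-iq1_s"     # assumed deployment name
API_KEY = "sk-placeholder"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
    timeout=600,
)

start, n_chunks = time.time(), 0
for line in resp.iter_lines():
    # Each SSE data line is roughly one generated token.
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        n_chunks += 1

# Timing includes prompt processing, so this is a lower bound on decode speed.
print(f"{n_chunks / (time.time() - start):.2f} tokens/s (approx.)")
```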

Expected behavior

The same hardware typically delivers around 17 tokens/s with an Ollama or llama.cpp backend. GPUStack should be able to catch up with that.
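
The Ollama baseline can be read directly from the stats Ollama returns in its final streamed response (eval_count tokens decoded over eval_duration nanoseconds). A short sketch, with the model tag as a placeholder assumption:

```python
# Pull the decode tokens/s that Ollama itself reports for a generation.
# The model tag is a placeholder; use whichever tag serves the same GGUF.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-r1:671b", "prompt": "Write a haiku about GPUs."},
    stream=True,
    timeout=600,
)

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get("done"):
        # eval_count = generated tokens; eval_duration = decode time in ns
        print(f"{chunk['eval_count'] / chunk['eval_duration'] * 1e9:.2f} tokens/s")
```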

Environment

  • GPUStack version: 0.5.1
  • OS: macOS 14/15
  • GPU: Mac Studio M2 Ultra

Labels

rpc server: llama-box RPC server issues
