
Distributed inference is very slow with Mac M2 Ultra #1233

@gaord

Description


Describe the bug

With two Mac Studio M2 Ultra machines (192 GB and 64 GB), I created a GPU cluster; the Resources page shows two workers ready. I deployed DeepSeek-R1-UD-IQ1_S.gguf (131 GB) locally as one big file with the following distribution configuration:

[Screenshots: distribution configuration]
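
For context, the distributed mode in the llama-box backend builds on llama.cpp's RPC mechanism (see the issue label). A minimal sketch of that mechanism is below; the worker address, port, and paths are placeholder assumptions, not values taken from this setup:

```python
# Illustration of llama.cpp's RPC-based model splitting, which llama-box builds on.
# Run each function on the machine named in its docstring; the address below is
# a placeholder for the second Mac Studio, not a value from this setup.
import subprocess

WORKER = "192.168.1.21:50052"  # assumed LAN address:port of the 64 GB worker

def serve_worker_gpu():
    """On the 64 GB worker: expose its GPU via llama.cpp's rpc-server."""
    subprocess.run(["rpc-server", "-H", "0.0.0.0", "-p", "50052"])

def run_split_inference():
    """On the 192 GB node: offload part of the model to the remote RPC worker."""
    subprocess.run([
        "llama-cli",
        "-m", "DeepSeek-R1-UD-IQ1_S.gguf",
        "--rpc", WORKER,   # comma-separated list of rpc-server endpoints
        "-ngl", "99",      # offload all layers, split across local GPU and worker
        "-p", "hello",
    ])
```

Every generated token requires activations to cross the network link between the two nodes, which is one plausible reason a split run trails a single-node run.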

Result

Inference is very slow: 0.69 tokens/s.

[Screenshot: inference speed]
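
For reproducibility, a minimal sketch of measuring the tokens/s figure against the deployed model's OpenAI-compatible chat endpoint; the base URL, model name, and API key below are placeholder assumptions, adjust them to the actual GPUStack deployment:

```python
# Rough throughput probe: stream a chat completion and count chunks per second.
# BASE_URL, MODEL, and API_KEY are placeholders, not GPUStack defaults.
import time
import requests

BASE_URL = "http://localhost/v1"   # assumed endpoint of the GPUStack server
MODEL = "deepseek-r1-ud-iq1_s"     # assumed deployment name
API_KEY = "sk-placeholder"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
    timeout=600,
)

start, n_chunks = time.time(), 0
for line in resp.iter_lines():
    # Each SSE data line is roughly one generated token.
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        n_chunks += 1

# Timing includes prompt processing, so this is a lower bound on decode speed.
print(f"{n_chunks / (time.time() - start):.2f} tokens/s (approx.)")
```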

Expected behavior

The same hardware typically delivers around 17 tokens/s with an Ollama or llama.cpp backend. GPUStack should be able to catch up with that.
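
The Ollama baseline can be read directly from the stats Ollama returns in its final streamed response (eval_count tokens decoded over eval_duration nanoseconds). A short sketch, with the model tag as a placeholder assumption:

```python
# Pull the decode tokens/s that Ollama itself reports for a generation.
# The model tag is a placeholder; use whichever tag serves the same GGUF.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-r1:671b", "prompt": "Write a haiku about GPUs."},
    stream=True,
    timeout=600,
)

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get("done"):
        # eval_count = generated tokens; eval_duration = decode time in ns
        print(f"{chunk['eval_count'] / chunk['eval_duration'] * 1e9:.2f} tokens/s")
```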

Environment

  • GPUStack version: 0.5.1
  • OS: macOS 14/15
  • GPU: Mac Studio M2 Ultra

Labels

rpc server: llama-box RPC server issues
