Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Running DeepSeek-R1 or V3 inference normally requires 8x H100 80GB because of the huge memory footprint, and it is very challenging to run R1 or V3 inference on a single consumer GPU (e.g. a 24GB 4090) plus limited CPU memory (say 32GB), since the model has 685B MoE parameters even with low-bit quantization.
But since V3 and R1 have only 37B activated params (37B weights in INT4 is about 18.5GB), is it possible for MoE inference to load only the weights of the activated experts into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g. 32GB), and leave the majority of the unused experts' weights on disk (since CPU memory is also limited), loading/unloading experts only when they are actually used?
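As a rough sanity check on those numbers, here is a back-of-envelope sketch; the 685B / 37B parameter counts come from the description above, and 0.5 bytes per weight for 4-bit quantization is an assumption that ignores quantization block/scale overhead:

```cpp
// Back-of-envelope memory math for the GPU/CPU/disk split described above.
#include <cstdio>

int main() {
    const double bytes_per_weight = 0.5;   // ~INT4, ignoring scale/zero-point overhead
    const double total_params     = 685e9; // all experts + shared weights
    const double active_params    = 37e9;  // parameters activated per token

    std::printf("full model   : %.1f GB\n", total_params  * bytes_per_weight / 1e9); // ~342.5 GB
    std::printf("active slice : %.1f GB\n", active_params * bytes_per_weight / 1e9); // ~18.5 GB
    return 0;
}
```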
DeepSeek-R1's MoE has 61 layers. The current llama.cpp implementation loads n-gpu-layers of the MoE into GPU memory (say 7 layers on a 24GB 4090) and keeps the weights of all remaining layers in CPU memory, but when the CPU also has limited memory (say 32GB) there is heavy swapping and inference becomes extremely slow. If instead we only loaded the "activated" experts (37B), they could fit on a single 24GB 4090, with no expensive swapping of unused experts between CPU memory and disk, and I would expect 685B DeepSeek-R1 inference performance to get close to that of a 37B LLM.
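A very rough sketch of what this could look like (not actual llama.cpp code; the expert IDs, the load_expert placeholder, and the cache size are all hypothetical, just to illustrate keeping an LRU set of experts resident on the GPU and paging the rest in from CPU memory/disk on demand):

```cpp
// Sketch of per-expert paging: keep an LRU cache of expert weight blocks on the
// GPU and only page an expert in when the router selects it for the current token.
// ExpertBlock::weights is a stand-in for a real GPU buffer handle, and
// load_expert() is a placeholder for the disk/CPU-RAM -> GPU upload.
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using ExpertId = std::pair<int, int>;  // (layer index, expert index)

struct ExpertIdHash {
    size_t operator()(const ExpertId & id) const {
        return std::hash<int64_t>()((int64_t(id.first) << 32) ^ uint32_t(id.second));
    }
};

struct ExpertBlock {
    std::vector<uint8_t> weights;  // stand-in for the expert's quantized tensors on GPU
};

class ExpertCache {
public:
    explicit ExpertCache(size_t max_resident) : max_resident_(max_resident) {}

    // Return the expert's weights, paging them in if needed and evicting the
    // least-recently-used expert once the GPU budget is exceeded.
    ExpertBlock & get(const ExpertId & id) {
        auto it = cache_.find(id);
        if (it != cache_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second.lru_it);  // mark as most recent
            return it->second.block;
        }
        if (cache_.size() >= max_resident_) {
            cache_.erase(lru_.back());  // evict coldest expert (would free its GPU buffer)
            lru_.pop_back();
        }
        lru_.push_front(id);
        Entry & e = cache_[id];
        e.block.weights = load_expert(id);  // placeholder I/O: disk or CPU RAM -> GPU
        e.lru_it = lru_.begin();
        return e.block;
    }

private:
    struct Entry {
        ExpertBlock block;
        std::list<ExpertId>::iterator lru_it;
    };

    // Placeholder: read one expert's quantized tensors and upload them to the GPU.
    static std::vector<uint8_t> load_expert(const ExpertId &) { return std::vector<uint8_t>(1); }

    size_t max_resident_;
    std::list<ExpertId> lru_;
    std::unordered_map<ExpertId, Entry, ExpertIdHash> cache_;
};

int main() {
    ExpertCache cache(/*max_resident=*/64);                     // e.g. GPU budget in experts
    ExpertBlock & e = cache.get({/*layer=*/3, /*expert=*/17});  // paged in on first use
    (void) e;
    return 0;
}
```

The main cost would presumably be the per-token transfer over PCIe, since the router can select a different set of experts every token, so in practice some prefetching or pinning of the hottest experts would probably be needed.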
I'm wondering whether a similar feature is available or work in progress in llama.cpp or any other popular inference framework?
Really appreciate your help!
Motivation
This would help llama.cpp users run DeepSeek-R1, the best reasoning LLM by far, which has 685B MoE parameters but only 37B activated parameters, on a consumer GPU (24GB 4090) plus a consumer CPU (i7/i9 with 32GB memory), with low-bit quantization, at an acceptable inference speed.
Possible Implementation
No response