Open
Labels
enhancement (New feature or request)
Description
@tianyu-l Support for DeepSeek-V3 would be excellent given its top-tier performance.
Main parallelism components:
- 64-way expert parallelism
- 16-way pipeline parallelism
- ZeRO-1 data parallelism
- Note: they do not apply tensor parallelism (TP).
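
To make the layout above concrete, here is a minimal sketch of how the degrees might compose. The `ParallelLayout` class and its field names are hypothetical (not a torchtitan API), the world size of 2048 GPUs is an assumed value for illustration, and the rule that expert-parallel groups are carved out of the data-parallel ranks within one pipeline stage is an assumption about the mesh layout rather than something stated in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical description of a PP x DP layout with EP inside DP."""

    world_size: int            # total number of GPUs (assumed input)
    pipeline_parallel: int     # PP degree, 16 in the list above
    expert_parallel: int       # EP degree, 64 in the list above
    tensor_parallel: int = 1   # TP is not applied, so it stays at 1

    @property
    def data_parallel(self) -> int:
        """DP degree left over once PP and TP are taken out of the world size."""
        denom = self.pipeline_parallel * self.tensor_parallel
        if self.world_size % denom != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by PP*TP={denom}"
            )
        return self.world_size // denom

    @property
    def expert_groups_per_stage(self) -> int:
        """Number of EP groups per pipeline stage (assumes EP subdivides DP)."""
        return self.data_parallel // self.expert_parallel

    def validate(self) -> None:
        """Check the assumed constraint that EP evenly divides DP."""
        if self.data_parallel % self.expert_parallel != 0:
            raise ValueError(
                f"EP={self.expert_parallel} does not divide DP={self.data_parallel}"
            )


if __name__ == "__main__":
    layout = ParallelLayout(world_size=2048, pipeline_parallel=16, expert_parallel=64)
    layout.validate()
    print(layout.data_parallel)            # 128
    print(layout.expert_groups_per_stage)  # 2
```

Under these assumptions, each of the 16 pipeline stages spans 128 data-parallel ranks, which split into two 64-way expert-parallel groups.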
Other main modeling components:
- multi-head latent attention (MLA)
- multi-token prediction with their MTP modules
- mixed-precision training (mix of FP8, BF16, FP32)
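
As a rough illustration of why MLA matters for support work, the sketch below compares per-token, per-layer KV-cache sizes for standard multi-head attention and for MLA, which caches one compressed latent vector plus a small shared RoPE key instead of full per-head keys and values. The function names are hypothetical, and the dimensions used are illustrative assumptions that should be checked against the model config before relying on them.

```python
def kv_cache_elems_mha(n_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer with standard multi-head attention.

    Both keys and values are stored for every head.
    """
    return 2 * n_heads * head_dim


def kv_cache_elems_mla(kv_latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer with multi-head latent attention.

    Only the compressed KV latent and the decoupled RoPE key, both shared
    across heads, are stored.
    """
    return kv_latent_dim + rope_dim


if __name__ == "__main__":
    # Illustrative dimensions (assumptions, not taken from this issue).
    mha = kv_cache_elems_mha(n_heads=128, head_dim=128)
    mla = kv_cache_elems_mla(kv_latent_dim=512, rope_dim=64)
    print(mha)  # 32768
    print(mla)  # 576
```

With these assumed sizes, the MLA cache is well over an order of magnitude smaller per token, which is the main reason the attention module needs its own implementation rather than a reuse of the existing one.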
Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf