DeepSeek V3 Support #760

@casper-hansen

@tianyu-l Support for DeepSeek-V3 would be a valuable addition, given the model's top-tier performance.

Main parallelism components:

  • 64-way expert parallelism
  • 16-way pipeline parallelism
  • ZeRO-1 data parallelism
  • Note: tensor parallelism (TP) is not used.
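
To make the sizes above concrete, here is a rough sketch of how such a layout might decompose. It assumes a 2048-GPU world size and 256 routed experts (figures recalled from the DeepSeek-V3 report, not stated in this issue), and it assumes expert-parallel groups are carved out of the data-parallel ranks within each pipeline stage. The `ParallelLayout` class and its rank ordering are hypothetical, not an existing torchtitan API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: derives group sizes from the parallelism degrees."""
    world_size: int          # total GPUs (assumed: 2048)
    pp: int                  # pipeline-parallel degree (16 in the issue)
    ep: int                  # expert-parallel degree (64 in the issue)
    num_routed_experts: int  # routed experts per MoE layer (assumed: 256)

    @property
    def dp(self) -> int:
        # No TP, so every rank in a pipeline stage is a data-parallel rank.
        if self.world_size % self.pp != 0:
            raise ValueError("world_size must be divisible by pp")
        return self.world_size // self.pp

    @property
    def ep_groups_per_stage(self) -> int:
        # Assumption: EP groups partition the DP ranks of one stage.
        if self.dp % self.ep != 0:
            raise ValueError("dp must be divisible by ep")
        return self.dp // self.ep

    @property
    def experts_per_ep_rank(self) -> int:
        if self.num_routed_experts % self.ep != 0:
            raise ValueError("num_routed_experts must be divisible by ep")
        return self.num_routed_experts // self.ep

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Map a global rank to (pp_stage, dp_rank, ep_rank).

        The ordering (pipeline stage outermost) is an illustrative choice.
        """
        if not 0 <= rank < self.world_size:
            raise ValueError("rank out of range")
        pp_stage, dp_rank = divmod(rank, self.dp)
        return pp_stage, dp_rank, dp_rank % self.ep


layout = ParallelLayout(world_size=2048, pp=16, ep=64, num_routed_experts=256)
print(layout.dp, layout.ep_groups_per_stage, layout.experts_per_ep_rank)
```

Under these assumptions each pipeline stage holds 128 data-parallel ranks, split into two 64-way expert-parallel groups, with four routed experts per rank. ZeRO-1 optimizer-state sharding would then apply across the data-parallel ranks.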

Other main modeling components:

  • multi-head latent attention (MLA)
  • multi-token prediction with their MTP modules
  • mixed-precision training (mix of FP8, BF16, FP32)
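
As a small illustration of what multi-token prediction asks of the data pipeline, the sketch below builds shifted targets for each prediction depth. It covers only label alignment, under the assumption that head `k` predicts the token `k + 1` positions ahead. It does not model the MTP modules themselves, and `mtp_targets` is a hypothetical helper name.

```python
def mtp_targets(tokens: list[int], depth: int) -> list[tuple[list[int], list[int]]]:
    """Return one (inputs, targets) pair per prediction head.

    Head 0 is ordinary next-token prediction; head k (k >= 1) is the
    k-th additional prediction depth, whose target is offset by k + 1.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    pairs = []
    for k in range(depth + 1):
        offset = k + 1
        if offset >= len(tokens):
            break  # sequence too short to supply targets at this depth
        pairs.append((tokens[:-offset], tokens[offset:]))
    return pairs


pairs = mtp_targets([10, 11, 12, 13, 14], depth=1)
for head, (inputs, targets) in enumerate(pairs):
    print(head, inputs, targets)
```

Each additional depth shortens the usable sequence by one token, which a loss implementation would need to account for when averaging across heads.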

Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

Performance: [image]
