
[QUESTION] In what cases will tp_comm_overlap accelerate training #2662

@Cccei000

Description


Using 8×H20 GPUs to finetune Qwen3-32B with tp=8 and packed_seq_len=16384 (padded to a fixed sequence length, as required by the tp_comm_overlap userbuffers). The global batch size is 16.

With tp_comm_overlap enabled, the step time improves by less than 2 s/step from 43 s/step, which is minor.

Is there any guidance on what model sizes, TP sizes, and sequence lengths benefit from tp_comm_overlap?
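For context, a rough way to frame the question: tp_comm_overlap can at best hide the time spent in TP activation collectives (all-gather / reduce-scatter with sequence parallelism), so the upper bound on savings is the total TP communication time per step. Below is a back-of-envelope sketch; the layer count, hidden size, collective count per layer, and effective NVLink bandwidth are all assumptions for illustration, not measured values.

```python
# Back-of-envelope upper bound on what tp_comm_overlap can hide per step.
# All shapes and bandwidths below are assumptions (Qwen3-32B-like config,
# assumed ~400 GB/s effective per-GPU NVLink bandwidth), not measurements.

def tp_comm_time_per_step(seq_len, hidden, layers, tp, micro_batches,
                          dtype_bytes=2, link_bw=400e9,
                          collectives_per_layer=8):
    """Rough total TP collective time per training step, in seconds.

    With sequence parallelism, each transformer layer performs roughly
    4 activation collectives in forward and 4 in backward (all-gather /
    reduce-scatter), each moving about seq_len * hidden * dtype_bytes,
    scaled by (tp - 1) / tp for ring-style traffic per rank.
    """
    msg_bytes = seq_len * hidden * dtype_bytes * (tp - 1) / tp
    per_layer = collectives_per_layer * msg_bytes / link_bw
    return per_layer * layers * micro_batches

# Assumed Qwen3-32B-ish shapes: 64 layers, hidden 5120, 16 micro-batches.
t = tp_comm_time_per_step(seq_len=16384, hidden=5120, layers=64,
                          tp=8, micro_batches=16)
print(f"Upper bound on hideable TP comm time: {t:.1f} s/step")
```

Under these assumed numbers the total TP communication is only a few seconds out of a 43 s step, so even perfect overlap would yield a small relative speedup; overlap matters more when the communication fraction of the step is large (bigger tp, slower interconnect, or compute-light shapes).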
