Skip to content

Distributed training problem #2

@julycetc

Description

@julycetc

you use nccl in the distributed training, my problem is do you use nccl in pytorch or do you install nccl
seperately?And how do you set your environment variable?I am queite confused about it.Thanks very much!I meet the following problem when i use two machine to run the code.

  1. INFO NET/Plugin : No plugin found (libnccl-net.so)
    2.NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error.
    3.NCCL INFO NET/IB : No device found

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions