
Error while Training: Software caused connection abort #2593

@Nexusyyan

Description

  • Environment: Win11 + WSL2 + ROCm 6.2.3
  • Device: AMD Radeon RX 7900 XTX (gfx1100)
  • Logs:
[rank0]:[W831 16:10:20.914206846 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
  File "/home/admin/GPT-SoVITS/GPT_SoVITS/s2_train.py", line 684, in <module>
    main()
  File "/home/admin/GPT-SoVITS/GPT_SoVITS/s2_train.py", line 61, in main
    mp.spawn(
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/admin/GPT-SoVITS/GPT_SoVITS/s2_train.py", line 200, in run
    net_g = DDP(net_g, device_ids=[rank], find_unused_parameters=True)
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 169.254.83.107<40067> failed : Software caused connection abort

"/home/admin/miniconda3/envs/GPTSoVits/bin/python" -s GPT_SoVITS/s1_train.py --config_file "/home/admin/GPT-SoVITS/TEMP/tmp_s1.yaml"
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/admin/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 171, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/admin/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 147, in main
[rank0]:     trainer.fit(model, data_module, ckpt_path=ckpt_path)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 48, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 598, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 968, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1101, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1263, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3129, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/admin/miniconda3/envs/GPTSoVits/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2417, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: socketStartConnect: Connect to 169.254.83.107<40147> failed : Software caused connection abort

It seems the script uses the IP address of eth1 by default. On my local machine, running `export NCCL_SOCKET_IFNAME=lo` to switch NCCL to the loopback interface makes training work normally.
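
For reference, here is a minimal sketch of the same workaround applied from inside the Python entry point rather than the shell (hypothetical placement, not part of the GPT-SoVITS sources). The `169.254.83.107` address in the log is link-local, consistent with NCCL picking eth1; the variable only has to be set before the NCCL process group is initialized:

```python
import os

# NCCL reads these environment variables when the process group is
# created, so they must be set before dist.init_process_group / mp.spawn.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "lo")  # bind NCCL sockets to the loopback interface
os.environ.setdefault("NCCL_DEBUG", "INFO")        # optional: verbose NCCL logging, as the error message suggests

# ... the rest of the training script (e.g. s2_train.py's main()) runs unchanged.
```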
