RuntimeError: Invalid device string: 'cuda:None' #15
Open
Description
During training (on a Tesla V100-PCIE-16GB) I get the following error:
Train: 0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/anaconda/envs/rtfm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/rtfm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/finetune.py", line 451, in <module>
    main(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/finetune.py", line 408, in main
    results = train(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/dev-medekm-gpu/code/Users/michael.medek/rtfm/rtfm/train_utils.py", line 274, in train
    batch[key] = batch[key].to(f"cuda:{local_rank}")
RuntimeError: Invalid device string: 'cuda:None'
This traces to line 274 of train_utils.py (commit 9884a6b):
    batch[key] = batch[key].to(f"cuda:{local_rank}")
Here local_rank is None, so the formatted device string becomes 'cuda:None', which is invalid. How is this supposed to work? The function's default is local_rank=None, which should be invalid since the value must be an int, right? In evaluate() the parameter is declared only as local_rank: int.
Adding the following works around the issue:
local_rank = 0
rank = 0
print("WARNING! Overwriting local_rank and rank to 0!")
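A slightly less invasive variant of this workaround might resolve the device once instead of overwriting the rank variables. The sketch below is only an illustration: resolve_device is a hypothetical helper, not part of rtfm, and it assumes that falling back to GPU 0 is acceptable for a single-process run. It also reads the LOCAL_RANK environment variable, which torchrun sets for each worker.

```python
import os
from typing import Mapping, Optional


def resolve_device(
    local_rank: Optional[int] = None,
    cuda_available: bool = True,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: build a device string that is never 'cuda:None'."""
    # Allow a custom mapping for testing; default to the real environment.
    env = os.environ if env is None else env
    if not cuda_available:
        return "cpu"
    if local_rank is None:
        # torchrun exports LOCAL_RANK; assume GPU 0 when it is absent
        # (single-process, single-GPU run).
        local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


if __name__ == "__main__":
    print(resolve_device(None, env={}))
    print(resolve_device(None, env={"LOCAL_RANK": "3"}))
```

In the training loop this could be used as batch[key] = batch[key].to(resolve_device(local_rank)), passing cuda_available=torch.cuda.is_available() if a CPU fallback is wanted. Whether the maintainers intend local_rank=None to mean "single GPU" is not stated in the code referenced above, so treat the fallback as an assumption.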