
Add ddp_notebook alias for ddp_fork #13744


Merged: 64 commits merged on Jul 23, 2022

Commits (64)
782ca4a
fork
awaelchli Jun 22, 2022
aefac45
dont set device
awaelchli Jun 23, 2022
efce3c4
parallel dev
awaelchli Jun 23, 2022
810c0ba
add cuda
awaelchli Jun 24, 2022
5fd9cda
update device count
awaelchli Jun 24, 2022
c1b4fd0
fork
awaelchli Jun 24, 2022
daa07ee
cuda available
awaelchli Jun 24, 2022
679f363
set device
awaelchli Jun 24, 2022
c43f827
update
awaelchli Jun 24, 2022
b7e529e
update
awaelchli Jun 24, 2022
b51d172
cuda available
awaelchli Jun 25, 2022
9b41941
formatting
awaelchli Jun 25, 2022
9cea979
unused import
awaelchli Jun 25, 2022
daccd21
test fixes
awaelchli Jun 27, 2022
0ccc3b9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 27, 2022
a080ec0
add docstring
awaelchli Jun 27, 2022
9a914be
Merge remote-tracking branch 'origin/feature/ddp-fork2' into feature/…
awaelchli Jun 27, 2022
eae67cd
update
awaelchli Jun 27, 2022
167a710
update
awaelchli Jun 27, 2022
1bdd79d
fix mocks in tests
awaelchli Jun 27, 2022
297b55a
refactor
awaelchli Jun 27, 2022
6c5b769
fix test
awaelchli Jun 27, 2022
2671810
update lite and enums
awaelchli Jun 27, 2022
ff2a825
typo
awaelchli Jun 27, 2022
0879751
update docs
awaelchli Jun 27, 2022
fe16575
add validation for forking on platforms
awaelchli Jun 27, 2022
582872c
debug no breaking change for devices=1
awaelchli Jun 27, 2022
da70271
fix typo in test
awaelchli Jun 27, 2022
3f9a872
update docstring
awaelchli Jun 27, 2022
7291fa3
added windows test for device parser
awaelchli Jun 27, 2022
785c830
add changelog
awaelchli Jun 27, 2022
dd043ad
add test
awaelchli Jun 27, 2022
da843ee
add tests
awaelchli Jun 27, 2022
1a63662
update error message
awaelchli Jun 27, 2022
093a52e
Comparison section
awaelchli Jun 28, 2022
3be3c17
Merge branch 'master' into feature/ddp-fork2
awaelchli Jun 29, 2022
7b3c132
fork docs
awaelchli Jun 29, 2022
1b95954
typing
awaelchli Jun 29, 2022
df031b3
Merge branch 'master' into feature/ddp-fork2
awaelchli Jun 29, 2022
6855e50
Merge branch 'master' into feature/ddp-fork2
awaelchli Jun 30, 2022
8df3457
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 30, 2022
d5c28b9
fix tests
awaelchli Jul 1, 2022
0152c39
Merge branch 'master' into feature/ddp-fork2
awaelchli Jul 1, 2022
a6a0d09
Update docs/source-pytorch/accelerators/gpu_intermediate.rst
awaelchli Jul 3, 2022
4ede4cb
Update docs/source-pytorch/accelerators/gpu_intermediate.rst
awaelchli Jul 3, 2022
877ed07
Update docs/source-pytorch/accelerators/gpu_intermediate.rst
awaelchli Jul 3, 2022
bf36259
reviews
awaelchli Jul 3, 2022
6263636
Merge remote-tracking branch 'origin/feature/ddp-fork2' into feature/…
awaelchli Jul 3, 2022
ca9a0b3
Merge branch 'master' into feature/ddp-fork2
awaelchli Jul 19, 2022
c9b2601
handle start methods
awaelchli Jul 19, 2022
cca1606
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 19, 2022
e6b19a1
update tests
awaelchli Jul 19, 2022
59c3735
Merge remote-tracking branch 'origin/feature/ddp-fork2' into feature/…
awaelchli Jul 19, 2022
cc06e12
update type
awaelchli Jul 19, 2022
b7ffaef
Merge branch 'master' into feature/ddp-fork2
awaelchli Jul 19, 2022
c5480a1
fix merge errors
awaelchli Jul 19, 2022
87cd344
update tests
awaelchli Jul 19, 2022
b686b3b
remove unused import
awaelchli Jul 19, 2022
d6e194e
Add ddp_notebook alias for ddp_fork
awaelchli Jul 19, 2022
f35a922
refactor
awaelchli Jul 19, 2022
1a21ab5
update docs
awaelchli Jul 20, 2022
4afb950
Merge branch 'master' into feature/ddp-fork2-notebook
awaelchli Jul 22, 2022
51283f4
fix merge error
awaelchli Jul 22, 2022
177c19b
fix merge error
awaelchli Jul 22, 2022
21 changes: 14 additions & 7 deletions docs/source-pytorch/accelerators/gpu_intermediate.rst
@@ -24,7 +24,7 @@ Lightning supports multiple ways of doing distributed training.
- DistributedDataParallel (multiple-gpus across many machines)
- Regular (``strategy='ddp'``)
- Spawn (``strategy='ddp_spawn'``)
- Fork (``strategy='ddp_fork'``)
- Notebook/Fork (``strategy='ddp_notebook'``)
- Horovod (``strategy='horovod'``) (multi-machine, multi-gpu, configured at runtime)
- Bagua (``strategy='bagua'``) (multiple-gpus across many machines with advanced training algorithms)

@@ -101,7 +101,7 @@ There are cases in which it is NOT possible to use DDP. Examples are:
- Jupyter Notebook, Google COLAB, Kaggle, etc.
- You have a nested script without a root package

In these situations you should use `dp` or `ddp_spawn` instead.
In these situations you should use `ddp_notebook` or `dp` instead.

Distributed Data Parallel 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -201,18 +201,25 @@ You can then call your scripts anywhere
python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'


Distributed Data Parallel Fork
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Distributed Data Parallel in Notebooks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DDP Fork is an alternative to Spawn that can be used in interactive Python and Jupyter notebooks, Google Colab, Kaggle notebooks, and so on:
DDP Notebook/Fork is an alternative to Spawn that can be used in interactive Python and Jupyter notebooks, Google Colab, Kaggle notebooks, and so on.
The Trainer enables it by default when such environments are detected.

.. code-block:: python

# train on 8 GPUs in a Jupyter notebook
trainer = Trainer(accelerator="gpu", devices=8)

# can be set explicitly
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_notebook")

# can also be used in non-interactive environments
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_fork")

Data Parallel (``strategy="dp"``) is the only other strategy supported in interactive environments but is slower, is discouraged by PyTorch and has other limitations.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork for its speed and stability but it can only be used with scripts.
Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability but it can only be used with scripts.


Comparison of DDP variants and tradeoffs
@@ -225,7 +232,7 @@ Comparison of DDP variants and tradeoffs
* -
- DDP
- DDP Spawn
- DDP Fork
- DDP Notebook/Fork
* - Works in Jupyter notebooks / IPython environments
- No
- No
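For readers skimming the diff, the documentation change above reduces to the following usage pattern. This is an illustrative sketch only, not code from this PR; the helper names `train_in_notebook` and `train_in_script` are hypothetical, and a machine with two CUDA devices is assumed.

```python
# Illustrative sketch of the documented behavior; not part of this PR's changes.
import pytorch_lightning as pl


def train_in_notebook(model: pl.LightningModule) -> None:
    # In Jupyter/Colab/Kaggle, "ddp_notebook" forks worker processes instead of
    # spawning them, so interactively defined code does not need to be importable.
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
    trainer.fit(model)


def train_in_script(model: pl.LightningModule) -> None:
    # In a plain Python script, regular DDP remains the recommended strategy.
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
    trainer.fit(model)
```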
27 changes: 18 additions & 9 deletions src/pytorch_lightning/strategies/ddp_spawn.py
@@ -53,6 +53,13 @@

log = logging.getLogger(__name__)

_DDP_FORK_ALIASES = (
"ddp_fork",
"ddp_fork_find_unused_parameters_false",
"ddp_notebook",
"ddp_notebook_find_unused_parameters_false",
)


class DDPSpawnStrategy(ParallelStrategy):
"""Spawns processes using the :func:`torch.multiprocessing.spawn` method and joins processes after training
@@ -283,20 +290,22 @@ def post_training_step(self):

@classmethod
def register_strategies(cls, strategy_registry: Dict) -> None:
for start_method in ("spawn", "fork"):
entries = (
("ddp_spawn", "spawn"),
("ddp_spawn_find_unused_parameters_false", "spawn"),
("ddp_fork", "fork"),
("ddp_fork_find_unused_parameters_false", "fork"),
("ddp_notebook", "fork"),
("ddp_notebook_find_unused_parameters_false", "fork"),
)
for name, start_method in entries:
strategy_registry.register(
f"ddp_{start_method}_find_unused_parameters_false",
name,
cls,
description=f"DDP {start_method.title()} strategy with `find_unused_parameters` as False",
description=f"DDP strategy with `find_unused_parameters` as False and `start_method` '{start_method}'",
find_unused_parameters=False,
start_method=start_method,
)
strategy_registry.register(
f"ddp_{start_method}",
cls,
description=f"DDP {start_method.title()} strategy",
start_method=start_method,
)

def teardown(self) -> None:
log.detail(f"{self.__class__.__name__}: tearing down strategy")
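For orientation, the registration refactor above swaps the old loop over start methods for an explicit list of (name, start_method) pairs, which is what makes the `ddp_notebook` aliases possible. Below is a simplified standalone sketch of that mapping; a plain dict stands in for Lightning's `StrategyRegistry`, so this is illustrative, not the actual implementation.

```python
# Simplified sketch of the alias -> (start_method, find_unused_parameters) mapping
# registered above; a plain dict stands in for Lightning's StrategyRegistry.
from typing import Dict, Tuple


def build_alias_table() -> Dict[str, Tuple[str, bool]]:
    entries = (
        ("ddp_spawn", "spawn"),
        ("ddp_spawn_find_unused_parameters_false", "spawn"),
        ("ddp_fork", "fork"),
        ("ddp_fork_find_unused_parameters_false", "fork"),
        ("ddp_notebook", "fork"),  # new alias, same behavior as ddp_fork
        ("ddp_notebook_find_unused_parameters_false", "fork"),
    )
    table = {}
    for name, start_method in entries:
        find_unused = not name.endswith("_find_unused_parameters_false")
        table[name] = (start_method, find_unused)
    return table


assert build_alias_table()["ddp_notebook"] == ("fork", True)
assert build_alias_table()["ddp_notebook_find_unused_parameters_false"] == ("fork", False)
```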
@@ -72,6 +72,7 @@
StrategyRegistry,
TPUSpawnStrategy,
)
from pytorch_lightning.strategies.ddp_spawn import _DDP_FORK_ALIASES
from pytorch_lightning.tuner.auto_gpu_select import pick_multiple_gpus
from pytorch_lightning.utilities import (
_StrategyType,
@@ -614,10 +615,7 @@ def _check_strategy_and_fallback(self) -> None:
f"You selected strategy to be `{DDPFullyShardedNativeStrategy.strategy_name}`, "
"but GPU accelerator is not used."
)
if (
strategy_flag in ("ddp_fork", "ddp_fork_find_unused_parameters_false")
and "fork" not in torch.multiprocessing.get_all_start_methods()
):
if strategy_flag in _DDP_FORK_ALIASES and "fork" not in torch.multiprocessing.get_all_start_methods():
raise ValueError(
f"You selected `Trainer(strategy='{strategy_flag}')` but process forking is not supported on this"
f" platform. We recommend `Trainer(strategy='ddp_spawn')` instead."
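The connector change above routes all four fork-based aliases through a single platform check instead of hard-coding the two `ddp_fork` names. A minimal standalone sketch of that check (error wording paraphrased), relying only on `torch.multiprocessing.get_all_start_methods()`:

```python
# Standalone sketch of the platform validation performed by the connector above.
import torch.multiprocessing as mp

_DDP_FORK_ALIASES = (
    "ddp_fork",
    "ddp_fork_find_unused_parameters_false",
    "ddp_notebook",
    "ddp_notebook_find_unused_parameters_false",
)


def validate_fork_strategy(strategy_flag: str) -> None:
    # Forking is unavailable on Windows, where only "spawn" is offered.
    if strategy_flag in _DDP_FORK_ALIASES and "fork" not in mp.get_all_start_methods():
        raise ValueError(
            f"Strategy '{strategy_flag}' requires process forking, which is not"
            " supported on this platform. Use 'ddp_spawn' instead."
        )
```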
@@ -44,6 +44,7 @@
DeepSpeedStrategy,
SingleDeviceStrategy,
)
from pytorch_lightning.strategies.ddp_spawn import _DDP_FORK_ALIASES
from pytorch_lightning.strategies.hpu_parallel import HPUParallelStrategy
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from tests_pytorch.helpers.runif import RunIf
@@ -749,7 +750,7 @@ def test_accelerator_specific_checkpoint_io(*_):
assert trainer.strategy.checkpoint_io is ckpt_plugin


@pytest.mark.parametrize("strategy", ["ddp_fork", "ddp_fork_find_unused_parameters_false"])
@pytest.mark.parametrize("strategy", _DDP_FORK_ALIASES)
@mock.patch(
"pytorch_lightning.trainer.connectors.accelerator_connector.torch.multiprocessing.get_all_start_methods",
return_value=[],
6 changes: 6 additions & 0 deletions tests/tests_pytorch/strategies/test_strategy_registry.py
@@ -124,6 +124,12 @@ def test_fsdp_strategy_registry(tmpdir):
{"find_unused_parameters": False, "start_method": "fork"},
marks=RunIf(skip_windows=True),
),
pytest.param(
"ddp_notebook_find_unused_parameters_false",
DDPSpawnStrategy,
{"find_unused_parameters": False, "start_method": "fork"},
marks=RunIf(skip_windows=True),
),
(
"ddp_sharded_spawn_find_unused_parameters_false",
DDPSpawnShardedStrategy,
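As a quick sanity check beyond the parametrized tests above, the new aliases can also be inspected interactively; a sketch assuming the dict-like registry access implied by the imports earlier in this PR (entry layout may differ across Lightning versions):

```python
# Interactive check that the new aliases are registered; not part of the test suite.
from pytorch_lightning.strategies import StrategyRegistry
from pytorch_lightning.strategies.ddp_spawn import _DDP_FORK_ALIASES

for name in _DDP_FORK_ALIASES:
    assert name in StrategyRegistry, f"{name} is not registered"
    print(name, StrategyRegistry[name])
```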