
Track CPU stats with DeviceStatsMonitor #11795


Merged
Changes from all commits
62 commits
9188cbe
Add cpu metrics to cpu accelerator
EricWiener Feb 7, 2022
efd5d3b
Add cpu metrics to device stats monitor
EricWiener Feb 7, 2022
29f918d
Add tests to make sure cpu stats are added correctly
EricWiener Feb 7, 2022
dbb6779
Check psutil availability before running
EricWiener Feb 9, 2022
54a70e3
Add cpu_ prefix to all logged cpu values
EricWiener Feb 9, 2022
47002e5
Asserting CPU metric keys aren't in GPU only logger
EricWiener Feb 9, 2022
17d48d7
Rename to _get_and_log_device_stats
EricWiener Feb 9, 2022
edb93ab
Add CPUDeviceStatsEnum
EricWiener Feb 14, 2022
06dd8c4
Fix mypy issues
EricWiener Feb 18, 2022
3bffb48
Change cpu metrics enum to constant str
EricWiener Feb 18, 2022
74e9069
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 18, 2022
317c354
Make CPU metric constants private
EricWiener Feb 25, 2022
10029b9
Update CPU metric imports
EricWiener Feb 25, 2022
14cae03
Update docstring and debugging.rst
EricWiener Feb 25, 2022
e187826
Update docstring and debugging.rst
EricWiener Feb 25, 2022
23d4260
Replace metrics.keys() w metrics
EricWiener Feb 25, 2022
c245ff1
Updated GPU tests
EricWiener Feb 25, 2022
43ce53e
Clean up comments + strings
EricWiener Feb 26, 2022
57eb885
Updated GPU metrics that are checked
EricWiener Feb 26, 2022
374d1ab
Updated GPU metric key
EricWiener Feb 26, 2022
5012b8c
Add back-ticks around code in error message + docs
EricWiener Feb 26, 2022
c7816bf
Parameterized torch + cpu tests and updated docstrings
EricWiener Feb 26, 2022
73d38fb
Only run GPU torch metrics logged for > 1.8
EricWiener Feb 26, 2022
ea0765e
update test
rohitgr7 Feb 27, 2022
5d1c0bc
Remove mocking comment + reduce steps
EricWiener Feb 27, 2022
0e90200
Update change log
EricWiener Feb 27, 2022
4e54f32
Update docs
kaushikb11 Apr 26, 2022
c6864de
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 26, 2022
4ccf4d7
Update cpu stats logging logic
kaushikb11 Apr 27, 2022
ce6be7f
fix errors
kaushikb11 Apr 27, 2022
629bd29
Update docstring
kaushikb11 Apr 28, 2022
59453d6
Update DeviceStatsMonitor and tests
kaushikb11 Apr 28, 2022
e97ab31
Update RunIf
kaushikb11 Apr 28, 2022
ae4a2d9
Update tests
kaushikb11 Apr 28, 2022
33c2bf4
Self review
carmocca Apr 28, 2022
26b43c4
Refactor test
carmocca Apr 28, 2022
56afcfb
Simplify tests
carmocca Apr 28, 2022
b3f0b81
Convert global to local
carmocca Apr 28, 2022
d540a19
Fix tests
kaushikb11 Apr 29, 2022
6569d88
Merge branch 'master' into feature/auto-track-cpu-stats
kaushikb11 May 4, 2022
9c4bb1f
Update tests/callbacks/test_device_stats_monitor.py
kaushikb11 May 4, 2022
ae4c99f
Update pytorch_lightning/callbacks/device_stats_monitor.py
kaushikb11 May 4, 2022
f50d0a7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 4, 2022
5a71600
Address reviews
kaushikb11 May 4, 2022
08d4ed2
Merge branch 'feature/auto-track-cpu-stats' of https://github.com/Eri…
kaushikb11 May 4, 2022
a3aa1f5
Update pytorch_lightning/callbacks/device_stats_monitor.py
kaushikb11 May 4, 2022
6d136fb
Update docs/source/tuning/profiler_basic.rst
kaushikb11 May 4, 2022
941c630
Address Rohit's comments
carmocca May 4, 2022
fd45c8d
Address reviews
kaushikb11 May 5, 2022
941315e
Update pytorch_lightning/callbacks/device_stats_monitor.py
kaushikb11 May 5, 2022
25b8c5e
Update pytorch_lightning/accelerators/cpu.py
kaushikb11 May 5, 2022
b392415
Fix tpu tests
kaushikb11 May 5, 2022
21dd440
Merge branch 'feature/auto-track-cpu-stats' of https://github.com/Eri…
kaushikb11 May 5, 2022
01430e2
Merge branch 'master' into feature/auto-track-cpu-stats
carmocca May 5, 2022
7f64889
TPU test
kaushikb11 May 6, 2022
41cb8eb
Merge branch 'master' into feature/auto-track-cpu-stats
kaushikb11 May 9, 2022
f06cc07
only on fit
rohitgr7 May 9, 2022
2ef6555
Fix test
kaushikb11 May 9, 2022
f394a1e
Fix tpu tests
kaushikb11 May 9, 2022
e90d3f2
Merge branch 'master' into feature/auto-track-cpu-stats
kaushikb11 May 9, 2022
c17bf69
Fix tests
kaushikb11 May 10, 2022
a052a38
Fix tests
kaushikb11 May 10, 2022
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -590,6 +590,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed an issue with resuming from a checkpoint trained with QAT ([#11346](https://github.com/PyTorchLightning/pytorch-lightning/pull/11346))


- Added CPU metric tracking to `DeviceStatsMonitor` ([#11795](https://github.com/PyTorchLightning/pytorch-lightning/pull/11795))


## [1.5.10] - 2022-02-08

### Fixed
2 changes: 1 addition & 1 deletion dockers/tpu-tests/tpu_test_cases.jsonnet
@@ -32,12 +32,12 @@ local tputests = base.BaseTest {
pip install -e .
echo $KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS
export XRT_TPU_CONFIG="tpu_worker;0;${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}"
# TODO (@kaushikb11): Add device stats tests here
coverage run --source=pytorch_lightning -m pytest -v --capture=no \
tests/strategies/test_tpu_spawn.py \
tests/profiler/test_xla_profiler.py \
pytorch_lightning/utilities/xla_device.py \
tests/accelerators/test_tpu.py \
tests/callbacks/test_device_stats_monitor.py \
tests/models/test_tpu.py
test_exit_code=$?
echo "\n||| END PYTEST LOGS |||\n"
3 changes: 3 additions & 0 deletions docs/source/tuning/profiler_basic.rst
@@ -119,3 +119,6 @@ This can be measured with the :class:`~pytorch_lightning.callbacks.device_stats_
from pytorch_lightning.callbacks import DeviceStatsMonitor

trainer = Trainer(callbacks=[DeviceStatsMonitor()])

CPU metrics are tracked by default when running on the CPU accelerator. To enable them for other accelerators, set ``DeviceStatsMonitor(cpu_stats=True)``. To disable logging
of CPU metrics, pass ``DeviceStatsMonitor(cpu_stats=False)``.
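
For example, a minimal sketch of enabling CPU stats alongside a GPU run (the ``accelerator``/``devices`` values are only illustrative):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor

    # also log CPU utilization while training on a GPU
    trainer = Trainer(
        accelerator="gpu",
        devices=1,
        callbacks=[DeviceStatsMonitor(cpu_stats=True)],
    )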
26 changes: 24 additions & 2 deletions pytorch_lightning/accelerators/cpu.py
@@ -18,6 +18,7 @@
from pytorch_lightning.accelerators.accelerator import Accelerator
from pytorch_lightning.utilities import device_parser
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.imports import _PSUTIL_AVAILABLE
from pytorch_lightning.utilities.types import _DEVICE


@@ -35,8 +36,8 @@ def setup_environment(self, root_device: torch.device) -> None:
raise MisconfigurationException(f"Device should be CPU, got {root_device} instead.")

def get_device_stats(self, device: _DEVICE) -> Dict[str, Any]:
"""CPU device stats aren't supported yet."""
return {}
"""Get CPU stats from ``psutil`` package."""
return get_cpu_stats()

@staticmethod
def parse_devices(devices: Union[int, str, List[int]]) -> int:
@@ -67,3 +68,24 @@ def register_accelerators(cls, accelerator_registry: Dict) -> None:
cls,
description=f"{cls.__class__.__name__}",
)


# CPU device metrics
_CPU_VM_PERCENT = "cpu_vm_percent"
_CPU_PERCENT = "cpu_percent"
_CPU_SWAP_PERCENT = "cpu_swap_percent"


def get_cpu_stats() -> Dict[str, float]:
if not _PSUTIL_AVAILABLE:
raise ModuleNotFoundError(
"Fetching CPU device stats requires `psutil` to be installed."
" Install it by running `pip install -U psutil`."
)
import psutil

return {
_CPU_VM_PERCENT: psutil.virtual_memory().percent,
_CPU_PERCENT: psutil.cpu_percent(),
_CPU_SWAP_PERCENT: psutil.swap_memory().percent,
}
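
For reference, a rough sketch of what the new helper returns (the key names come from the constants above; the values shown are made up):

    from pytorch_lightning.accelerators.cpu import get_cpu_stats

    # requires `psutil` to be installed, otherwise a ModuleNotFoundError is raised
    stats = get_cpu_stats()
    print(stats)
    # e.g. {"cpu_vm_percent": 41.3, "cpu_percent": 12.5, "cpu_swap_percent": 0.0}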
77 changes: 54 additions & 23 deletions pytorch_lightning/callbacks/device_stats_monitor.py
@@ -23,15 +23,23 @@
import pytorch_lightning as pl
from pytorch_lightning.callbacks.base import Callback
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.imports import _PSUTIL_AVAILABLE
from pytorch_lightning.utilities.rank_zero import rank_zero_deprecation, rank_zero_warn
from pytorch_lightning.utilities.types import STEP_OUTPUT
from pytorch_lightning.utilities.warnings import rank_zero_deprecation


class DeviceStatsMonitor(Callback):
r"""
Automatically monitors and logs device stats during the training stage. ``DeviceStatsMonitor``
is a special callback as it requires a ``logger`` to be passed as an argument to the ``Trainer``.

Args:
cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
In that case, a warning is raised if ``psutil`` is not installed (this will become an exception in v1.9.0).
If ``True``, it will log CPU stats regardless of the accelerator, and it will
raise an exception if ``psutil`` is not installed.
If ``False``, it will not log CPU stats regardless of the accelerator.

Raises:
MisconfigurationException:
If ``Trainer`` has no logger.
@@ -43,45 +51,68 @@ class DeviceStatsMonitor(Callback):
>>> trainer = Trainer(callbacks=[device_stats]) # doctest: +SKIP
"""

def setup(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", stage: Optional[str] = None) -> None:
if not trainer.loggers:
raise MisconfigurationException("Cannot use DeviceStatsMonitor callback with Trainer that has no logger.")
def __init__(self, cpu_stats: Optional[bool] = None) -> None:
self._cpu_stats = cpu_stats

def on_train_batch_start(
self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", batch: Any, batch_idx: int
def setup(
self,
trainer: "pl.Trainer",
pl_module: "pl.LightningModule",
stage: Optional[str] = None,
) -> None:
if stage != "fit":
return

if not trainer.loggers:
raise MisconfigurationException("Cannot use `DeviceStatsMonitor` callback with `Trainer(logger=False)`.")

# warn in setup to warn once
device = trainer.strategy.root_device
if self._cpu_stats is None and device.type == "cpu" and not _PSUTIL_AVAILABLE:
# TODO: raise an exception from v1.9
rank_zero_warn(
"`DeviceStatsMonitor` will not log CPU stats as `psutil` is not installed."
" To install `psutil`, run `pip install psutil`."
" It will raise an exception if `psutil` is not installed post v1.9.0."
)
self._cpu_stats = False

def _get_and_log_device_stats(self, trainer: "pl.Trainer", key: str) -> None:
if not trainer._logger_connector.should_update_logs:
return

device = trainer.strategy.root_device
if self._cpu_stats is False and device.type == "cpu":
# cpu stats are disabled
return

device_stats = trainer.accelerator.get_device_stats(device)

if self._cpu_stats and device.type != "cpu":
# Don't query CPU stats twice if CPU is accelerator
from pytorch_lightning.accelerators.cpu import get_cpu_stats

device_stats.update(get_cpu_stats())

for logger in trainer.loggers:
separator = logger.group_separator
prefixed_device_stats = _prefix_metric_keys(
device_stats, f"{self.__class__.__qualname__}.on_train_batch_start", separator
)
prefixed_device_stats = _prefix_metric_keys(device_stats, f"{self.__class__.__qualname__}.{key}", separator)
logger.log_metrics(prefixed_device_stats, step=trainer.fit_loop.epoch_loop._batches_that_stepped)

def on_train_batch_start(
self,
trainer: "pl.Trainer",
pl_module: "pl.LightningModule",
batch: Any,
batch_idx: int,
unused: Optional[int] = 0,
) -> None:
self._get_and_log_device_stats(trainer, "on_train_batch_start")

def on_train_batch_end(
self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", outputs: STEP_OUTPUT, batch: Any, batch_idx: int
) -> None:
if not trainer.loggers:
raise MisconfigurationException("Cannot use `DeviceStatsMonitor` callback with `Trainer(logger=False)`.")

if not trainer._logger_connector.should_update_logs:
return

device = trainer.strategy.root_device
device_stats = trainer.accelerator.get_device_stats(device)
for logger in trainer.loggers:
separator = logger.group_separator
prefixed_device_stats = _prefix_metric_keys(
device_stats, f"{self.__class__.__qualname__}.on_train_batch_end", separator
)
logger.log_metrics(prefixed_device_stats, step=trainer.fit_loop.epoch_loop._batches_that_stepped)
self._get_and_log_device_stats(trainer, "on_train_batch_end")


def _prefix_metric_keys(metrics_dict: Dict[str, float], prefix: str, separator: str) -> Dict[str, float]:
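
As a rough sketch of the resulting metric names (the separator comes from each logger's ``group_separator``, shown here as ``/``; the values are made up), a CPU run would log entries such as:

    {
        "DeviceStatsMonitor.on_train_batch_start/cpu_percent": 13.7,
        "DeviceStatsMonitor.on_train_batch_start/cpu_vm_percent": 41.3,
        "DeviceStatsMonitor.on_train_batch_start/cpu_swap_percent": 0.0,
    }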
3 changes: 2 additions & 1 deletion pytorch_lightning/utilities/imports.py
@@ -105,6 +105,7 @@ def _compare_version(package: str, op: Callable, version: str, use_base_version:
_FAIRSCALE_OSS_FP16_BROADCAST_AVAILABLE = _FAIRSCALE_AVAILABLE and _compare_version("fairscale", operator.ge, "0.3.3")
_FAIRSCALE_FULLY_SHARDED_AVAILABLE = _FAIRSCALE_AVAILABLE and _compare_version("fairscale", operator.ge, "0.3.4")
_GROUP_AVAILABLE = not _IS_WINDOWS and _module_available("torch.distributed.group")
_HABANA_FRAMEWORK_AVAILABLE = _package_available("habana_frameworks")
_HIVEMIND_AVAILABLE = _package_available("hivemind")
_HOROVOD_AVAILABLE = _module_available("horovod.torch")
_HYDRA_AVAILABLE = _package_available("hydra")
@@ -115,7 +116,7 @@ def _compare_version(package: str, op: Callable, version: str, use_base_version:
_NEPTUNE_GREATER_EQUAL_0_9 = _NEPTUNE_AVAILABLE and _compare_version("neptune", operator.ge, "0.9.0")
_OMEGACONF_AVAILABLE = _package_available("omegaconf")
_POPTORCH_AVAILABLE = _package_available("poptorch")
_HABANA_FRAMEWORK_AVAILABLE = _package_available("habana_frameworks")
_PSUTIL_AVAILABLE = _package_available("psutil")
_RICH_AVAILABLE = _package_available("rich") and _compare_version("rich", operator.ge, "10.2.2")
_TORCH_QUANTIZE_AVAILABLE = bool([eg for eg in torch.backends.quantized.supported_engines if eg != "none"])
_TORCHTEXT_AVAILABLE = _package_available("torchtext")
1 change: 1 addition & 0 deletions requirements/test.txt
@@ -12,4 +12,5 @@ pytest-forked
cloudpickle>=1.3
scikit-learn>0.22.1
onnxruntime
psutil # for `DeviceStatsMonitor`
pandas # needed in benchmarks
68 changes: 62 additions & 6 deletions tests/callbacks/test_device_stats_monitor.py
@@ -12,10 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Dict, Optional
from unittest import mock
from unittest.mock import Mock

import pytest
import torch

from pytorch_lightning import Trainer
from pytorch_lightning.accelerators.cpu import _CPU_PERCENT, _CPU_SWAP_PERCENT, _CPU_VM_PERCENT, get_cpu_stats
from pytorch_lightning.callbacks import DeviceStatsMonitor
from pytorch_lightning.callbacks.device_stats_monitor import _prefix_metric_keys
from pytorch_lightning.loggers import CSVLogger
@@ -34,9 +38,13 @@ def test_device_stats_gpu_from_torch(tmpdir):
class DebugLogger(CSVLogger):
@rank_zero_only
def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
fields = ["allocated_bytes.all.freed", "inactive_split.all.peak", "reserved_bytes.large_pool.peak"]
fields = [
"allocated_bytes.all.freed",
"inactive_split.all.peak",
"reserved_bytes.large_pool.peak",
]
for f in fields:
assert any(f in h for h in metrics.keys())
assert any(f in h for h in metrics)

trainer = Trainer(
default_root_dir=tmpdir,
@@ -54,6 +62,41 @@ def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) ->
trainer.fit(model)


@RunIf(psutil=True)
@pytest.mark.parametrize("cpu_stats", (None, True, False))
@mock.patch("pytorch_lightning.accelerators.cpu.get_cpu_stats", side_effect=get_cpu_stats)
def test_device_stats_cpu(cpu_stats_mock, tmpdir, cpu_stats):
"""Test CPU stats are logged when no accelerator is used."""
model = BoringModel()
CPU_METRIC_KEYS = (_CPU_VM_PERCENT, _CPU_SWAP_PERCENT, _CPU_PERCENT)

class DebugLogger(CSVLogger):
def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
enabled = cpu_stats is not False
for f in CPU_METRIC_KEYS:
has_cpu_metrics = any(f in h for h in metrics)
assert has_cpu_metrics if enabled else not has_cpu_metrics

device_stats = DeviceStatsMonitor(cpu_stats=cpu_stats)
trainer = Trainer(
default_root_dir=tmpdir,
max_epochs=1,
limit_train_batches=2,
limit_val_batches=0,
log_every_n_steps=1,
callbacks=device_stats,
logger=DebugLogger(tmpdir),
enable_checkpointing=False,
enable_progress_bar=False,
accelerator="cpu",
)
trainer.fit(model)

expected = 4 if cpu_stats is not False else 0 # (batch_start + batch_end) * train_batches
assert cpu_stats_mock.call_count == expected


@pytest.mark.skipif(True, reason="TODO (@kaushikb11): fix this test, timeout")
@RunIf(tpu=True)
def test_device_stats_monitor_tpu(tmpdir):
"""Test TPU stats are logged using a logger."""
@@ -66,14 +109,14 @@ class DebugLogger(CSVLogger):
def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
fields = ["avg. free memory (MB)", "avg. peak memory (MB)"]
for f in fields:
assert any(f in h for h in metrics.keys())
assert any(f in h for h in metrics)

trainer = Trainer(
default_root_dir=tmpdir,
max_epochs=1,
limit_train_batches=1,
limit_train_batches=2,
accelerator="tpu",
devices=8,
devices=1,
log_every_n_steps=1,
callbacks=[device_stats],
logger=DebugLogger(tmpdir),
@@ -99,7 +142,7 @@ def test_device_stats_monitor_no_logger(tmpdir):
enable_progress_bar=False,
)

with pytest.raises(MisconfigurationException, match="Trainer that has no logger."):
with pytest.raises(MisconfigurationException, match="Cannot use `DeviceStatsMonitor` callback."):
trainer.fit(model)


@@ -110,3 +153,16 @@ def test_prefix_metric_keys(tmpdir):
separator = "."
converted_metrics = _prefix_metric_keys(metrics, prefix, separator)
assert converted_metrics == {"foo.1": 1.0, "foo.2": 2.0, "foo.3": 3.0}


def test_device_stats_monitor_warning_when_psutil_not_available(monkeypatch):
"""Test that warning is raised when psutil is not available."""
import pytorch_lightning.callbacks.device_stats_monitor as imports

monkeypatch.setattr(imports, "_PSUTIL_AVAILABLE", False)
monitor = DeviceStatsMonitor()
trainer = Trainer()
assert trainer.strategy.root_device == torch.device("cpu")
# TODO: raise an exception from v1.9
with pytest.warns(UserWarning, match="psutil` is not installed"):
monitor.setup(trainer, Mock(), "fit")
9 changes: 8 additions & 1 deletion tests/helpers/runif.py
@@ -20,7 +20,7 @@
from packaging.version import Version
from pkg_resources import get_distribution

from pytorch_lightning.utilities import (
from pytorch_lightning.utilities.imports import (
_APEX_AVAILABLE,
_BAGUA_AVAILABLE,
_DEEPSPEED_AVAILABLE,
@@ -31,6 +31,7 @@
_HPU_AVAILABLE,
_IPU_AVAILABLE,
_OMEGACONF_AVAILABLE,
_PSUTIL_AVAILABLE,
_RICH_AVAILABLE,
_TORCH_GREATER_EQUAL_1_10,
_TORCH_QUANTIZE_AVAILABLE,
@@ -85,6 +86,7 @@ def __new__(
omegaconf: bool = False,
slow: bool = False,
bagua: bool = False,
psutil: bool = False,
hivemind: bool = False,
**kwargs,
):
@@ -113,6 +115,7 @@ def __new__(
omegaconf: Require that omry/omegaconf is installed.
slow: Mark the test as slow, our CI will run it in a separate job.
bagua: Require that BaguaSys/bagua is installed.
psutil: Require that psutil is installed.
hivemind: Require that Hivemind is installed.
**kwargs: Any :class:`pytest.mark.skipif` keyword arguments.
"""
@@ -234,6 +237,10 @@ def __new__(
conditions.append(not _BAGUA_AVAILABLE or sys.platform in ("win32", "darwin"))
reasons.append("Bagua")

if psutil:
conditions.append(not _PSUTIL_AVAILABLE)
reasons.append("psutil")

if hivemind:
conditions.append(not _HIVEMIND_AVAILABLE or sys.platform in ("win32", "darwin"))
reasons.append("Hivemind")
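
For context, a minimal sketch of how the new ``psutil`` requirement would be used in a test (assuming the usual ``from tests.helpers.runif import RunIf`` import; the test body is only a placeholder):

    from tests.helpers.runif import RunIf


    @RunIf(psutil=True)  # skipped automatically when `psutil` is not installed
    def test_something_that_needs_psutil():
        import psutil

        assert psutil.cpu_percent() >= 0.0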