Commit b3203d9

Added support for HPU device stats monitor (#13819)

Authored by jerome-habana, with pre-commit-ci[bot], kaushikb11, and rohitgr7.

Commit message:

* Added support for HPU device stats monitor (Signed-off-by: Jerome)
* Update changelog (Signed-off-by: Jerome)
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Apply suggestions from code review (Co-authored-by: Kaushik B)
* Update reference (Signed-off-by: Jerome)
* Apply suggestions from code review (Co-authored-by: Rohit Gupta)
* fix alignment
* add descriptions
* Update hpu_intermediate.rst

Co-authored-by: pre-commit-ci[bot], Kaushik B, Rohit Gupta

1 parent eb233ea · commit b3203d9

File tree: 5 files changed, +72 −4 lines

docs/source-pytorch/accelerators/hpu_basic.rst (0 additions, 1 deletion)

@@ -79,5 +79,4 @@ Known limitations
 -----------------

 * `Habana dataloader <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#habana-data-loader>`__ is not supported.
-* :class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is not supported.
 * :func:`torch.inference_mode` is not supported
docs/source-pytorch/accelerators/hpu_intermediate.rst (31 additions, 0 deletions)

@@ -66,3 +66,34 @@ This enables advanced users to provide their own BF16 and FP32 operator list ins
     trainer.fit(model, datamodule=dm)

 For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`__.
+
+----
+
+Enabling DeviceStatsMonitor with HPUs
+----------------------------------------
+
+:class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is a callback that automatically monitors and logs device stats during the training stage.
+This callback can be passed for training with HPUs. It returns a map of the following metrics with their values in bytes of type uint64:
+
+- **Limit**: amount of total memory on HPU device.
+- **InUse**: amount of allocated memory at any instance.
+- **MaxInUse**: amount of total active memory allocated.
+- **NumAllocs**: number of allocations.
+- **NumFrees**: number of freed chunks.
+- **ActiveAllocs**: number of active allocations.
+- **MaxAllocSize**: maximum allocated size.
+- **TotalSystemAllocs**: total number of system allocations.
+- **TotalSystemFrees**: total number of system frees.
+- **TotalActiveAllocs**: total number of active allocations.
+
+The below snippet shows how DeviceStatsMonitor can be enabled.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.callbacks import DeviceStatsMonitor
+
+    device_stats = DeviceStatsMonitor()
+    trainer = Trainer(accelerator="hpu", callbacks=[device_stats])
+
+For more details, please refer to `Memory Stats APIs <https://docs.habana.ai/en/v1.5.0/PyTorch/PyTorch_User_Guide/Python_Packages.html#memory-stats-apis>`__.
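The snippet in the documentation above only attaches the callback; what it logs is the raw map of uint64 byte and counter fields listed earlier. As a rough sketch of post-processing such a metric map (the helper name and the sample dict below are invented for illustration and are not part of this PR):

```python
def summarize_memory_stats(stats):
    """Render byte-valued fields (Limit, InUse, ...) as MiB strings.

    Counter fields (NumAllocs, NumFrees, ...) are passed through unchanged.
    """
    byte_fields = {"Limit", "InUse", "MaxInUse", "MaxAllocSize"}
    summary = {}
    for name, value in stats.items():
        if name in byte_fields:
            summary[name] = f"{value / 2**20:.1f} MiB"
        else:
            summary[name] = value  # counters stay as raw integers
    return summary


# Hypothetical stats, shaped like the metric map described above.
sample = {"Limit": 32 * 2**30, "InUse": 512 * 2**20, "NumAllocs": 42}
print(summarize_memory_stats(sample))
# -> {'Limit': '32768.0 MiB', 'InUse': '512.0 MiB', 'NumAllocs': 42}
```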

src/pytorch_lightning/CHANGELOG.md (3 additions, 0 deletions)

@@ -111,6 +111,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for async checkpointing ([#13658](https://github.com/Lightning-AI/lightning/pull/13658))


+- Added support for HPU Device stats monitor ([#13819](https://github.com/Lightning-AI/lightning/pull/13819))
+
+
 ### Changed

 - `accelerator="gpu"` now automatically selects an available GPU backend (CUDA and MPS currently) ([#13642](https://github.com/Lightning-AI/lightning/pull/13642))

src/pytorch_lightning/accelerators/hpu.py (18 additions, 3 deletions)

@@ -39,9 +39,24 @@ def setup_environment(self, root_device: torch.device) -> None:
             raise MisconfigurationException(f"Device should be HPU, got {root_device} instead.")

     def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
-        """HPU device stats aren't supported yet."""
-        rank_zero_debug("HPU device stats aren't supported yet.")
-        return {}
+        """Returns a map of the following metrics with their values:
+
+        - Limit: amount of total memory on HPU device.
+        - InUse: amount of allocated memory at any instance.
+        - MaxInUse: amount of total active memory allocated.
+        - NumAllocs: number of allocations.
+        - NumFrees: number of freed chunks.
+        - ActiveAllocs: number of active allocations.
+        - MaxAllocSize: maximum allocated size.
+        - TotalSystemAllocs: total number of system allocations.
+        - TotalSystemFrees: total number of system frees.
+        - TotalActiveAllocs: total number of active allocations.
+        """
+        try:
+            return torch_hpu.hpu.memory_stats(device)
+        except (AttributeError, NameError):
+            rank_zero_debug("HPU `get_device_stats` failed")
+            return {}

     @staticmethod
     def parse_devices(devices: Union[int, str, List[int]]) -> Optional[int]:
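The try/except in the diff above degrades gracefully when the Habana backend is absent: `AttributeError` covers a backend module missing the expected API, `NameError` covers an import that never happened. A self-contained sketch of the same pattern (the function and stand-in callable below are invented for illustration; they do not require HPU hardware):

```python
from typing import Any, Callable, Dict


def safe_device_stats(stats_fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Call a backend stats function, returning {} if the backend is unusable."""
    try:
        return stats_fn()
    except (AttributeError, NameError):
        # Backend missing or incomplete: report empty stats instead of crashing.
        return {}


# Stand-in for torch_hpu.hpu.memory_stats on a machine without the Habana stack.
def missing_backend() -> Dict[str, Any]:
    raise AttributeError("module has no attribute 'memory_stats'")


print(safe_device_stats(missing_backend))  # -> {}
```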

tests/tests_pytorch/accelerators/test_hpu.py (20 additions, 0 deletions)

@@ -303,3 +303,23 @@ def training_epoch_end(self, outputs) -> None:
     trainer.fit(model)

     assert all(model.optims)
+
+
+@RunIf(hpu=True)
+def test_hpu_device_stats_monitor(tmpdir):
+
+    hpu_stats = HPUAccelerator().get_device_stats("hpu")
+    fields = [
+        "Limit",
+        "InUse",
+        "MaxInUse",
+        "NumAllocs",
+        "NumFrees",
+        "ActiveAllocs",
+        "MaxAllocSize",
+        "TotalSystemAllocs",
+        "TotalSystemFrees",
+        "TotalActiveAllocs",
+    ]
+    for f in fields:
+        assert any(f in h for h in hpu_stats.keys())
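The assertion in the test above uses substring matching (`f in h`) rather than exact key equality, so it still passes if the backend namespaces its keys. A standalone sketch of the same check against an invented stats dict (no HPU required; the key prefix is hypothetical):

```python
# Hypothetical stats map, shaped like a namespaced memory-stats result.
stats = {"memory.Limit": 1024, "memory.InUse": 256, "memory.NumAllocs": 7}

expected_fields = ["Limit", "InUse", "NumAllocs"]

# A field "passes" if it appears as a substring of any reported key.
missing = [f for f in expected_fields if not any(f in key for key in stats)]
print(missing)  # -> []
```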
