Commit b3203d9

Added support for HPU device stats monitor (#13819)

Authored by jerome-habana, with pre-commit-ci[bot], kaushikb11, and rohitgr7.

Commit message:

* Added support for HPU device stats monitor (Signed-off-by: Jerome)
* Update changelog (Signed-off-by: Jerome)
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Apply suggestions from code review (Co-authored-by: Kaushik B)
* Update reference (Signed-off-by: Jerome)
* Apply suggestions from code review (Co-authored-by: Rohit Gupta)
* fix alignment
* add descriptions
* Update hpu_intermediate.rst

Co-authored-by: pre-commit-ci[bot], Kaushik B, Rohit Gupta

1 parent eb233ea · commit b3203d9

File tree: 5 files changed, +72 −4 lines

docs/source-pytorch/accelerators/hpu_basic.rst (0 additions, 1 deletion)

@@ -79,5 +79,4 @@ Known limitations
 -----------------

 * `Habana dataloader <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#habana-data-loader>`__ is not supported.
-* :class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is not supported.
 * :func:`torch.inference_mode` is not supported
docs/source-pytorch/accelerators/hpu_intermediate.rst (31 additions, 0 deletions)

@@ -66,3 +66,34 @@ This enables advanced users to provide their own BF16 and FP32 operator list ins
     trainer.fit(model, datamodule=dm)

 For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`__.
+
+----
+
+Enabling DeviceStatsMonitor with HPUs
+----------------------------------------
+
+:class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is a callback that automatically monitors and logs device stats during the training stage.
+This callback can be passed for training with HPUs. It returns a map of the following metrics with their values in bytes of type uint64:
+
+- **Limit**: amount of total memory on HPU device.
+- **InUse**: amount of allocated memory at any instance.
+- **MaxInUse**: amount of total active memory allocated.
+- **NumAllocs**: number of allocations.
+- **NumFrees**: number of freed chunks.
+- **ActiveAllocs**: number of active allocations.
+- **MaxAllocSize**: maximum allocated size.
+- **TotalSystemAllocs**: total number of system allocations.
+- **TotalSystemFrees**: total number of system frees.
+- **TotalActiveAllocs**: total number of active allocations.
+
+The below snippet shows how DeviceStatsMonitor can be enabled.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.callbacks import DeviceStatsMonitor
+
+    device_stats = DeviceStatsMonitor()
+    trainer = Trainer(accelerator="hpu", callbacks=[device_stats])
+
+For more details, please refer to `Memory Stats APIs <https://docs.habana.ai/en/v1.5.0/PyTorch/PyTorch_User_Guide/Python_Packages.html#memory-stats-apis>`__.
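The snippet in the documentation above only attaches the callback; what it logs is the raw map of uint64 byte and counter fields listed earlier. As a rough sketch of post-processing such a metric map (the helper name and the sample dict below are invented for illustration and are not part of this PR):

```python
def summarize_memory_stats(stats):
    """Render byte-valued fields (Limit, InUse, ...) as MiB strings.

    Counter fields (NumAllocs, NumFrees, ...) are passed through unchanged.
    """
    byte_fields = {"Limit", "InUse", "MaxInUse", "MaxAllocSize"}
    summary = {}
    for name, value in stats.items():
        if name in byte_fields:
            summary[name] = f"{value / 2**20:.1f} MiB"
        else:
            summary[name] = value  # counters stay as raw integers
    return summary


# Hypothetical stats, shaped like the metric map described above.
sample = {"Limit": 32 * 2**30, "InUse": 512 * 2**20, "NumAllocs": 42}
print(summarize_memory_stats(sample))
# -> {'Limit': '32768.0 MiB', 'InUse': '512.0 MiB', 'NumAllocs': 42}
```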

src/pytorch_lightning/CHANGELOG.md (3 additions, 0 deletions)

@@ -111,6 +111,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for async checkpointing ([#13658](https://github.com/Lightning-AI/lightning/pull/13658))


+- Added support for HPU Device stats monitor ([#13819](https://github.com/Lightning-AI/lightning/pull/13819))
+
+
 ### Changed

 - `accelerator="gpu"` now automatically selects an available GPU backend (CUDA and MPS currently) ([#13642](https://github.com/Lightning-AI/lightning/pull/13642))

src/pytorch_lightning/accelerators/hpu.py (18 additions, 3 deletions)

@@ -39,9 +39,24 @@ def setup_environment(self, root_device: torch.device) -> None:
             raise MisconfigurationException(f"Device should be HPU, got {root_device} instead.")

     def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
-        """HPU device stats aren't supported yet."""
-        rank_zero_debug("HPU device stats aren't supported yet.")
-        return {}
+        """Returns a map of the following metrics with their values:
+
+        - Limit: amount of total memory on HPU device.
+        - InUse: amount of allocated memory at any instance.
+        - MaxInUse: amount of total active memory allocated.
+        - NumAllocs: number of allocations.
+        - NumFrees: number of freed chunks.
+        - ActiveAllocs: number of active allocations.
+        - MaxAllocSize: maximum allocated size.
+        - TotalSystemAllocs: total number of system allocations.
+        - TotalSystemFrees: total number of system frees.
+        - TotalActiveAllocs: total number of active allocations.
+        """
+        try:
+            return torch_hpu.hpu.memory_stats(device)
+        except (AttributeError, NameError):
+            rank_zero_debug("HPU `get_device_stats` failed")
+            return {}

     @staticmethod
     def parse_devices(devices: Union[int, str, List[int]]) -> Optional[int]:
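The try/except in the diff above degrades gracefully when the Habana backend is absent: `AttributeError` covers a backend module missing the expected API, `NameError` covers an import that never happened. A self-contained sketch of the same pattern (the function and stand-in callable below are invented for illustration; they do not require HPU hardware):

```python
from typing import Any, Callable, Dict


def safe_device_stats(stats_fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Call a backend stats function, returning {} if the backend is unusable."""
    try:
        return stats_fn()
    except (AttributeError, NameError):
        # Backend missing or incomplete: report empty stats instead of crashing.
        return {}


# Stand-in for torch_hpu.hpu.memory_stats on a machine without the Habana stack.
def missing_backend() -> Dict[str, Any]:
    raise AttributeError("module has no attribute 'memory_stats'")


print(safe_device_stats(missing_backend))  # -> {}
```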

tests/tests_pytorch/accelerators/test_hpu.py (20 additions, 0 deletions)

@@ -303,3 +303,23 @@ def training_epoch_end(self, outputs) -> None:
     trainer.fit(model)

     assert all(model.optims)
+
+
+@RunIf(hpu=True)
+def test_hpu_device_stats_monitor(tmpdir):
+
+    hpu_stats = HPUAccelerator().get_device_stats("hpu")
+    fields = [
+        "Limit",
+        "InUse",
+        "MaxInUse",
+        "NumAllocs",
+        "NumFrees",
+        "ActiveAllocs",
+        "MaxAllocSize",
+        "TotalSystemAllocs",
+        "TotalSystemFrees",
+        "TotalActiveAllocs",
+    ]
+    for f in fields:
+        assert any(f in h for h in hpu_stats.keys())
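The assertion in the test above uses substring matching (`f in h`) rather than exact key equality, so it still passes if the backend namespaces its keys. A standalone sketch of the same check against an invented stats dict (no HPU required; the key prefix is hypothetical):

```python
# Hypothetical stats map, shaped like a namespaced memory-stats result.
stats = {"memory.Limit": 1024, "memory.InUse": 256, "memory.NumAllocs": 7}

expected_fields = ["Limit", "InUse", "NumAllocs"]

# A field "passes" if it appears as a substring of any reported key.
missing = [f for f in expected_fields if not any(f in key for key in stats)]
print(missing)  # -> []
```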
