# Test Plan: Accelerator HAL Migration

This document outlines the test plan to verify that the migration to the Accelerator HAL (Hardware Abstraction Layer) preserves existing functionality for NVML-based monitoring and health checks.

## Objective

Ensure that all existing NVML paths (`nvml_monitor` and `check_nvidia_smi`) continue to function identically after being refactored to use the `AcceleratorManager` and `NVMLBackend` interface.

## Coverage Areas

1. **Metric Collection (`nvml_monitor`)**: Verify that GPU metrics (utilization, memory, power, temperature, clocks, ECC) are collected correctly.
2. **Health Checks (`check_nvidia_smi`)**: Verify GPU presence, running processes, and error detection.
3. **Error Handling**: Ensure that backend unavailability or device errors are handled gracefully and logged appropriately.

## Test Cases

### 1. Unit Tests

Run the existing unit tests to verify there are no regressions in logic.

```bash
pytest gcm/tests/test_accelerator_hal.py
pytest gcm/tests/health_checks_tests/test_check_nvidia_smi.py
pytest gcm/tests/test_nvml_monitor.py
```
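
A regression test against the stubbed NVML library might take the following shape. This is a minimal sketch: `FakeNVMLBackend` and the `AcceleratorManager` constructor shown here are illustrative stand-ins, not the actual interfaces in `gcm/accelerator`.

```python
# Hypothetical sketch: the real AcceleratorManager / NVMLBackend live in
# gcm/accelerator; these minimal stand-ins only illustrate the test shape.

class FakeNVMLBackend:
    """Stubbed backend that reports two fake devices."""
    name = "nvml"

    def is_available(self):
        return True

    def device_count(self):
        return 2


class AcceleratorManager:
    """Probes the registered backends and keeps the first available one."""

    def __init__(self, backends):
        self.backend = next((b for b in backends if b.is_available()), None)


def test_manager_selects_stubbed_backend():
    manager = AcceleratorManager([FakeNVMLBackend()])
    assert manager.backend is not None
    assert manager.backend.device_count() == 2
```

The key property under test is backend selection: with only the stub registered, the manager must pick it and report the stub's device count unchanged.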

### 2. Manual Verification (Stubbed)

Since we cannot run on actual GPU hardware in this environment, we rely on the stubbed NVML library used in tests.

#### A. NVML Monitor

**Refactored Logic:**
`nvml_monitor` now instantiates `AcceleratorManager`, probes backends, and uses `AcceleratorTelemetryAdapter` to interact with device handles provided by `NVMLBackend`.

**Verification Step:**
Verify that `nvml_monitor.py` correctly fetches the device count and metrics via the adapter. The adapter ensures that underlying `pynvml` calls are routed through the `AcceleratorManager`'s backend instance.
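
The monitor flow described above can be sketched as follows. The method names on the adapter and backend are assumptions for illustration, not the exact `gcm` interfaces:

```python
# Hypothetical sketch of the refactored nvml_monitor flow. The adapter and
# backend APIs shown here are assumptions, not the exact gcm interfaces.

class StubNVMLBackend:
    """Stands in for NVMLBackend; returns canned per-device metrics."""

    def device_count(self):
        return 1

    def get_metrics(self, index):
        return {"utilization_pct": 37, "memory_used_mib": 2048,
                "power_w": 115.0, "temperature_c": 61}


class AcceleratorTelemetryAdapter:
    """Routes all metric reads through the manager's backend instance,
    so no caller touches pynvml directly."""

    def __init__(self, backend):
        self._backend = backend

    def collect_all(self):
        return [self._backend.get_metrics(i)
                for i in range(self._backend.device_count())]


adapter = AcceleratorTelemetryAdapter(StubNVMLBackend())
samples = adapter.collect_all()
print(len(samples), samples[0]["utilization_pct"])  # 1 37
```

The point of the indirection is that `collect_all` never names `pynvml`: swapping the backend (stub, real NVML, a future vendor) changes nothing in the monitor loop.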

#### B. Health Checks

**Refactored Logic:**
`check_nvidia_smi` now instantiates `AcceleratorManager` and uses `AcceleratorTelemetryAdapter` to perform its checks.

**Verification Step:**
Verify that `check_nvidia_smi.py` correctly detects the GPU count and running processes via the adapter.
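
A sketch of the health-check path, again with illustrative method names layered on the documented adapter pattern:

```python
# Hypothetical sketch of the refactored check_nvidia_smi flow; the method
# names here are assumptions, not the exact gcm interfaces.

class StubBackend:
    def device_count(self):
        return 2

    def running_processes(self, index):
        # Device 0 has one fake workload; device 1 is idle.
        return [] if index else [{"pid": 4242, "name": "trainer"}]


class AcceleratorTelemetryAdapter:
    def __init__(self, backend):
        self._backend = backend

    def check_health(self):
        count = self._backend.device_count()
        procs = {i: self._backend.running_processes(i) for i in range(count)}
        # "Healthy" here simply means devices were enumerable.
        return {"gpu_count": count, "processes": procs, "healthy": count > 0}


result = AcceleratorTelemetryAdapter(StubBackend()).check_health()
assert result["healthy"] and result["gpu_count"] == 2
```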

## Refactoring Status

- **`gcm/accelerator`**: Core HAL interfaces and the NVML backend implementation are complete.
- **`nvml_monitor.py`**: Refactored to use `AcceleratorManager` via `AcceleratorTelemetryAdapter`.
- **`check_nvidia_smi.py`**: Refactored to use `AcceleratorManager` via `AcceleratorTelemetryAdapter`.
- **Legacy Shim**: Added `gcm/monitoring/accelerator_adapter.py` to bridge `DeviceTelemetryClient` calls to the HAL backend, ensuring 100% backward compatibility for methods not yet fully exposed in `MetricSet` (e.g., specific ECC error counts).
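
The legacy shim's bridging role can be sketched as below. Everything beyond the names already in this document (`DeviceTelemetryClient`, `MetricSet`, the HAL backend) is an illustrative assumption:

```python
# Hypothetical sketch of the legacy shim: it exposes the old
# DeviceTelemetryClient surface while delegating to a HAL backend.
# Method names beyond those in this document are illustrative assumptions.

class HALBackendStub:
    def get_ecc_errors(self, index):
        return {"corrected": 0, "uncorrected": 0}


class DeviceTelemetryClientShim:
    """Bridges legacy DeviceTelemetryClient calls to the HAL backend,
    covering fields (like ECC counts) not yet surfaced in MetricSet."""

    def __init__(self, backend):
        self._backend = backend

    def ecc_error_counts(self, device_index):
        # Legacy callers keep their old method name; the data now comes
        # from the HAL backend instead of direct pynvml calls.
        return self._backend.get_ecc_errors(device_index)


shim = DeviceTelemetryClientShim(HALBackendStub())
assert shim.ecc_error_counts(0) == {"corrected": 0, "uncorrected": 0}
```

The shim lets legacy call sites migrate without any signature changes; it can be deleted once `MetricSet` carries every field the old client exposed.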

## Rollout Strategy

1. **Phase 1 (Current PR)**: Introduce the HAL and migrate all NVML usage to `AcceleratorManager` via the adapter shim.
2. **Phase 2 (Future)**: Update the `nvml_monitor` logic to call `AcceleratorManager.read_metrics()` directly, removing the dependency on the `DeviceTelemetryClient` interface once `MetricSet` is expanded to cover all needs.

This incremental approach activates the new architecture immediately while minimizing risk to existing business logic.
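
As a rough picture of the Phase 2 target, callers would consume a `MetricSet` straight from `AcceleratorManager.read_metrics()`, with no client/shim layer in between. The field names and return values below are hypothetical placeholders:

```python
# Hypothetical sketch of the Phase 2 target: callers read a MetricSet
# directly from AcceleratorManager, with no DeviceTelemetryClient layer.
# Fields and values are illustrative placeholders only.
from dataclasses import dataclass


@dataclass
class MetricSet:
    utilization_pct: int
    memory_used_mib: int


class AcceleratorManager:
    def read_metrics(self, index):
        # In Phase 2 this would query the active backend directly.
        return MetricSet(utilization_pct=12, memory_used_mib=512)


metrics = AcceleratorManager().read_metrics(0)
assert metrics.utilization_pct == 12
```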