Commit 44b770f

Fix healthchecking on old devices
In cases where registering events for a device is not supported, we should not mark the device as unhealthy, but skip the device instead.

Signed-off-by: Evan Lezar <[email protected]>
1 parent b7ab21c commit 44b770f

1 file changed: +3 -3 lines changed

internal/rm/health.go

Lines changed: 3 additions & 3 deletions
@@ -102,10 +102,10 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
 		}
 
 		ret = gpu.RegisterEvents(eventMask&supportedEvents, eventSet)
-		if ret == nvml.ERROR_NOT_SUPPORTED {
+		switch {
+		case ret == nvml.ERROR_NOT_SUPPORTED:
 			klog.Warningf("Device %v is too old to support healthchecking.", d.ID)
-		}
-		if ret != nvml.SUCCESS {
+		case ret != nvml.SUCCESS:
 			klog.Infof("Marking device %v as unhealthy: %v", d.ID, ret)
 			unhealthy <- d
 		}
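
Why the change matters: with two independent if statements, a return value of ERROR_NOT_SUPPORTED also satisfies ret != SUCCESS, so the device was warned about and then still marked unhealthy. A switch without a tag runs only the first matching case. The following is a minimal standalone sketch of that difference, not the plugin's code: returnCode and its constants are hypothetical stand-ins for nvml.Return, and oldOutcome and newOutcome are illustrative names that only record which branches would run.

```go
package main

import "fmt"

// returnCode is a hypothetical stand-in for nvml.Return.
type returnCode int

const (
	success returnCode = iota
	errorNotSupported
	errorUnknown
)

// oldOutcome mirrors the two independent if statements before the change.
// Both conditions are evaluated, so errorNotSupported triggers both branches.
func oldOutcome(ret returnCode) (warned, unhealthy bool) {
	if ret == errorNotSupported {
		warned = true
	}
	if ret != success {
		unhealthy = true
	}
	return warned, unhealthy
}

// newOutcome mirrors the switch after the change.
// Go switch cases do not fall through, so only the first matching case runs.
func newOutcome(ret returnCode) (warned, unhealthy bool) {
	switch {
	case ret == errorNotSupported:
		warned = true
	case ret != success:
		unhealthy = true
	}
	return warned, unhealthy
}

func main() {
	for _, ret := range []returnCode{success, errorNotSupported, errorUnknown} {
		ow, ou := oldOutcome(ret)
		nw, nu := newOutcome(ret)
		fmt.Printf("ret=%d old: warned=%t unhealthy=%t | new: warned=%t unhealthy=%t\n",
			ret, ow, ou, nw, nu)
	}
}
```

In this sketch only the errorNotSupported case differs between the two versions; success and other errors behave the same before and after.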
