
Artem/fix gpu count#379

Open
vertex451 wants to merge 11 commits intomainfrom
artem/fix-gpu-count

Conversation

Contributor

@vertex451 vertex451 commented Mar 23, 2026

Resolves:

akash-network/support#429

Details

I haven't found a bug in how we handle inventory events.

I also found a recent report about a GPU driver reporting wrong values.

That is why I decided to add guardrails that prevent uint underflow. I also added logging so we can track this in the future.

Changes

  • Safe int64→uint64 conversion - clamp negatives to 0 instead of raw cast; guard against -1 sentinel in Available() returning MaxInt64
  • Sanitize all resource quantities at ingest - reject negative values from knode.Status.Allocatable/Capacity for CPU, Memory, GPU, Storage
  • GPU allocatable upper-bound - clamp to Capacity if Allocatable > Capacity
  • Pod key uses namespace/name - prevents cross-namespace accounting collision
  • Handle watch.Added on node watcher reconnect - was silently dropped, leaving stale allocatable
  • podsWatch.Stop() on restart - old code drained one event and leaked the watcher
  • Precision-preserving Sub() - old Value() - Value() truncated milli-precision, drifting CPU allocated toward 0

Testing

Tested locally on a non-GPU machine, verifying that CPU accounting changes consistently compared to the main branch.
Checked CPU and memory changes at each stage:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.

@coderabbitai

coderabbitai bot commented Mar 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29761cdf-6852-49ba-8783-57246d667bec

📥 Commits

Reviewing files that changed from the base of the PR and between b534eb1 and 824e565.

📒 Files selected for processing (2)
  • cluster/kube/operators/clients/inventory/client_test.go
  • operator/inventory/node-discovery.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cluster/kube/operators/clients/inventory/client_test.go

Walkthrough

Context is threaded into inventory objects; resource quantities are sanitized/clamped to zero when negative or underflowing; pod keys become namespace-aware; watcher and node update flows were adjusted. Tests added to verify clamping and underflow behavior.

Changes

Cohort / File(s) — Summary

  • Inventory client / context (cluster/kube/operators/clients/inventory/client.go): Added ctx to the inventory struct and updated the subscriber flow to construct outgoing inventory messages with the request context.
  • Inventory metrics & clamping (cluster/kube/operators/clients/inventory/inventory.go): newInventory now accepts context.Context; added clampUint64/clampAvailableUint64; Metrics() uses the context logger and applies clamping for CPU, GPU, Memory, Ephemeral Storage, and per-storage-class values; dup() simplified.
  • Node discovery, sanitization & pod tracking (operator/inventory/node-discovery.go): Added sanitizeResourceQuantity and sentinel errors; node resource population now zeroes then repopulates values, clamps negatives, and enforces GPU allocatable ≤ capacity; changed allocation subtraction to subAllocatedNLZ with logging and clamping; switched pod map keys to the namespace-aware podKey(); adjusted watcher lifecycle and node event handling (filter by name, handle Added/Modified, deep-copy node).
  • Tests (cluster/kube/operators/clients/inventory/client_test.go, operator/inventory/node-discovery_test.go): Added TestInventoryMetrics_negative_quantities_clamped (table-driven) verifying clamping in inventory metrics; added TestSanitizeResourceQuantity and TestSubAllocatedNLZ_underflow_clamped validating sanitization and underflow clamping.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through context, steady and spry,
I clip negatives down to a friendly 0, why —
Pods now wear namespaced shoes that fit,
Watchers restart clean; no more misfit! 🌿✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 5.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: The title 'Artem/fix gpu count' is vague and generic; while it references GPU, it does not convey the full scope of the changeset, which encompasses multiple resource sanitization, logging, and pod-tracking fixes beyond GPU counting. Resolution: consider a more descriptive title, such as 'Add resource sanitization and underflow guards for inventory tracking' or 'Fix resource quantity handling and pod accounting in node discovery'.

✅ Passed checks (1 passed)

  • Description check — ✅ Passed: The pull request description is comprehensive and directly related to the changeset, detailing the specific issues being addressed (resource underflow, sanitization, watch handling, precision preservation) and their implementation.


```go
}

func subAllocatedNLZ(allocated *resource.Quantity, val resource.Quantity) {
	newVal := allocated.Value() - val.Value()
```
Contributor Author

.Value() collapses quantities to whole units. For CPU (stored in milli), allocated = 500m and val = 200m both collapse to the same whole-core value, so newVal = 0 and you'd Set(0), losing 300m of tracked allocation.

Over time, every pod delete that involved sub-unit CPU quantities would drift Allocated toward 0, making Available appear larger than reality.

.Sub(), on the other hand, operates on the full-precision quantity.

```go
restartPodsWatcher := func() error {
	if podsWatch != nil {
		select {
		case <-podsWatch.ResultChan():
```
Contributor Author

@vertex451 vertex451 Mar 25, 2026

case <-podsWatch.ResultChan() only drained one buffered event and didn't stop the watcher.

```go
}

func podKey(pod *corev1.Pod) string {
	return pod.Namespace + "/" + pod.Name
```
Contributor Author

We need this to prevent a collision between pods with the same name but from different namespaces.

```go
	continue
}
switch evt.Type {
case watch.Added:
```
Contributor Author

This event was silently ignored before.

@vertex451 vertex451 marked this pull request as ready for review March 25, 2026 13:45
@vertex451 vertex451 requested a review from a team as a code owner March 25, 2026 13:45
@vertex451 vertex451 self-assigned this Mar 25, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
operator/inventory/node-discovery.go (1)

541-546: ⚠️ Potential issue | 🟠 Major

Refresh node quantities on every Added/Modified event.

This guard skips updateNodeInfo() for capacity-only updates, and it also misses sub-core CPU changes because nodeAllocatableChanged() relies on resource.Quantity.Value(). That leaves stale allocatable/capacity data in node, so the new sanitization/clamping path is bypassed on a normal node update. restartPodsWatcher() can stay behind change detection if you want, but updateNodeInfo() itself should run for every watch.Added/watch.Modified.

🛠️ One safe way to split the refresh from the watcher restart
```diff
 switch evt.Type {
 case watch.Added:
 	fallthrough
 case watch.Modified:
-	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
-		updateNodeInfo(ctx, obj, &node)
+	updateNodeInfo(ctx, obj, &node)
+	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
 		if err = restartPodsWatcher(); err != nil {
 			return err
 		}
 	}
 	signalLabels()
```
In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call
updateNodeInfo(...) for every watch.Added or watch.Modified event instead of
skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely
to decide whether to call restartPodsWatcher(), not to gate the update. Modify
the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node)
is invoked unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains/ok :=
m.TotalAllocatable.Storage["default"]) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.

In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.

---

Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0116d940-8636-4609-8a76-8b97ce343688

📥 Commits

Reviewing files that changed from the base of the PR and between ba474df and b534eb1.

📒 Files selected for processing (5)
  • cluster/kube/operators/clients/inventory/client.go
  • cluster/kube/operators/clients/inventory/client_test.go
  • cluster/kube/operators/clients/inventory/inventory.go
  • operator/inventory/node-discovery.go
  • operator/inventory/node-discovery_test.go
