
Artem/fix gpu count#379

Open
vertex451 wants to merge 11 commits intomainfrom
artem/fix-gpu-count

Conversation

Contributor

@vertex451 vertex451 commented Mar 23, 2026

Resolves:

akash-network/support#429

Details

I haven't found a bug in how we handle inventory events.

I also found a recent report about a GPU driver reporting wrong values.

That is why I decided to add guardrails that prevent uint underflow. I also added logging so we can track this in the future.

Changes

  • Safe int64→uint64 conversion - clamp negatives to 0 instead of raw cast; guard against -1 sentinel in Available() returning MaxInt64
  • Sanitize all resource quantities at ingest - reject negative values from knode.Status.Allocatable/Capacity for CPU, Memory, GPU, Storage
  • GPU allocatable upper-bound - clamp to Capacity if Allocatable > Capacity
  • Pod key uses namespace/name - prevents cross-namespace accounting collision
  • Handle watch.Added on node watcher reconnect - was silently dropped, leaving stale allocatable
  • podsWatch.Stop() on restart - old code drained one event and leaked the watcher
  • Precision-preserving Sub() - old Value() - Value() truncated milli-precision, drifting CPU allocated toward 0

Testing

Tested locally on a non-GPU machine, verifying that CPU accounting changes consistently compared to the main branch.
Checked CPU and memory changes at each stage:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.

@coderabbitai

coderabbitai bot commented Mar 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29761cdf-6852-49ba-8783-57246d667bec

📥 Commits

Reviewing files that changed from the base of the PR and between b534eb1 and 824e565.

📒 Files selected for processing (2)
  • cluster/kube/operators/clients/inventory/client_test.go
  • operator/inventory/node-discovery.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cluster/kube/operators/clients/inventory/client_test.go

Walkthrough

Context is threaded into inventory objects; resource quantities are sanitized/clamped to zero when negative or underflowing; pod keys become namespace-aware; watcher and node update flows were adjusted. Tests added to verify clamping and underflow behavior.

Changes

Cohort / File(s) — Summary

  • Inventory client / context (cluster/kube/operators/clients/inventory/client.go): Added ctx to the inventory struct and updated the subscriber flow to construct outgoing inventory messages with the request context.
  • Inventory metrics & clamping (cluster/kube/operators/clients/inventory/inventory.go): newInventory now accepts context.Context; added clampUint64/clampAvailableUint64; Metrics() uses the context logger and applies clamping for CPU, GPU, Memory, Ephemeral Storage, and per-storage-class values; dup() simplified.
  • Node discovery, sanitization & pod tracking (operator/inventory/node-discovery.go): Added sanitizeResourceQuantity and sentinel errors; node resource population now zeroes then repopulates values, clamps negatives, and enforces GPU allocatable ≤ capacity; changed allocation subtraction to subAllocatedNLZ with logging and clamping; switched pod map keys to the namespace-aware podKey(); adjusted watcher lifecycle and node event handling (filter by name, handle Added/Modified, deep-copy node).
  • Tests (cluster/kube/operators/clients/inventory/client_test.go, operator/inventory/node-discovery_test.go): Added TestInventoryMetrics_negative_quantities_clamped (table-driven) verifying clamping in inventory metrics; added TestSanitizeResourceQuantity and TestSubAllocatedNLZ_underflow_clamped validating sanitization and underflow clamping.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through context, steady and spry,
I clip negatives down to a friendly 0, why —
Pods now wear namespaced shoes that fit,
Watchers restart clean; no more misfit! 🌿✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 5.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: The title 'Artem/fix gpu count' is vague and generic; while it references GPU, it does not convey the full scope of the changeset, which encompasses multiple resource sanitization, logging, and pod-tracking fixes beyond GPU counting. Resolution: consider a more descriptive title, such as 'Add resource sanitization and underflow guards for inventory tracking' or 'Fix resource quantity handling and pod accounting in node discovery'.

✅ Passed checks (1 passed)

  • Description check — ✅ Passed: The pull request description is comprehensive and directly related to the changeset, detailing the specific issues being addressed (resource underflow, sanitization, watch handling, precision preservation) and their implementation.


```go
}

func subAllocatedNLZ(allocated *resource.Quantity, val resource.Quantity) {
	newVal := allocated.Value() - val.Value()
```
Contributor Author

.Value() collapses quantities to whole units. For CPU (stored in milli), allocated = 500m and val = 200m both collapse to the same whole-core value, so newVal = 0 and you'd Set(0), losing 300m of tracked allocation.

Over time, every pod delete that involved sub-unit CPU quantities would drift Allocated toward 0, making Available appear larger than reality.

.Sub(), on the other hand, operates on the full-precision quantity.

```go
restartPodsWatcher := func() error {
	if podsWatch != nil {
		select {
		case <-podsWatch.ResultChan():
```
Contributor Author

@vertex451 vertex451 Mar 25, 2026

case <-podsWatch.ResultChan() only drained one buffered event and didn't stop the watcher.

```go
}

func podKey(pod *corev1.Pod) string {
	return pod.Namespace + "/" + pod.Name
```
Contributor Author

We need this to prevent a collision between pods with the same name but from different namespaces.

```go
	continue
}
switch evt.Type {
case watch.Added:
```
Contributor Author

This event was silently ignored before.

@vertex451 vertex451 marked this pull request as ready for review March 25, 2026 13:45
@vertex451 vertex451 requested a review from a team as a code owner March 25, 2026 13:45
@vertex451 vertex451 self-assigned this Mar 25, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
operator/inventory/node-discovery.go (1)

541-546: ⚠️ Potential issue | 🟠 Major

Refresh node quantities on every Added/Modified event.

This guard skips updateNodeInfo() for capacity-only updates, and it also misses sub-core CPU changes because nodeAllocatableChanged() relies on resource.Quantity.Value(). That leaves stale allocatable/capacity data in node, so the new sanitization/clamping path is bypassed on a normal node update. restartPodsWatcher() can stay behind change detection if you want, but updateNodeInfo() itself should run for every watch.Added/watch.Modified.

🛠️ One safe way to split the refresh from the watcher restart
```diff
 switch evt.Type {
 case watch.Added:
 	fallthrough
 case watch.Modified:
-	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
-		updateNodeInfo(ctx, obj, &node)
+	updateNodeInfo(ctx, obj, &node)
+	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
 		if err = restartPodsWatcher(); err != nil {
 			return err
 		}
 	}
 	signalLabels()
```
In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call
updateNodeInfo(...) for every watch.Added or watch.Modified event instead of
skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely
to decide whether to call restartPodsWatcher(), not to gate the update. Modify
the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node)
is invoked unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains/ok :=
m.TotalAllocatable.Storage["default"]) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.

In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.

---

Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0116d940-8636-4609-8a76-8b97ce343688

📥 Commits

Reviewing files that changed from the base of the PR and between ba474df and b534eb1.

📒 Files selected for processing (5)
  • cluster/kube/operators/clients/inventory/client.go
  • cluster/kube/operators/clients/inventory/client_test.go
  • cluster/kube/operators/clients/inventory/inventory.go
  • operator/inventory/node-discovery.go
  • operator/inventory/node-discovery_test.go
