Conversation
Walkthrough: Context is threaded into inventory objects; resource quantities are sanitized and clamped to zero when negative or underflowing; pod keys become namespace-aware; and the watcher and node-update flows were adjusted. Tests were added to verify the clamping and underflow behavior.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
```go
func subAllocatedNLZ(allocated *resource.Quantity, val resource.Quantity) {
	newVal := allocated.Value() - val.Value()
```
`.Value()` collapses the quantity to whole units. For CPU (stored in milli-units), allocated = 500m and val = 200m both collapse to the same whole-unit value, so `newVal = 0` and you'd `Set(0)`, losing 300m of tracked allocation.
Over time, every pod delete involving sub-unit CPU quantities would drift Allocated toward 0, making Available appear larger than it really is.
`.Sub()`, on the other hand, operates on the full-precision quantity, so no milli-units are lost.
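The precision loss can be demonstrated with a plain-int sketch. `milli`, `wholeUnits`, and `subFullPrecision` are illustrative stand-ins, not apimachinery code; real `resource.Quantity` stores sub-core CPU in milli-units and `Value()` collapses it to whole units (modeled here by integer division):

```go
package main

import "fmt"

// milli models a CPU quantity kept in milli-units, e.g. 500m == 500.
// This is a toy stand-in for resource.Quantity, not the real type.
type milli int64

// wholeUnits collapses to whole cores, the way Quantity.Value() drops
// sub-core precision (integer division is a simplification here).
func (m milli) wholeUnits() int64 { return int64(m) / 1000 }

// subFullPrecision mimics the Quantity.Sub() path plus a clamp to zero,
// keeping every milli-unit.
func subFullPrecision(allocated, val milli) milli {
	remaining := allocated - val
	if remaining < 0 {
		remaining = 0
	}
	return remaining
}

func main() {
	allocated, val := milli(500), milli(200) // 500m and 200m

	// Whole-unit path from the diff: the 300m difference vanishes.
	fmt.Println(allocated.wholeUnits() - val.wholeUnits()) // 0

	// Full-precision path: 300m survives.
	fmt.Println(subFullPrecision(allocated, val)) // 300
}
```

Repeated deletes through the whole-unit path drift the tracked allocation toward zero, which is exactly the drift the comment describes.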
```go
restartPodsWatcher := func() error {
	if podsWatch != nil {
		select {
		case <-podsWatch.ResultChan():
```
`case <-podsWatch.ResultChan()` only drained one buffered event and didn't stop the watcher; the old watcher should be stopped with `Stop()` and its result channel drained until closed.
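A minimal sketch of the stop-then-drain pattern. `event` and `watcher` are toy stand-ins for `watch.Event` and the relevant slice of `watch.Interface`; in this toy, `Stop()` closes the channel directly, whereas a real watcher closes it from its own goroutine:

```go
package main

import "fmt"

// event and watcher are toy stand-ins for watch.Event and the relevant
// part of k8s.io/apimachinery's watch.Interface.
type event struct{ Type string }

type watcher struct{ ch chan event }

func (w *watcher) Stop()                    { close(w.ch) }
func (w *watcher) ResultChan() <-chan event { return w.ch }

// stopAndDrain stops the watcher first, then drains every buffered
// event; a single non-blocking receive would leave events behind and
// never release the watch.
func stopAndDrain(w *watcher) int {
	w.Stop()
	drained := 0
	for range w.ResultChan() { // runs until the closed channel is empty
		drained++
	}
	return drained
}

func main() {
	w := &watcher{ch: make(chan event, 3)}
	w.ch <- event{Type: "ADDED"}
	w.ch <- event{Type: "MODIFIED"}
	fmt.Println(stopAndDrain(w)) // 2
}
```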
```go
func podKey(pod *corev1.Pod) string {
	return pod.Namespace + "/" + pod.Name
}
```
We need this to prevent collisions between pods that share a name but live in different namespaces.
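A minimal illustration of why the namespace-qualified key matters; the `pod` struct here is a toy stand-in for `corev1.Pod` with just the fields the helper uses:

```go
package main

import "fmt"

// pod is a toy stand-in for corev1.Pod with just the fields podKey uses.
type pod struct{ Namespace, Name string }

// podKey mirrors the reviewed helper: the namespace prefix keeps
// same-named pods from different namespaces distinct.
func podKey(p pod) string { return p.Namespace + "/" + p.Name }

func main() {
	a := pod{Namespace: "team-a", Name: "web"}
	b := pod{Namespace: "team-b", Name: "web"}
	fmt.Println(a.Name == b.Name)       // true: a name-only key would collide
	fmt.Println(podKey(a) == podKey(b)) // false: namespace-aware keys stay distinct
}
```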
```go
		continue
	}
	switch evt.Type {
	case watch.Added:
```
This event was ignored before.
Actionable comments posted: 3
♻️ Duplicate comments (1)
operator/inventory/node-discovery.go (1)
541-546: ⚠️ Potential issue | 🟠 Major: Refresh node quantities on every `Added`/`Modified` event.
This guard skips `updateNodeInfo()` for capacity-only updates, and it also misses sub-core CPU changes because `nodeAllocatableChanged()` relies on `resource.Quantity.Value()`. That leaves stale allocatable/capacity data in `node`, so the new sanitization/clamping path is bypassed on a normal node update. `restartPodsWatcher()` can stay behind change detection if you want, but `updateNodeInfo()` itself should run for every `watch.Added`/`watch.Modified`.
🛠️ One safe way to split the refresh from the watcher restart:
```diff
 switch evt.Type {
 case watch.Added:
 	fallthrough
 case watch.Modified:
-	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
-		updateNodeInfo(ctx, obj, &node)
+	updateNodeInfo(ctx, obj, &node)
+	if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
 		if err = restartPodsWatcher(); err != nil {
 			return err
 		}
 	}
 	signalLabels()
```

In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call updateNodeInfo(...) for every watch.Added or watch.Modified event instead of skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely to decide whether to call restartPodsWatcher(), not to gate the update. Modify the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked unconditionally when evt.Type is Added or Modified, and only if nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher(). Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue() (or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core (milli) CPU values and use proper Quantity.Cmp/Equal semantics for other resources.
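A toy sketch of the milli-precision comparison the prompt asks for. `quantity`, `cpuChanged`, and their methods are illustrative stand-ins for `resource.Quantity`; the real `Value()` returns whole units, which integer division approximates here:

```go
package main

import "fmt"

// quantity is a toy stand-in for resource.Quantity, storing milli-units.
type quantity struct{ milli int64 }

// Value collapses to whole units (integer division is a simplification
// of the real rounding); MilliValue keeps full milli precision.
func (q quantity) Value() int64      { return q.milli / 1000 }
func (q quantity) MilliValue() int64 { return q.milli }

// cpuChanged compares with MilliValue so sub-core changes are detected.
func cpuChanged(old, cur quantity) bool {
	return old.MilliValue() != cur.MilliValue()
}

func main() {
	old, cur := quantity{milli: 500}, quantity{milli: 200}
	fmt.Println(old.Value() == cur.Value()) // true: whole-unit compare misses the change
	fmt.Println(cpuChanged(old, cur))       // true: milli compare catches it
}
```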
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains, or val, ok :=
m.TotalAllocatable.Storage["default"] followed by require.True(t, ok)) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.
In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
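The suggested clamp, reduced to a self-contained sketch; plain int64 values and the `clampAllocatable` name stand in for the node's GPU quantity fields:

```go
package main

import (
	"fmt"
	"log"
)

// clampAllocatable applies the suggested fix: clamp whenever
// allocatable exceeds capacity, with no `capacity > 0` guard, so a
// node whose normalized capacity is 0 is clamped as well.
func clampAllocatable(node string, allocatable, capacity int64) int64 {
	if allocatable > capacity {
		log.Printf("allocatable exceeds capacity: node=%s allocatable=%d capacity=%d",
			node, allocatable, capacity)
		return capacity
	}
	return allocatable
}

func main() {
	fmt.Println(clampAllocatable("node-a", 2, 0)) // 0: zero-capacity node is clamped
	fmt.Println(clampAllocatable("node-b", 1, 4)) // 1: within capacity, unchanged
}
```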
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.
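A reduced sketch of the reset-then-repopulate idea; the struct layout, resource names, and int64 quantities are illustrative stand-ins for the node's actual fields:

```go
package main

import "fmt"

type quantities struct {
	Allocatable, Capacity int64
}

type resources struct {
	CPU, Memory, EphemeralStorage, GPU quantities
}

// updateNodeInfo zeroes every field it owns first, then repopulates
// from the incoming maps, so a resource absent from the update cannot
// leave a stale non-zero value behind.
func updateNodeInfo(node *resources, allocatable, capacity map[string]int64) {
	*node = resources{} // explicit reset of all owned fields

	set := func(q *quantities, name string) {
		q.Allocatable = allocatable[name] // missing keys yield 0
		q.Capacity = capacity[name]
	}
	set(&node.CPU, "cpu")
	set(&node.Memory, "memory")
	set(&node.EphemeralStorage, "ephemeral-storage")
	set(&node.GPU, "nvidia.com/gpu")
}

func main() {
	// Node carries a stale GPU entry from a previous update.
	node := resources{GPU: quantities{Allocatable: 2, Capacity: 2}}

	// New update no longer reports the GPU resource.
	updateNodeInfo(&node, map[string]int64{"cpu": 4000}, map[string]int64{"cpu": 4000})

	fmt.Println(node.GPU.Allocatable) // 0: stale value cleared
	fmt.Println(node.CPU.Allocatable) // 4000: current value set
}
```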
---
Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0116d940-8636-4609-8a76-8b97ce343688
📒 Files selected for processing (5)
- cluster/kube/operators/clients/inventory/client.go
- cluster/kube/operators/clients/inventory/client_test.go
- cluster/kube/operators/clients/inventory/inventory.go
- operator/inventory/node-discovery.go
- operator/inventory/node-discovery_test.go
Resolves:
akash-network/support#429
Details
I haven't found a bug in how we handle inventory events.
I have, however, found a recent report about the GPU driver reporting wrong values.
That is why I decided to add guardrails that prevent uint underflow, and I added logging so we can track this in the future.
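The underflow guardrail can be sketched as follows; `subClamped` is a hypothetical helper illustrating the idea, not the PR's actual code:

```go
package main

import (
	"fmt"
	"log"
)

// subClamped sketches the guardrail: subtracting past zero on a uint64
// wraps around to a huge value, so check first, clamp to 0, and log.
func subClamped(total, used uint64) uint64 {
	if used > total {
		log.Printf("underflow prevented: used=%d total=%d, clamping to 0", used, total)
		return 0
	}
	return total - used
}

func main() {
	fmt.Println(subClamped(100, 40)) // 60
	fmt.Println(subClamped(10, 40))  // 0 instead of wrapping to ~1.8e19
}
```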
Changes
Testing
Tested locally on a non-GPU machine and checked that CPU values change consistently compared to the main branch.
Checked CPU and memory changes at each stage:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.