Adding CancelDrainTask to ASG termination to close orphaned generated heartbeat from nodes failing to cordon and drain #1173
Issue
Fixes #1172
Problem Description
The Node Termination Handler has a critical bug in ASG termination event handling that creates orphaned heartbeat goroutines when node drain operations fail.
Current Behavior (Buggy)
When an ASG termination event fails to drain a node:
- `PreDrainTask` starts a heartbeat goroutine
- `cordonAndDrainNode` fails to evict pods
- `CancelInterruptionEvent` removes the event but never stops the heartbeat
Impact
Solution
Implemented a `CancelDrainTask` mechanism that mirrors the existing `PreDrainTask`/`PostDrainTask` pattern to properly terminate heartbeats on drain failures.
Key Changes
pkg/monitor/sqsevent/asg-lifecycle-event.go
- `cancelHeartbeatCh` channel for heartbeat cancellation
- `CancelDrainTask` function to close the cancel channel
- `SendHeartbeats` to listen for cancellation signals
pkg/interruptionevent/draincordon/handler.go
- `RunCancelDrainTask` when drain operations fail and `CancelDrainTask` exists
pkg/monitor/sqsevent/sqs-monitor_test.go
- `CancelDrainTask` creation and execution
Testing
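As an illustration of what a check on the cancellation path could look like, here is a minimal, self-contained sketch. It is not the test added in `sqs-monitor_test.go`; `startHeartbeats` and `checkCancelStopsHeartbeats` are hypothetical names, and the heartbeat is modelled as a plain callback rather than a real ASG lifecycle call.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// startHeartbeats is a hypothetical helper: it calls send every interval
// until cancel is closed, then closes the returned channel.
func startHeartbeats(interval time.Duration, send func(), cancel <-chan struct{}) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				send()
			case <-cancel:
				return
			}
		}
	}()
	return done
}

// checkCancelStopsHeartbeats returns an error if the goroutine keeps running
// or keeps sending heartbeats after the cancel channel is closed.
func checkCancelStopsHeartbeats() error {
	var beats int64
	cancel := make(chan struct{})
	done := startHeartbeats(5*time.Millisecond, func() { atomic.AddInt64(&beats, 1) }, cancel)

	time.Sleep(30 * time.Millisecond) // let some heartbeats through
	close(cancel)

	select {
	case <-done:
	case <-time.After(time.Second):
		return fmt.Errorf("heartbeat goroutine still running after cancel")
	}

	after := atomic.LoadInt64(&beats)
	time.Sleep(30 * time.Millisecond)
	if now := atomic.LoadInt64(&beats); now != after {
		return fmt.Errorf("heartbeats continued after cancel: %d -> %d", after, now)
	}
	return nil
}

func main() {
	if err := checkCancelStopsHeartbeats(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

Waiting on the `done` channel before comparing counters avoids a race between the final tick and the assertion.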
Automated Tests (All Passing)
- `make unit-test`
- `make e2e-test`
- `make compatibility-test`
- `make license-test`
- `make go-linter`
- `make helm-lint`
- `make spellcheck`
Tested on: macOS (ARM64) (also ran `make unit-test` on Linux x86_64)
Kubernetes Version: 1.30
Manual Validation
Scenario: Deployed NTH in EKS cluster and blocked Kubernetes API calls to simulate drain failures
Before Fix:
After Fix:
Backward Compatibility
- `CancelDrainTask` is optional (nil-safe)
Code Implementation
Possible Reproduction Steps (for verification):
deleteSqsMsgIfNodeNotFound=false
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.