Skip to content

Adding CancelDrainTask to ASG termination to close orphaned generated heartbeat from nodes failing to cordon and drain #1173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 30, 2025

Conversation

pshand-1
Copy link
Contributor

Issue

Fixes #1172

Problem Description

The Node Termination Handler has a critical bug in ASG termination event handling that creates orphaned heartbeat goroutines when node drain operations fail.

Current Behavior (Buggy)

When an ASG termination event fails to drain a node:

  1. PreDrainTask starts a heartbeat goroutine
  2. cordonAndDrainNode fails to evict pods
  3. CancelInterruptionEvent removes the event but never stops the heartbeat
  4. The heartbeat goroutine continues running indefinitely, sending heartbeats every 30 seconds

Impact

  • Resource leak: Failed drain attempts create orphaned goroutines
  • Cascading failure: Events retry every 20 seconds, creating new orphaned heartbeats each time
  • AWS API spam: Orphaned heartbeats continue sending unnecessary API calls

Solution

Implemented a CancelDrainTask mechanism that mirrors the existing PreDrainTask/PostDrainTask pattern to properly terminate heartbeats on drain failures.

Key Changes

pkg/monitor/sqsevent/asg-lifecycle-event.go

  • Added cancelHeartbeatCh channel for heartbeat cancellation
  • Created CancelDrainTask function to close the cancel channel
  • Enhanced SendHeartbeats to listen for cancellation signals
  • Added proper logging for heartbeat cancellation events

pkg/interruptionevent/draincordon/handler.go

  • Added drain failure detection in the error handling path
  • Calls RunCancelDrainTask when drain operations fail and CancelDrainTask exists
  • Maintains existing error handling flow while adding cleanup

pkg/monitor/sqsevent/sqs-monitor_test.go

  • Added unit tests for CancelDrainTask creation and execution
  • Added heartbeat cancellation test to verify proper termination
  • Integrated with existing test patterns

Testing

Automated Tests (All Passing)

  • ✅ Unit tests (make unit-test)
  • ✅ E2E tests (make e2e-test)
  • ✅ Compatibility tests (make compatibility-test)
  • ✅ License validation (make license-test)
  • ✅ Linting (make go-linter)
  • ✅ Helm validation (make helm-lint)
  • ✅ Spell check (make spellcheck)

Tested on: macOS (ARM64) (also ran make unit-test on Linux x86_64)
Kubernetes Version: 1.30

Manual Validation

Scenario: Deployed NTH in EKS cluster and blocked Kubernetes API calls to simulate drain failures

Before Fix:

- New heartbeats created every 20 seconds
- Heartbeats continue indefinitely (tested over 2+ hours)
- Multiple orphaned goroutines accumulating

After Fix:

2025/06/26 23:15:30 INF Failed to cordon and drain the node, stopping heartbeat asgName=...
- Heartbeat properly terminated on drain failure
- No orphaned goroutines
- Clean event cleanup

Backward Compatibility

  • ✅ No breaking changes to existing APIs
  • ✅ Maintains existing successful drain flow
  • ✅ Only adds cleanup for failure scenarios
  • CancelDrainTask is optional (nil-safe)

Code Implementation

  • Follows existing code patterns and conventions
  • Comprehensive error handling and logging
  • Proper resource cleanup

Possible Reproduction Steps (for verification):

  1. Create two EKS Clusters
  2. Deploy NTH on one of them with deleteSqsMsgIfNodeNotFound=false
  3. Terminate instance on different cluster with same tag as NTH managed tag
  4. Observe repeated drain failures creating orphaned heartbeats
  5. With fix: heartbeats properly terminate on failure

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pshand-1 pshand-1 requested a review from a team as a code owner June 27, 2025 03:14
Copy link
Contributor

@tiationg-kho tiationg-kho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks for fixing this issue.

@tiationg-kho tiationg-kho merged commit 82f5378 into aws:main Jun 30, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS Node Termination Handler Bug Fix Request Heartbeat
2 participants