Skip to content

Conversation

@dcoppa
Copy link
Contributor

@dcoppa dcoppa commented Sep 16, 2025

  • Wrap netlink.ErrDumpInterrupted errors with retry logic in netlinkwrapper
  • Implement retry mechanism for LinkList, RouteList, AddrList, and RuleList operations
  • Update files to use netlinkwrapper instead of direct netlink calls

What type of PR is this?

improvement

Which issue does this PR fix?:

Errors like:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b413c572e7bac7a27767123be3da3e031a3ac5b3695efc995c15057996b2c930": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistent

What does this PR do / Why do we need it?:

This PR adds a retry mechanism when netlink fails with error ErrDumpInterrupted.

Adapted from https://github.com/containernetworking/plugins/blob/main/pkg/netlinksafe/netlink.go

This new PR should address all the remarks made on my previous submission.
Sorry for the delay, life got in the way.

Testing done on this change:

Unit tests passed.
Also tested on my sandbox EKS cluster.

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:

No / Yes

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

Implement a retry mechanism for netlink that, on an IPv6 EKS cluster, should fix (or mitigate) the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e868238d122df610228688f51fee6a187544a45ce43a9d613307b6b08246e65a": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistent

No additional actions from users required.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dcoppa dcoppa requested a review from a team as a code owner September 16, 2025 12:19
@dcoppa dcoppa force-pushed the master branch 2 times, most recently from 8be2075 to 655920a Compare September 23, 2025 05:44
@dcoppa dcoppa changed the title chore: add configurable retry logic for netlink dump operations feat: add configurable retry logic for netlink dump operations Sep 23, 2025
@dcoppa dcoppa force-pushed the master branch 3 times, most recently from 51648a3 to a1bb512 Compare September 27, 2025 06:39
@viveksb007
Copy link
Contributor

Thanks @dcoppa for the PR.

can you also open a corresponding issue for this PR? (under what conditions in IPv6 cluster do you see this issue? if we can re-produce this issue on our side, that will be great)

@dcoppa
Copy link
Contributor Author

dcoppa commented Sep 28, 2025

Thanks @dcoppa for the PR.

can you also open a corresponding issue for this PR? (under what conditions in IPv6 cluster do you see this issue? if we can re-produce this issue on our side, that will be great)

Hi!

Here's the issue: #3461

The problem has also already been discussed in issue number #3196.
Take a look in particular at @yash97 ’s comments.

@jaydeokar
Copy link
Collaborator

Thanks for raising this PR. We will take a look

@jaydeokar jaydeokar requested a review from Copilot September 30, 2025 20:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds configurable retry logic for netlink dump operations to handle ErrDumpInterrupted errors that can occur during IPv6 network operations on EKS clusters. The implementation wraps netlink operations with automatic retry functionality and provides an environment variable for configuration.

  • Implements retry mechanism in netlinkwrapper package for LinkList, RouteList, AddrList, and RuleList operations
  • Updates multiple files to use the netlinkwrapper instead of direct netlink calls
  • Adds hardcoded retry logic with 5 maximum attempts (despite PR description mentioning configurable environment variable)

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/netlinkwrapper/netlink.go Adds retry logic wrapper function and implements retry mechanism for dump operations
test/agent/cmd/networking/tester/network.go Updates to use netlinkwrapper instead of direct netlink calls
test/_cmd/packet-verifier/packet-verifier.go Updates to use netlinkwrapper instead of direct netlink calls
pkg/networkutils/network.go Updates parameter name for consistency
cmd/routed-eni-cni-plugin/driver/driver.go Minor fix to use netLink interface method
cmd/aws-vpc-cni-init/main.go Updates to use netlinkwrapper instead of direct netlink calls

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@dcoppa dcoppa changed the title feat: add configurable retry logic for netlink dump operations feat: add retry logic for netlink dump operations Oct 1, 2025
@dcoppa dcoppa force-pushed the master branch 3 times, most recently from f939764 to e89ac62 Compare October 3, 2025 19:42
@yash97
Copy link
Contributor

yash97 commented Oct 5, 2025

/lgtm

- Wrap netlink.ErrDumpInterrupted errors with retry logic in netlinkwrapper
- Implement retry mechanism for LinkList, RouteList, AddrList, and RuleList operations
- Add AWS_VPC_K8S_CNI_NETLINK_MAX_RETRIES environment variable (default: 5)
- Update test files to use netlinkwrapper instead of direct netlink calls
@yash97 yash97 merged commit ff786a6 into aws:master Oct 13, 2025
7 checks passed
@jaydeokar jaydeokar modified the milestones: v1.20.4, v1.20.5 Oct 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority/P0 Highest priority. Someone needs to actively work on this. review-pending

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants