-
Notifications
You must be signed in to change notification settings - Fork 794
feat: add retry logic for netlink dump operations #3440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8be2075 to
655920a
Compare
51648a3 to
a1bb512
Compare
|
Thanks @dcoppa for the PR. can you also open a corresponding issue for this PR? (under what conditions in IPv6 cluster do you see this issue? if we can re-produce this issue on our side, that will be great) |
Hi! Here's the issue: #3461 The problem has also already been discussed in issue number #3196. |
|
Thanks for raising this PR. We will take a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds configurable retry logic for netlink dump operations to handle ErrDumpInterrupted errors that can occur during IPv6 network operations on EKS clusters. The implementation wraps netlink operations with automatic retry functionality and provides an environment variable for configuration.
- Implements retry mechanism in
netlinkwrapperpackage for LinkList, RouteList, AddrList, and RuleList operations - Updates multiple files to use the
netlinkwrapperinstead of direct netlink calls - Adds hardcoded retry logic with 5 maximum attempts (despite PR description mentioning configurable environment variable)
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/netlinkwrapper/netlink.go | Adds retry logic wrapper function and implements retry mechanism for dump operations |
| test/agent/cmd/networking/tester/network.go | Updates to use netlinkwrapper instead of direct netlink calls |
| test/_cmd/packet-verifier/packet-verifier.go | Updates to use netlinkwrapper instead of direct netlink calls |
| pkg/networkutils/network.go | Updates parameter name for consistency |
| cmd/routed-eni-cni-plugin/driver/driver.go | Minor fix to use netLink interface method |
| cmd/aws-vpc-cni-init/main.go | Updates to use netlinkwrapper instead of direct netlink calls |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
f939764 to
e89ac62
Compare
|
/lgtm |
- Wrap netlink.ErrDumpInterrupted errors with retry logic in netlinkwrapper - Implement retry mechanism for LinkList, RouteList, AddrList, and RuleList operations - Add AWS_VPC_K8S_CNI_NETLINK_MAX_RETRIES environment variable (default: 5) - Update test files to use netlinkwrapper instead of direct netlink calls
…er before it is initialized
What type of PR is this?
improvement
Which issue does this PR fix?:
Errors like:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b413c572e7bac7a27767123be3da3e031a3ac5b3695efc995c15057996b2c930": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistentWhat does this PR do / Why do we need it?:
This PR adds a retry mechanism when netlink fails with error ErrDumpInterrupted.
Adapted from https://github.com/containernetworking/plugins/blob/main/pkg/netlinksafe/netlink.go
This new PR should address all the remarks made on my previous submission.
Sorry for the delay, life got in the way.
Testing done on this change:
Unit tests passed.
Also tested on my sandbox EKS cluster.
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No / Yes
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
Implement a retry mechanism for netlink that, on an IPv6 EKS cluster, should fix (or mitigate) the following error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e868238d122df610228688f51fee6a187544a45ce43a9d613307b6b08246e65a": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistentNo additional actions from users required.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.