Conversation

@phuhung273
Contributor

@phuhung273 phuhung273 commented Apr 6, 2025

What type of PR is this?
improvement

Which issue does this PR fix?:
Close #3094

What does this PR do / Why do we need it?:

  • Do not reserve an ENI slot for a trunk ENI on instance types that do not support ENI trunking
  • Add a CreateAndWaitTillManagedNGReady helper to the integration test utilities

Testing done on this change:
make unit-test

Integration test

# ginkgo -v --fail-on-pending --  --cluster-kubeconfig=$KUBECONFIG  --cluster-name=$CLUSTER_NAME  --aws-region=$AWS_REGION  --aws-vpc-id=$VPC_ID
2025/09/23 22:03:22 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
Running Suite: ENI Trunking Suite - /home/hungtran/dev/amazon-vpc-cni-k8s/test/integration/eni-trunking
=======================================================================================================
Random Seed: 1758639804

Will run 1 of 1 specs
------------------------------
[BeforeSuite]
/home/hungtran/dev/amazon-vpc-cni-k8s/test/integration/eni-trunking/eni_trunking_suite_test.go:31
  STEP: creating test namespace @ 09/23/25 22:03:40.419
  STEP: Getting Private subnets @ 09/23/25 22:03:41.199
  STEP: Deploying non-eni-trunking t3.medium managed nodegroup of size 1 @ 09/23/25 22:03:44.803
[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
        >  goroutine 198 [running]:
        >  runtime/debug.Stack()
        >       /snap/go/10938/src/runtime/debug/stack.go:26 +0x5e
        >  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
        >       /root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/log.go:60 +0xcd
        >  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc0007a2600, 0x3)
        >       /root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:111 +0x32
        >  github.com/go-logr/logr.Logger.Info({{0x6a4ca00?, 0xc0007a2600?}, 0xc000974000?}, {0x633272e, 0x12}, {0xc000d44c60, 0x6, 0x6})
        >       /root/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:276 +0x6e
        >  k8s.io/client-go/tools/cache.(*Reflector).RunWithContext(0xc000256780, {0x6a44b98, 0xc0007662d0})
        >       /root/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:357 +0x1c8
        >  k8s.io/client-go/tools/cache.(*controller).RunWithContext.(*Group).StartWithContext.func3()
        >       /root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:63 +0x1f
        >  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        >       /root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x4c
        >  created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 176
        >       /root/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73
[BeforeSuite] PASSED [131.338 seconds]
------------------------------
ENI Trunking Suite ENABLE_POD_ENI=true Non ENI trunking instance can scale to maxPods
/home/hungtran/dev/amazon-vpc-cni-k8s/test/integration/eni-trunking/eni_trunking_test.go:23
  STEP: setting the environment variables on the ds to map[ENABLE_POD_ENI:true] @ 09/23/25 22:06:02.66
  STEP: getting the aws-node daemon set in namespace kube-system @ 09/23/25 22:06:02.66
  STEP: updating the daemon set with new environment variable @ 09/23/25 22:06:02.961
  STEP: Deploying 13 Busybox pods @ 09/23/25 22:06:09.248
• [25.472 seconds]
------------------------------
[AfterSuite]
/home/hungtran/dev/amazon-vpc-cni-k8s/test/integration/eni-trunking/eni_trunking_suite_test.go:61
  STEP: Deleting test namespace @ 09/23/25 22:06:28.132
  STEP: Deleting Managed Nodegroup @ 09/23/25 22:06:44.33
[AfterSuite] PASSED [395.832 seconds]
------------------------------

Ran 1 of 1 Specs in 552.642 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Ginkgo ran 1 suite in 9m22.027200733s
Test Suite Passed

Manual test

Prerequisites:

SGP enabled
Screenshot 2025-09-06 130136

Instance types for testing:

  • t3.medium, no trunking, maxPods=17
Screenshot 2025-09-06 125251
  • m7a.medium, trunking supported, maxPods=8
Screenshot 2025-09-06 125214
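
As a rough check on the two maxPods values above, the sketch below applies the commonly documented EKS formula for secondary-IP mode, maxPods = ENIs * (IPs per ENI - 1) + 2. The ENI and per-ENI IPv4 limits used for each instance type are assumptions taken from the EC2 limits tables, not from this PR, and the function name is made up for illustration.

```go
package main

import "fmt"

// maxPods applies the commonly documented formula for secondary-IP mode:
// every ENI gives up one address to its primary IP, and two host-network
// pods (aws-node and kube-proxy) do not consume a pod IP.
func maxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

func main() {
	// Assumed limits: t3.medium = 3 ENIs x 6 IPv4, m7a.medium = 2 ENIs x 4 IPv4.
	fmt.Println("t3.medium:", maxPods(3, 6))  // 17
	fmt.Println("m7a.medium:", maxPods(2, 4)) // 8

	// If one ENI slot is held back for a trunk ENI that can never attach,
	// the same formula yields a lower ceiling for t3.medium.
	fmt.Println("t3.medium, one slot reserved:", maxPods(3-1, 6)) // 12
}
```

Under these assumptions, a reserved slot on t3.medium leaves room for fewer pods than the advertised maxPods=17, which is the gap the reproduction below shows.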

Reproducing

On the latest CNI release, the non-trunking instance (t3.medium) cannot scale to maxPods
Screenshot 2025-09-06 130044

Test new code

  1. Build the test image with IMAGE=phuhung273/amazon-k8s-cni make docker, which produces phuhung273/amazon-k8s-cni:089fc089
  2. Modify aws-node to use the test image
  3. The non-trunking instance can now scale to maxPods
Screenshot 2025-09-06 124920
  4. The trunking instance still cannot scale to maxPods, as expected
Screenshot 2025-09-06 124834

ipamd.log

{"level":"info","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:775","msg":"Successfully added feature SecurityGroupsForPods to CNINode if not existing"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:779","msg":"IP pool is too low for Network Card 0: available (0) < ENI target (1) * addrsPerENI (3)"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:2523","msg":"IP pool stats for network card 0: Total IPs/Prefixes = 3/0, AssignedIPs/CooldownIPs: 3/0, c.maxIPsPerENI = 3"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:765","msg":"IP stats for Network Card 0 - total IPs: 3, assigned IPs: 3, cooldown IPs: 0"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:783","msg":"Starting to increase pool size for network card 0"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:936","msg":"Node found \"ip-172-31-6-102.ec2.internal\" - no of taints - 1"}
{"level":"debug","ts":"2025-09-06T05:40:19.098Z","caller":"ipamd/ipamd.go:783","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
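
The first debug line in this log prints the comparison that triggers a pool increase. The sketch below only restates that comparison with the numbers from the log; the function name is hypothetical, and treating available as total minus assigned is an assumption (cooldown is 0 in this log, so it does not affect the result).

```go
package main

import "fmt"

// poolTooLow restates the comparison printed in the log message:
// available < ENI target * addrsPerENI.
func poolTooLow(available, eniTarget, addrsPerENI int) bool {
	return available < eniTarget*addrsPerENI
}

func main() {
	// Values copied from the log lines for network card 0.
	total, assigned := 3, 3
	available := total - assigned // assumption; cooldown IPs are 0 here

	fmt.Println(available, poolTooLow(available, 1, 3)) // 0 true
}
```

The pool is too low, so ipamd tries to grow it, and the final log line shows that attempt being skipped because the ENI limit has been reached.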

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Can the team show me how to test this case?

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:

Optimize ENI slot reservation for non-supported instance type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@phuhung273 phuhung273 requested a review from a team as a code owner April 6, 2025 14:10
@phuhung273 phuhung273 changed the title optimize ENI slot reservation for non-supported instance type Optimize ENI slot reservation for non-supported instance type Apr 6, 2025
@phuhung273 phuhung273 marked this pull request as draft April 6, 2025 14:17
@phuhung273 phuhung273 marked this pull request as ready for review April 7, 2025 05:06
@jayanthvn jayanthvn requested a review from Copilot May 27, 2025 01:38

Copilot AI left a comment

Pull Request Overview

This PR prevents ENI slot reservation on instance types that don’t support ENI trunking by introducing a check in the IPAMD logic and wiring it through AWS utils and tests.

  • Add lists of non-supported ENI trunking instance types/families and implement IsENITrunkingSupported
  • Update hasRoomForEni to account for trunking support
  • Extend mocks and unit tests in both ipamd and awsutils packages to cover the new trunking flag

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| pkg/ipamd/ipamd.go | Added IsENITrunkingSupported check in hasRoomForEni |
| pkg/ipamd/ipamd_test.go | Expanded test cases (testIncreaseIPPool) to include trunking |
| pkg/awsutils/no_eni_trunking_instance_types.go | Defined noENITrunkingInstanceTypes and noENITrunkingInstanceFamilies |
| pkg/awsutils/awsutils.go | Implemented IsENITrunkingSupported using new slices |
| pkg/awsutils/mocks/awsutils_mocks.go | Added mock recorder and implementation for IsENITrunkingSupported |
| pkg/awsutils/awsutils_test.go | Added unit tests for IsENITrunkingSupported |
Comments suppressed due to low confidence (3)

pkg/ipamd/ipamd_test.go:514

  • The parameter name 'unschedulabeNode' is misspelled and may confuse readers. Consider renaming it to 'unschedulableNode'.
func testIncreaseIPPool(t *testing.T, useENIConfig bool, unschedulabeNode bool, subnetDiscovery bool, eniTrunking bool) {

pkg/ipamd/ipamd.go:2236

  • Consider adding unit tests for hasRoomForEni to verify its behavior when ENI trunking is supported versus not supported, ensuring the trunkEni offset is applied correctly.
if c.awsClient.IsENITrunkingSupported() && c.enablePodENI && c.dataStore.GetTrunkENI() == "" {
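
To make the effect of that condition concrete, here is a minimal sketch of the slot accounting it guards. The `node` struct and `hasRoomForENI` function are hypothetical stand-ins written for this example; the real hasRoomForEni reads its state from the IPAM context and may account for more (for instance unmanaged ENIs). The ENI limits are assumed values for the two instance types used in the manual test.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the state ipamd consults.
type node struct {
	maxENI            int
	attachedENIs      int
	trunkingSupported bool   // what IsENITrunkingSupported() would report
	podENIEnabled     bool   // ENABLE_POD_ENI=true
	trunkENI          string // empty until a trunk ENI is attached
}

// hasRoomForENI holds back one slot only when a trunk ENI can still arrive.
func hasRoomForENI(n node) bool {
	reserved := 0
	if n.trunkingSupported && n.podENIEnabled && n.trunkENI == "" {
		reserved = 1
	}
	return n.attachedENIs < n.maxENI-reserved
}

func main() {
	// Assumed limits: t3.medium has 3 ENI slots, m7a.medium has 2.
	t3 := node{maxENI: 3, attachedENIs: 2, trunkingSupported: false, podENIEnabled: true}
	m7a := node{maxENI: 2, attachedENIs: 1, trunkingSupported: true, podENIEnabled: true}

	fmt.Println(hasRoomForENI(t3))  // true: no slot is reserved, a third ENI may attach
	fmt.Println(hasRoomForENI(m7a)) // false: the last slot is kept for the trunk ENI
}
```

Without the trunking check, the t3 case would also reserve a slot and return false, even though no trunk ENI can ever be attached to that instance.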

pkg/ipamd/ipamd_test.go:668

  • [nitpick] This test sets MY_NODE_NAME but doesn’t unset it; environment state may leak between tests. Consider using defer os.Unsetenv("MY_NODE_NAME") or explicitly calling os.Unsetenv in teardown.
func TestIncreasePrefixPoolDefault(t *testing.T) {

@phuhung273 phuhung273 force-pushed the non-supported-eni-trunking-instance-type branch from 92b3109 to d8372ce Compare July 19, 2025 04:04
@phuhung273
Contributor Author

Hi @jayanthvn @yash97, do you think this PR is reasonable to go ahead with?

@jayanthvn jayanthvn requested a review from jaydeokar September 3, 2025 19:06
@jaydeokar
Collaborator

@phuhung273 could you rebase your changes?

@phuhung273 phuhung273 force-pushed the non-supported-eni-trunking-instance-type branch from d8372ce to 19f9fd3 Compare September 3, 2025 23:14
@phuhung273
Contributor Author

Thanks for taking a look @jaydeokar, I just rebased the branch

@jaydeokar
Collaborator

Thanks for making this change. Since the set of instance types and families that currently don't support trunking is fixed, let's add this to scripts/gen_vpc_ip_limits.go as a new property, IsTrunkingEnabled.
This script generates the file pkg/vpc/vpc_ip_resource_limit.go, which the CNI refers to for instance details.

The default value should be true; set it to false only for the instance types and families that you've added in the new file.
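
A minimal sketch of that default-true rule, assuming the family is the part of the instance type before the first dot. The exclusion lists here are illustrative only (t3 is included because t3.medium is the non-trunking example in this PR); the function and variable names are hypothetical and not taken from the generator script.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative exclusion lists; the real ones are defined in the PR.
var (
	noTrunkingTypes    = map[string]bool{}
	noTrunkingFamilies = map[string]bool{"t3": true}
)

// isTrunkingEnabled defaults to true and returns false only when the
// instance type or its family appears in an exclusion list.
func isTrunkingEnabled(instanceType string) bool {
	family, _, _ := strings.Cut(instanceType, ".")
	return !noTrunkingTypes[instanceType] && !noTrunkingFamilies[family]
}

func main() {
	for _, it := range []string{"t3.medium", "m7a.medium"} {
		fmt.Printf("%s IsTrunkingEnabled=%v\n", it, isTrunkingEnabled(it))
	}
}
```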

@jaydeokar
Collaborator

For testing:
I'd recommend creating a cluster with two or three managed node groups. One of the managed node groups should use a supported instance type and another a non-supported one.

  1. Set ENABLE_POD_ENI=true on aws-node
  2. Verify that SGP is enabled (you should see a custom resource of type CNINode for each instance)
  3. Add a deployment and scale it up until it saturates the node (max-pods). On the non-supported node, it should be able to scale up to the max-pods value, whereas on the trunking-supported node it should not

@yash97
Contributor

yash97 commented Sep 4, 2025

Instead of maintaining a list of non-supported instances, can you explore using this file instead: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go? You can check the IsTrunkingCompatible field.
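
A hedged sketch of what a lookup against such a limits table could look like. The struct, the map, and its two entries are local stand-ins written for this example, not copied from the linked file; only the IsTrunkingCompatible field name comes from the comment above. The function returns a second value so the caller decides what to do with unknown instance types.

```go
package main

import "fmt"

// instanceLimits is a local stand-in for one entry of a limits table.
type instanceLimits struct {
	Interface            int
	IPv4PerInterface     int
	IsTrunkingCompatible bool
}

// limits holds illustrative data for the two instance types tested in this PR.
var limits = map[string]instanceLimits{
	"t3.medium":  {Interface: 3, IPv4PerInterface: 6, IsTrunkingCompatible: false},
	"m7a.medium": {Interface: 2, IPv4PerInterface: 4, IsTrunkingCompatible: true},
}

// trunkingCompatible reports the flag and whether the type was found.
func trunkingCompatible(instanceType string) (compatible, known bool) {
	l, ok := limits[instanceType]
	return l.IsTrunkingCompatible, ok
}

func main() {
	for _, it := range []string{"t3.medium", "m7a.medium", "unknown.type"} {
		c, known := trunkingCompatible(it)
		fmt.Printf("%s compatible=%v known=%v\n", it, c, known)
	}
}
```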

}

// IsENITrunkingSupported return true if the instance type is not in non-supported list
func (cache *EC2InstanceMetadataCache) IsENITrunkingSupported() bool {

Contributor Author

I didn't even know this super helpful file existed. Thank you so much. I've updated the function to use it.

@phuhung273 phuhung273 force-pushed the non-supported-eni-trunking-instance-type branch from 19f9fd3 to 089fc08 Compare September 5, 2025 11:20
@phuhung273
Contributor Author

Thanks @jaydeokar for the test instructions. Let me try them.

@phuhung273 phuhung273 requested a review from yash97 September 8, 2025 00:24
@phuhung273
Contributor Author

Hi team, I've included an e2e manual test in the PR description.

I also found a similar SNAT integration test, which likewise requires creating a nodegroup:

props = utils.NodeGroupProperties{
    NgLabelKey:    "test-label-key",
    NgLabelVal:    "test-label-val",
    AsgSize:       1,
    NodeGroupName: "snat-test-ng",
    Subnet: []string{
        privateSubnetId,
    },
    InstanceType: "m5.large",
    KeyPairName:  DEFAULT_KEY_PAIR,
}
err = utils.CreateAndWaitTillSelfManagedNGReady(f, props)

Let me know if you want me to add an integration test @jaydeokar @yash97; I will learn how from the code base. Thanks team.

@jaydeokar
Collaborator

Yes please, could you add a small test suite? You can create managed nodegroups as well; it doesn't have to be self-managed in this case. You just need to pass the right instance type.

The change overall looks good to me. Thanks for working on this.

@phuhung273 phuhung273 force-pushed the non-supported-eni-trunking-instance-type branch from 089fc08 to f0fe88e Compare September 23, 2025 15:17
@phuhung273
Contributor Author

phuhung273 commented Sep 23, 2025

I've added a new integration suite using Managed NodeGroup as discussed. Test output is in PR description. Let me know what you think @jaydeokar. Thank you.

@jaydeokar jaydeokar requested a review from Copilot September 24, 2025 00:26
@jaydeokar jaydeokar added this to the v1.20.3 milestone Sep 24, 2025

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

@jaydeokar jaydeokar force-pushed the non-supported-eni-trunking-instance-type branch from f0fe88e to dadd07a Compare September 24, 2025 19:30
@jaydeokar jaydeokar removed this from the v1.20.3 milestone Sep 24, 2025
@jaydeokar jaydeokar added this to the v1.20.4 milestone Sep 24, 2025
@jaydeokar jaydeokar self-assigned this Oct 1, 2025
Collaborator

@jaydeokar jaydeokar left a comment

lgtm

@jaydeokar
Collaborator

@phuhung273, thanks for making the change. We'll be tracking this with the next CNI release

@phuhung273
Contributor Author

Thanks so much team for your guidance @jaydeokar @yash97 @jayanthvn.

@jaydeokar jaydeokar merged commit a637a52 into aws:master Oct 1, 2025
7 checks passed
@phuhung273 phuhung273 deleted the non-supported-eni-trunking-instance-type branch October 1, 2025 06:25

Development

Successfully merging this pull request may close these issues.

VPC CNI shouldn't reserve a ENI for trunk ENI if the instance doesn't support ENI trunking

3 participants