
Conversation


@jaydeokar jaydeokar commented Jul 9, 2025

What type of PR is this?
feature

Which issue does this PR fix?:

N/A

What does this PR do / Why do we need it?:
The Amazon VPC CNI currently manages only network card 0 on all instances, including instances that support multiple network cards. Because pod interfaces are attached only to NIC 0, aggregate bandwidth is limited to that single card. With this change, the CNI manages all available network cards on the instance.
A pod can request access to these network cards via an annotation and make use of all the available bandwidth. For an annotated pod, the CNI creates one interface in the pod network namespace per network card available on the instance. The pod can then use these interfaces for egress traffic with specific bandwidth requirements.
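An illustrative pod spec requesting multi-NIC attachment (the annotation key and value are as described above; the pod name and container image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-nic-pod
  annotations:
    # Tells the VPC CNI to attach one interface per managed network card
    k8s.amazonaws.com/nicConfig: multi-nic-attachment
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "infinity"]
```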

Describing major changes as the change log is significant

CNI

Add flow

  • CNI looks for annotations on the pod via the capabilities option. If the pod carries the annotation k8s.amazonaws.com/nicConfig: multi-nic-attachment, CNI requests multiple IPs, one from each network card available on the instance.
  • IPAMD now returns a list of IPs (the IPAddress field in the gRPC response). The number of IPs returned equals the number of interfaces CNI creates inside the pod.
  • The host veth name for multi-NIC interfaces is generated as <pod-namespace>.<pod-name>.<index>. The first interface of the pod retains the original naming convention, <pod-namespace>.<pod-name>.
  • The container veth name for multi-NIC interfaces has the prefix mNicIf, e.g. mNicIf1, mNicIf2, ...
  • IPAMD now returns a route table number instead of just the device number. The route table number is calculated as (network card index * max ENIs per NIC) + device number + 1. Note that route table number 1 (device 0, network card 0) is the main route table for CNI.
  • CNI now stores the route table number of the interface in the container interface struct's MAC field, which is used to clean up pod networking when IPAMD is down (delete with prevResult).
  • CNI loops over the IP addresses and attaches each one to the pod network.
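The route-table and naming rules above can be sketched as follows (function names are illustrative, not the PR's exact helpers; the PR's route-table helper is CalculateRouteTableId):

```go
package main

import "fmt"

// calculateRouteTableID applies the rule from the description:
// RT = (network card index * max ENIs per NIC) + device number + 1.
// Device 0 on network card 0 therefore maps to route table 1, the
// main route table for CNI.
func calculateRouteTableID(deviceNumber, networkCardIndex, maxENIsPerNetworkCard int) int {
	return networkCardIndex*maxENIsPerNetworkCard + deviceNumber + 1
}

// hostVethName follows <pod-namespace>.<pod-name>.<index> for multi-NIC
// interfaces and keeps the original <pod-namespace>.<pod-name> form for
// the pod's first interface.
func hostVethName(namespace, podName string, index int) string {
	if index == 0 {
		return fmt.Sprintf("%s.%s", namespace, podName)
	}
	return fmt.Sprintf("%s.%s.%d", namespace, podName, index)
}

func main() {
	fmt.Println(calculateRouteTableID(0, 0, 4)) // 1 (main route table)
	fmt.Println(calculateRouteTableID(2, 1, 4)) // 1*4 + 2 + 1 = 7
	fmt.Println(hostVethName("default", "mypod", 0))
	fmt.Println(hostVethName("default", "mypod", 1))
}
```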

Delete flow

  • CNI relies on IPAMD to return the number of IPs and interface to delete
  • If IPAMD is down, CNI uses the prev result to delete the Pod and Host Network

IPAMD

Node Init Flow

  1. IPAMD now initializes all the network cards on the instance and creates a datastore for each network card it needs to manage
  2. CNI by default assigns 1 warm IP and 1 minimum IP for secondary ENIs on network cards > 0
  3. CNI manages all network cards by default, except in the following case
    a. The network card has only an EFA-only ENI attached, with no additional ENIs
  4. CNI manages a network card when
    a. It is empty, i.e. no ENIs are attached to it
    b. It has an EFA ENI attached
    c. It has an EFA-only ENI attached to a device number but also has an ENA device attached on the network card
  5. For both IPv4 and IPv6, it initializes the host network for the primary interface
    a. For IPv6, all traffic egresses out of the ENI it arrived on, i.e. we don't add any SNAT or connmark rules for IPv6
  6. IPAMD now includes checks for UpdatingCIDRRules when the VPC CIDR or the security group associated with the ENIs changes for IPv6 (these checks already existed for IPv4)
  7. On node init, IPAMD initializes all the available datastores and, if a datastore is low, creates interfaces for each network card. A new function for IPv6 ENIs creates 1 ENI per network card with an IPv6 address and then assigns a prefix of size /80. If ENI creation fails (e.g. due to a missing permission), we don't fail node init, but the permission error is visible in the ipamd logs
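The management rules in items 3 and 4 can be condensed into a single predicate. This is a sketch under stated assumptions: the eni type and the shouldManageNetworkCard function are hypothetical stand-ins, not the structs or helpers merged in this PR.

```go
package main

import "fmt"

// eni is a minimal stand-in for the attachment info IPAMD sees per ENI.
type eni struct {
	isEFAOnly bool // EFA-only interface with no ENA capability
}

// shouldManageNetworkCard encodes the node-init rules above: manage the
// card when it is empty or when any ENA-capable ENI (including a regular
// EFA ENI) is attached; skip it only when nothing but EFA-only ENIs are
// attached.
func shouldManageNetworkCard(enis []eni) bool {
	if len(enis) == 0 {
		return true // empty card: CNI can attach its own ENIs
	}
	for _, e := range enis {
		if !e.isEFAOnly {
			return true // at least one ENA device on this card
		}
	}
	return false // card has only EFA-only ENIs
}

func main() {
	fmt.Println(shouldManageNetworkCard(nil))                          // true
	fmt.Println(shouldManageNetworkCard([]eni{{isEFAOnly: true}}))     // false
	fmt.Println(shouldManageNetworkCard([]eni{{isEFAOnly: true}, {}})) // true
}
```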

Testing done on this change:
Yes, ran all the test suites on a single-card instance and ran manual tests on a multi-card instance

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Upgrades should be fine. Downgrading requires deleting the pods that use the multi-NIC annotation before the downgrade; otherwise the pod IPs/host networking setup can leak

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
Yes, customers will now see interfaces attached to NIC > 0 on supported instances


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaydeokar jaydeokar marked this pull request as ready for review July 9, 2025 21:46
@jaydeokar jaydeokar requested a review from a team as a code owner July 9, 2025 21:46
@jaydeokar jaydeokar changed the title Adding multi-NIC changes Adding support in CNI for managing multiple network interface card on the instance Jul 9, 2025
@jaydeokar jaydeokar requested a review from Copilot July 9, 2025 21:56



// Since we always use the dummy interface to store the device number of the network card 0 IP,
// the RT ID for NC-0 is also stored in the container interface entry. This gives us a migration path where
// reading the device number from the dummy interface can eventually be deprecated entirely. It is currently kept for backwards compatibility
Contributor

@M00nF1sh M00nF1sh Jul 11, 2025


It seems we are still creating the "dummy interface" for both the old and the new (multi-NIC) path; however, for multi-NIC you already encode the deviceNumber in the container MAC.

On the delete path, though, your new logic still depends on the "dummy interface" first (which will exist, since you created it).

I think you should structure the code such that the new logic no longer depends on the dummy interface.

We need a clear migration here

Collaborator Author

Just realized we will have to keep the dummy-interface logic for downgrade scenarios. I'll add the migration path

Collaborator Author

Added the details about how we can remove this dummy interface. We have to keep it for now for the downgrade path


func (ds *DataStoreAccess) GetDataStore(networkCard int) *DataStore {

for index, datastore := range ds.DataStores {
Contributor

why not store a mapping of index -> ds?
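The reviewer's suggestion of an index -> datastore mapping could look like the sketch below. This is illustrative only, not the code as merged; the DataStore fields here are placeholders.

```go
package main

import "fmt"

// DataStore stands in for the real per-network-card datastore type.
type DataStore struct {
	networkCard int
}

// DataStoreAccess keys datastores by network card index, so GetDataStore
// is a constant-time map lookup instead of a linear scan over a slice.
type DataStoreAccess struct {
	byCard map[int]*DataStore
}

// GetDataStore returns the datastore for the given network card, or nil
// if that card is not managed.
func (ds *DataStoreAccess) GetDataStore(networkCard int) *DataStore {
	return ds.byCard[networkCard]
}

func main() {
	access := &DataStoreAccess{byCard: map[int]*DataStore{
		0: {networkCard: 0},
		1: {networkCard: 1},
	}}
	fmt.Println(access.GetDataStore(1).networkCard) // 1
	fmt.Println(access.GetDataStore(5) == nil)      // true: unmanaged card
}
```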

@jaydeokar jaydeokar requested a review from Copilot July 14, 2025 18:49


@jaydeokar jaydeokar force-pushed the multi-nic-code-changes branch 2 times, most recently from 45cd88e to b042c0f Compare July 14, 2025 18:57
@jaydeokar jaydeokar requested a review from Copilot July 14, 2025 18:57

Copilot AI left a comment


Pull Request Overview

Adds support for managing multiple network interface cards (NICs) in the Amazon VPC CNI plugin by extending RPC definitions, network utilities, and tests to handle multi-NIC attachments.

  • Protobuf and gRPC stubs updated to return repeated IPAllocationMetadata and include a RequiresMultiNICAttachment flag.
  • Network utility APIs (SetupENINetwork, routing rule methods) extended to accept a network card index and calculate per-NIC route tables.
  • Helper and naming functions added to generate veth names for multi-NIC pods, and lots of tests updated to cover new parameters.

Reviewed Changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 1 comment.

Summary per file:

  • rpc/rpc.proto — Updated messages to support repeated IP allocations and a multi-NIC attachment flag.
  • pkg/networkutils/network.go — Extended ENI setup and rule management to accept networkCard and maxENIPerNIC; added DeleteRulesBySrc.
  • pkg/networkutils/helper.go — Introduced CalculateRouteTableId, CalculatePodIPv4GatewayIP, CalculatePodIPv6GatewayIP, and IsIPv4.
  • pkg/networkutils/names.go — Enhanced veth name generators to accept an interface index for host and container sides.
  • pkg/networkutils/network_test.go — Updated ENI network tests to pass networkCard and maxENIPerNIC parameters.
  • pkg/ipamd/rpc_handler_test.go — Expanded IPAMD RPC handler tests to assert multi-NIC metadata and combined errors.
Comments suppressed due to low confidence (5)

pkg/networkutils/network.go:1363

  • No unit tests cover the new DeleteRulesBySrc method; please add tests to verify that rules are correctly removed for both IPv4 and IPv6 inputs.
func (n *linuxNetwork) DeleteRulesBySrc(eniIP string, isV6 bool) error {

pkg/networkutils/helper.go:13

  • Add unit tests for the new gateway calculation helpers (CalculatePodIPv4GatewayIP and CalculatePodIPv6GatewayIP) and IsIPv4 to ensure correct output across different inputs.
func CalculatePodIPv4GatewayIP(index int) net.IP {

pkg/networkutils/helper.go:5

  • Consider adding a doc comment to explain how the route table ID is calculated, including parameter roles and valid ranges.
func CalculateRouteTableId(deviceNumber int, networkCardIndex int, maxENIsPerNetworkCard int) int {

pkg/networkutils/network.go:1363

  • [nitpick] Parameter name 'isV6' differs from other methods that use 'v6Enabled' or 'v6enabled'; standardize naming for consistency across the API.
func (n *linuxNetwork) DeleteRulesBySrc(eniIP string, isV6 bool) error {

rpc/rpc.proto:36

  • [nitpick] Consider renaming the repeated field 'IPAllocationMetadata' to a plural form such as 'ip_allocations' or 'allocations' to avoid confusion with the message type.
  repeated IPAllocationMetadata IPAllocationMetadata = 2;

@jaydeokar jaydeokar force-pushed the multi-nic-code-changes branch from b042c0f to a8957df Compare July 14, 2025 19:01
@jaydeokar jaydeokar force-pushed the multi-nic-code-changes branch from a8957df to 75ba115 Compare July 14, 2025 19:18
@M00nF1sh M00nF1sh self-assigned this Jul 14, 2025
@M00nF1sh
Copy link
Contributor

/lgtm
/approve
As discussed offline, this code is already in production use by customers; we decided to fast-forward with this change and take improvements as follow-ups

@M00nF1sh M00nF1sh self-requested a review July 14, 2025 21:59
@jaydeokar jaydeokar merged commit 8280a55 into aws:master Jul 14, 2025
6 of 7 checks passed
@jaydeokar jaydeokar added this to the v1.20.0 milestone Jul 15, 2025