Service CIDR must be manually configured when using kube-proxy in nftables mode with VPC CNI iptables-nft #3563

@shaikatzz

What happened?

When running kube-proxy in native nftables mode while the VPC CNI uses iptables-nft, pods on secondary ENIs cannot reach Kubernetes services. Their traffic is incorrectly marked 0x80 and routed via eth0 instead of the secondary interface the pod is attached to.

Root Cause Analysis

Using nft monitor trace, I identified the issue:

  1. MANGLE PREROUTING (priority -150) runs before kube-proxy's NAT DNAT
  2. VPC CNI's CONNMARK --restore-mark rule in mangle restores mark 0x80 from conntrack
  3. At this point, destination is still the ClusterIP (not yet DNAT'd to pod IP)
  4. The routing decision uses fwmark 0x80 → routes via main table → eth0
  5. kube-proxy DNAT happens after the routing decision

Trace evidence:

ip mangle PREROUTING rule iifname "eni*" xt match "comment" counter packets 3325625 bytes 644878056 xt target "CONNMARK" (verdict continue)
ip mangle PREROUTING policy accept meta mark 0x00000080   ← MARK SET BEFORE DNAT
...
ip mangle FORWARD packet: iif "eni1caf450b99a" oif "eth0" ... ip daddr 10.71.20.123  ← WRONG INTERFACE

The packet enters with dst=172.20.96.189 (the ClusterIP), is marked 0x80 in mangle PREROUTING, the routing decision selects eth0, and only then does kube-proxy's DNAT rewrite the destination to the pod IP.
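
For anyone reproducing this, the commands below show the moving pieces on an affected node (a sketch; on Bottlerocket they need to run from the admin container or a privileged debug pod, and the mark/interface values are from my environment):

# CONNMARK restore rules installed by the CNI via iptables-nft
# (these land in the ip mangle table, hook priority -150)
iptables-nft -t mangle -S PREROUTING

# Policy-routing rule that sends fwmark 0x80 traffic to the main table (→ eth0)
ip rule show

# kube-proxy's nftables rules live in their own table, separate from ip nat/mangle
nft list tables ip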

Why this doesn't happen with kube-proxy in iptables mode

In iptables mode, both VPC CNI and kube-proxy use the same ip nat table. The rule ordering within a single table is deterministic - kube-proxy's DNAT rules in PREROUTING run before VPC CNI's CONNMARK chain checks the destination.

In nftables mode, kube-proxy creates rules in its own ip kube-proxy table while VPC CNI (via iptables-nft) creates rules in the ip nat table. The mangle PREROUTING chain (hook priority -150) runs before both, and the CONNMARK restore happens there.
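
To see the ordering directly, list the base chains hooked at prerouting and compare priorities (lower runs first); this is just an inspection sketch:

# Base chains hooked at prerouting, with their priorities; lower numbers run first.
# mangle (-150, where the CONNMARK restore lives) fires before dstnat (-100, where
# both the CNI's iptables-nft NAT rules and kube-proxy's nftables DNAT attach).
nft list ruleset ip | grep 'hook prerouting'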

Expected behavior

  • Traffic from pods on secondary ENIs to ClusterIP services should be routed via the correct interface (eth1, eth2, etc.)
  • Service CIDR should be automatically excluded from CONNMARK, or documentation should clearly state this requirement for nftables mode

Workaround

Manually set the cluster's service CIDR in the AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS environment variable on the aws-node daemonset:

kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS="172.20.0.0/16"
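
To verify it took effect (a sketch; the grep patterns assume the 172.20.0.0/16 service CIDR above, and the CNI's chain layout can differ between versions):

# Confirm the variable is now set on the daemonset
kubectl set env daemonset aws-node -n kube-system --list | grep EXCLUDE_SNAT

# After the aws-node pods roll, the service CIDR should show up in the CNI's
# nat-table exclusion rules
iptables-nft -t nat -S | grep '172.20.0.0/16'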

Suggested improvements

  1. Documentation: Add a note in the README and troubleshooting guide that when using kube-proxy in nftables mode, AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS must be configured with the cluster's service CIDR.

  2. Consider auto-detection: the CNI could query the Kubernetes API for the service CIDR (see the sketch below).
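
For reference, on clusters where the ServiceCIDR API is available (GA in recent Kubernetes releases, which covers the 1.34 cluster here), the value is already queryable from the API; a rough sketch of what auto-detection could read, assuming the default ServiceCIDR object name:

# The default ServiceCIDR object is typically named "kubernetes"
kubectl get servicecidr kubernetes -o jsonpath='{.spec.cidrs}'

On older clusters the CNI would have to fall back to configuration (or to the kube-apiserver's --service-cluster-ip-range flag), so documenting the env var is still worthwhile.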

Environment

  • VPC CNI version: 1.21.1
  • Kubernetes version: 1.34.2
  • kube-proxy version: 1.34.2
  • OS: Bottlerocket OS 1.51.0

Labels

priority/P0 (Highest priority. Someone needs to actively work on this.)
