Service CIDR must be manually configured when using kube-proxy in nftables mode with VPC CNI iptables-nft #3563

@shaikatzz

What happened?

When running kube-proxy in native nftables mode while the VPC CNI uses iptables-nft, pods on secondary ENIs cannot reach Kubernetes services. Their traffic is incorrectly marked 0x80 and routed via eth0 instead of the secondary interface the pod is attached to.

Root Cause Analysis

Using nft monitor trace, I identified the issue:

  1. MANGLE PREROUTING (priority -150) runs before kube-proxy's NAT DNAT
  2. VPC CNI's CONNMARK --restore-mark rule in mangle restores mark 0x80 from conntrack
  3. At this point, destination is still the ClusterIP (not yet DNAT'd to pod IP)
  4. The routing decision uses fwmark 0x80 → routes via main table → eth0
  5. kube-proxy DNAT happens after the routing decision

Trace evidence:

ip mangle PREROUTING rule iifname "eni*" xt match "comment" counter packets 3325625 bytes 644878056 xt target "CONNMARK" (verdict continue)
ip mangle PREROUTING policy accept meta mark 0x00000080   ← MARK SET BEFORE DNAT
...
ip mangle FORWARD packet: iif "eni1caf450b99a" oif "eth0" ... ip daddr 10.71.20.123  ← WRONG INTERFACE

The packet enters with dst=172.20.96.189 (the ClusterIP), is marked 0x80 in mangle PREROUTING, the routing decision selects eth0, and only then does kube-proxy's DNAT rewrite the destination to the pod IP.
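
For anyone reproducing this, the commands below show the moving pieces on an affected node (a sketch; on Bottlerocket they need to run from the admin container or a privileged debug pod, and the mark/interface values are from my environment):

# CONNMARK restore rules installed by the CNI via iptables-nft
# (these land in the ip mangle table, hook priority -150)
iptables-nft -t mangle -S PREROUTING

# Policy-routing rule that sends fwmark 0x80 traffic to the main table (→ eth0)
ip rule show

# kube-proxy's nftables rules live in their own table, separate from ip nat/mangle
nft list tables ip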

Why this doesn't happen with kube-proxy in iptables mode

In iptables mode, both VPC CNI and kube-proxy use the same ip nat table. The rule ordering within a single table is deterministic - kube-proxy's DNAT rules in PREROUTING run before VPC CNI's CONNMARK chain checks the destination.

In nftables mode, kube-proxy creates rules in its own ip kube-proxy table while VPC CNI (via iptables-nft) creates rules in the ip nat table. The mangle PREROUTING chain (hook priority -150) runs before both, and the CONNMARK restore happens there.
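
To see the ordering directly, list the base chains hooked at prerouting and compare priorities (lower runs first); this is just an inspection sketch:

# Base chains hooked at prerouting, with their priorities; lower numbers run first.
# mangle (-150, where the CONNMARK restore lives) fires before dstnat (-100, where
# both the CNI's iptables-nft NAT rules and kube-proxy's nftables DNAT attach).
nft list ruleset ip | grep 'hook prerouting'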

Expected behavior

  • Traffic from pods on secondary ENIs to ClusterIP services should be routed via the correct interface (eth1, eth2, etc.)
  • Service CIDR should be automatically excluded from CONNMARK, or documentation should clearly state this requirement for nftables mode

Workaround

Manually set the cluster's service CIDR in the AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS environment variable on the aws-node daemonset:

kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS="172.20.0.0/16"
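
To verify it took effect (a sketch; the grep patterns assume the 172.20.0.0/16 service CIDR above, and the CNI's chain layout can differ between versions):

# Confirm the variable is now set on the daemonset
kubectl set env daemonset aws-node -n kube-system --list | grep EXCLUDE_SNAT

# After the aws-node pods roll, the service CIDR should show up in the CNI's
# nat-table exclusion rules
iptables-nft -t nat -S | grep '172.20.0.0/16'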

Suggested improvements

  1. Documentation: Add a note in the README and troubleshooting guide that when using kube-proxy in nftables mode, AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS must be configured with the cluster's service CIDR.

  2. Consider auto-detection: the CNI could query the Kubernetes API for the service CIDR (see the sketch below).
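
For reference, on clusters where the ServiceCIDR API is available (GA in recent Kubernetes releases, which covers the 1.34 cluster here), the value is already queryable from the API; a rough sketch of what auto-detection could read, assuming the default ServiceCIDR object name:

# The default ServiceCIDR object is typically named "kubernetes"
kubectl get servicecidr kubernetes -o jsonpath='{.spec.cidrs}'

On older clusters the CNI would have to fall back to configuration (or to the kube-apiserver's --service-cluster-ip-range flag), so documenting the env var is still worthwhile.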

Environment

  • VPC CNI version: 1.21.1
  • Kubernetes version: 1.34.2
  • kube-proxy version: 1.34.2
  • OS: Bottlerocket OS 1.51.0

Labels

priority/P0 (Highest priority. Someone needs to actively work on this.)
