
[FlowExporter] Need uniform handling for "External-to-Pod" traffic #5706

@antoninbas

Description

Describe the bug
The FlowExporter / FlowAggregator currently don't support "External-to-Pod" traffic, i.e. traffic coming from outside the cluster and targeting a NodePort or LoadBalancer Service.

This is mentioned in the documentation:

Currently, the Flow Exporter feature provides visibility for Pod-to-Pod, Pod-to-Service and Pod-to-External network flows along with the associated statistics such as data throughput (bits per second), packet throughput (packets per second), cumulative byte count and cumulative packet count. Pod-To-Service flow visibility is supported only when Antrea Proxy is enabled, which is the case by default starting with Antrea v0.11. In the future, we will enable the support for External-To-Service flows.

However, when looking at NodePort Service traffic, it seems that not all cases are handled uniformly. In some cases, flows are actually exported to the FlowAggregator (with some misleading log messages), while in other cases, flows are not exported. More details below:

Case 1: NodePort with default externalTrafficPolicy

In this case, the traffic is SNATed (masqueraded) to the Antrea gateway IP before being forwarded to the Pod, so connections are ignored thanks to this code:

// Consider Pod-to-Pod, Pod-To-Service and Pod-To-External flows.
if srcIP == gwIPv4 || dstIP == gwIPv4 {
	continue
}
if srcIP == gwIPv6 || dstIP == gwIPv6 {
	continue
}

In my opinion, this matches the documented behavior, even though ideally we would support this type of connection.

Case 2: NodePort with externalTrafficPolicy=Local

In this case, the client source IP is preserved (no SNAT is applied), so filtering based on the source IP "fails" and connections are actually exported.
For example, the following may be logged by the FlowAggregator (when using the flow logger sink):

1699997594,1699997598,192.168.77.1,10.10.1.5,56171,80,TCP,,,,nginx-hvhjs,default,k8s-node-worker-1,0.0.0.0,0,,,,,,,,,,,,,

192.168.77.1 is the IP address of my local machine, from which I am accessing the NodePort Service. 10.10.1.5 is the IP address of the Pod implementing the Service.

So there is already a discrepancy with case 1. We also see the following warning in the Agent logs:

W1114 21:35:08.430174       1 exporter.go:615] Source IP: 192.168.77.1 doesn't exist in PodCIDRs
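
For reference, here is a minimal sketch (in Go, with hypothetical helper names; this is not the actual FlowExporter code) of the kind of check that could classify such a source instead of emitting the warning: an IP outside all Pod CIDRs would simply be tagged as external.

package main

import (
	"fmt"
	"net"
)

// classifySource is a hypothetical helper: it reports whether an IP belongs to
// one of the cluster's Pod CIDRs. A source outside all Pod CIDRs (such as
// 192.168.77.1 above) would be treated as external rather than logged as a warning.
func classifySource(ip net.IP, podCIDRs []*net.IPNet) string {
	for _, cidr := range podCIDRs {
		if cidr.Contains(ip) {
			return "pod"
		}
	}
	return "external"
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("10.10.1.0/24")
	fmt.Println(classifySource(net.ParseIP("192.168.77.1"), []*net.IPNet{podCIDR})) // external
	fmt.Println(classifySource(net.ParseIP("10.10.1.5"), []*net.IPNet{podCIDR}))    // pod
}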

Case 3: NodePort with Antrea proxyAll enabled

This is similar to case 1, but this time we remove kube-proxy and enable proxyAll in AntreaProxy to handle NodePort traffic.
In this case, the flow is exported:

1699996576,1699996577,192.168.77.1,10.10.1.6,55418,80,TCP,,,,nginx-2zskx,default,k8s-node-worker-1,0.0.0.0,0,,,,,,,,,,,,,

and we see the following logs:

I1114 21:19:37.036002       1 connections.go:128] "Could not retrieve the Service info from antrea-agent-proxier" serviceStr="169.254.0.252:31749/TCP"
W1114 21:19:37.141838       1 exporter.go:615] Source IP: 192.168.77.1 doesn't exist in PodCIDRs

169.254.0.252 is a link-local address used by the proxyAll implementation to redirect traffic from the host network to OVS.
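
Purely as an illustration (the map and helper below are assumptions, not the antrea-agent-proxier API), the proxyAll case could resolve the Service by NodePort when the pre-DNAT destination is the link-local virtual IP, instead of failing the lookup:

package main

import (
	"fmt"
	"net"
)

// virtualNodePortIP is the link-local address that proxyAll uses to steer
// NodePort traffic from the host network into OVS (169.254.0.252 in the logs above).
var virtualNodePortIP = net.ParseIP("169.254.0.252")

// nodePortServices is a hypothetical map from NodePort number to Service name,
// standing in for whatever lookup the proxier could expose for this purpose.
var nodePortServices = map[uint16]string{31749: "default/nginx"}

// serviceForDestination resolves Service info for a pre-DNAT destination: when
// the destination is the proxyAll virtual IP, the NodePort is used as the key;
// otherwise the caller would fall back to the usual ClusterIP lookup.
func serviceForDestination(dstIP net.IP, dstPort uint16) (string, bool) {
	if dstIP.Equal(virtualNodePortIP) {
		svc, ok := nodePortServices[dstPort]
		return svc, ok
	}
	return "", false
}

func main() {
	if svc, ok := serviceForDestination(net.ParseIP("169.254.0.252"), 31749); ok {
		fmt.Println("resolved Service:", svc) // resolved Service: default/nginx
	}
}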

Case 4: NodePort with Pod as source

This is not a common use case by any means, but I thought I would also test this edge case.

The behavior depends on the value of externalTrafficPolicy. When using the default, the connection is treated as "Pod-to-External":

1699999841,1699999844,10.10.1.6,192.168.77.101,52136,30415,TCP,toolbox-pr96d,default,k8s-node-worker-1,antrea-agent-xwf6w,kube-system,k8s-node-worker-1,0.0.0.0,0,,,,,,,,,,,,,

When using externalTrafficPolicy=Local, the connection is actually exported twice:

1700000109,1700000114,10.10.1.6,10.10.1.5,37750,80,TCP,toolbox-pr96d,default,k8s-node-worker-1,nginx-hvhjs,default,k8s-node-worker-1,0.0.0.0,0,,,,,,,,,,,,,
1700000109,1700000114,10.10.1.6,192.168.77.101,37750,30415,TCP,toolbox-pr96d,default,k8s-node-worker-1,antrea-agent-xwf6w,kube-system,k8s-node-worker-1,0.0.0.0,0,,,,,,,,,,,,,

Once as Pod-to-External, and once as a Pod-to-Pod connection.

There is no log message from the FlowExporter in this case.

Note that if we enable proxyAll, results may differ yet again.
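
One possible way to avoid the double export, sketched here with a made-up data model (not the FlowExporter's actual connection store): deduplicate records that share the same original source tuple within one export cycle.

package main

import "fmt"

// connKey identifies a connection by its original source tuple and protocol,
// which is shared by both records above (37750 -> 80 and 37750 -> 30415).
type connKey struct {
	srcIP   string
	srcPort uint16
	proto   string
}

type record struct {
	key   connKey
	dstIP string
}

// dedupe keeps the first record seen for each source tuple within one export
// cycle, so a connection hairpinning through a NodePort is exported only once.
func dedupe(records []record) []record {
	seen := make(map[connKey]bool)
	var out []record
	for _, r := range records {
		if seen[r.key] {
			continue
		}
		seen[r.key] = true
		out = append(out, r)
	}
	return out
}

func main() {
	recs := []record{
		{connKey{"10.10.1.6", 37750, "TCP"}, "10.10.1.5"},      // the Pod-to-Pod view
		{connKey{"10.10.1.6", 37750, "TCP"}, "192.168.77.101"}, // the Pod-to-External view
	}
	fmt.Println(len(dedupe(recs))) // 1
}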

Other cases

There are potentially other cases to consider. In particular, I did not test with LoadBalancer Services. I imagine we have very similar issues.

Versions:
ToT version, which includes PR #5592

What should we do?
The main issue here IMO is the lack of consistency across all cases, based on whether externalTrafficPolicy is set to Local (which determines whether SNAT is needed) and whether proxyAll is enabled.
We should try to add support for "External-to-Pod" traffic and handle all possible cases consistently. We should include the Service information in the exported flow record, and avoid flooding the logs with warnings.
Because accessing a Service using NodePort or LoadBalancer from a Pod is not common, we don't have to handle this case for now. But ideally, we would handle this case gracefully (e.g., treat it as Pod-to-External, or not export it at all).
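
To make the expectation concrete, here is a minimal sketch (illustrative only; the flowType values and helper are not existing FlowExporter code) of the kind of uniform classification we could aim for, based solely on whether the source and destination fall inside the Pod CIDRs:

package main

import (
	"fmt"
	"net"
)

type flowType string

const (
	podToPod      flowType = "PodToPod"
	podToExternal flowType = "PodToExternal"
	externalToPod flowType = "ExternalToPod"
)

// classifyFlow assigns a single flow type based on whether the source and
// destination IPs fall inside the Pod CIDRs, so that the same connection is
// labeled consistently regardless of externalTrafficPolicy or proxyAll.
func classifyFlow(srcIP, dstIP net.IP, podCIDRs []*net.IPNet) flowType {
	inPodCIDR := func(ip net.IP) bool {
		for _, cidr := range podCIDRs {
			if cidr.Contains(ip) {
				return true
			}
		}
		return false
	}
	switch {
	case inPodCIDR(srcIP) && inPodCIDR(dstIP):
		return podToPod
	case inPodCIDR(srcIP):
		return podToExternal
	default:
		return externalToPod
	}
}

func main() {
	_, podCIDR, _ := net.ParseCIDR("10.10.1.0/24")
	cidrs := []*net.IPNet{podCIDR}
	fmt.Println(classifyFlow(net.ParseIP("192.168.77.1"), net.ParseIP("10.10.1.5"), cidrs)) // ExternalToPod
}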

Labels

area/flow-visibility/exporter: Issues or PRs related to the Flow Exporter functions in the Agent
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
