fix: calico dnat network policy for egress api server#383

Open
cloud-j-luna wants to merge 8 commits into main from fix/permissions-netpol

Conversation

@cloud-j-luna
Member

This is WIP with messy code.

@coderabbitai

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c5a0e57f-2080-4800-bd8b-334fe594d931

📥 Commits

Reviewing files that changed from the base of the PR and between 25f9dd6 and 034e9d6.

📒 Files selected for processing (1)
  • cmd/provider-services/cmd/run.go

Walkthrough

Discover Kubernetes API server endpoints and add per-service egress NetworkPolicy generation for services that declare read permissions when deployment network policies are enabled.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Network Policy Generation (cluster/kube/builder/netpol.go) | Add per-service egress NetworkPolicy creation for services with non-empty Params.Permissions.Read. Policies select pods by service label, allow egress to each discovered API server IP as <ip>/32, and restrict ports to the deduplicated set of API server endpoint ports (TCP only). Added helper serviceHasReadPermissions. |
| Settings Configuration (cluster/kube/builder/settings.go) | Add exported field APIServerEndpoints []net.TCPAddr to builder.Settings to hold discovered API server backend addresses; import net. |
| Endpoint Discovery (cmd/provider-services/cmd/run.go) | When network policies are enabled, create a kube client from context and GET the "kubernetes" Endpoints resource in the default namespace; append all {IP,port} pairs into kubeSettings.APIServerEndpoints as net.TCPAddr. Returns an error if the client/endpoints cannot be obtained or if no endpoint IP/port pairs are found; logs discovered endpoints on success. |

Sequence Diagram

sequenceDiagram
    participant Run as doRunCmd
    participant KubeClient as KubeClient
    participant Endpoints as "kubernetes Endpoints"
    participant Settings as "builder.Settings"
    participant Builder as "NetPol Builder"

    Run->>KubeClient: create client from context
    Run->>Endpoints: GET "kubernetes" Endpoints (default ns)
    Endpoints-->>Run: return subsets with addresses & ports
    Run->>Settings: append each {IP,port} as net.TCPAddr
    Run->>Builder: call ValidateSettings / Create netpols
    Builder->>Builder: for each service with read perms
    Builder->>Endpoints: allow egress to each APIServerEndpoint (IP/32) on deduplicated TCP ports

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped to the cluster, sniffed endpoints with glee,

IPs and ports gathered, a neat little spree.
For each service that reads I wove an egress gate,
Tight threads of TCP ports keep the path straight.
Hooray — safe hops through the network, tidy and light.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description 'This is WIP with messy code' is extremely vague and provides no meaningful information about the changeset or its objectives. | Replace the description with details about the changes: the new per-service NetworkPolicy generation, APIServerEndpoints field addition, and how endpoint discovery is implemented. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix: calico dnat network policy for egress api server' is directly related to the changeset, which implements per-service egress NetworkPolicy generation for API server endpoints. |


// (from the "kubernetes" endpoints in the default namespace, not the ClusterIP).
// This is needed for network policies because CNIs like Calico evaluate
// egress rules after DNAT, so the ClusterIP is not what gets matched.
APIServerEndpointIP string
Member

@troian troian Apr 1, 2026

replace APIServerEndpointIP and APIServerEndpointIP with KubeAPI url.URL

Member Author

I was thinking about using the TCPAddr type (https://pkg.go.dev/net#TCPAddr) instead after refactoring, as it holds only the IP and port, without the scheme and all the other URL fields. Thoughts?

Member

Are endpoints always IP addresses, though?

Member Author

Yes. Although the type definition has a Hostname field, it is optional; the IP address is always present.
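
For readers following the `url.URL` versus `net.TCPAddr` discussion, the following Go sketch shows roughly what each type carries and what converting between them involves. The function name `tcpAddrFromURL` and the sample addresses are hypothetical and do not come from the PR; the point is only that a URL host may be a hostname, so a conversion has to check that it parses as an IP.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// tcpAddrFromURL is a hypothetical helper for illustration. It accepts a URL
// string and returns a net.TCPAddr only when the host part is a literal IP
// and the port is numeric.
func tcpAddrFromURL(raw string) (net.TCPAddr, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return net.TCPAddr{}, fmt.Errorf("parse url: %w", err)
	}

	// url.URL keeps host and port as strings, so both need parsing.
	ip := net.ParseIP(u.Hostname())
	if ip == nil {
		return net.TCPAddr{}, fmt.Errorf("host %q is not an IP address", u.Hostname())
	}

	port, err := strconv.Atoi(u.Port())
	if err != nil {
		return net.TCPAddr{}, fmt.Errorf("invalid port %q: %w", u.Port(), err)
	}

	return net.TCPAddr{IP: ip, Port: port}, nil
}

func main() {
	addr, err := tcpAddrFromURL("https://10.0.0.10:6443")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// TCPAddr exposes a typed IP and an integer port, with no scheme or path.
	fmt.Println(addr.IP, addr.Port, addr.String())

	// A hostname-based URL is valid as a URL but has no IP to put in a CIDR.
	if _, err := tcpAddrFromURL("https://kubernetes.default.svc:443"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Since a network policy CIDR needs a literal IP, a type that already holds a parsed IP and an integer port avoids this conversion step.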

@cloud-j-luna cloud-j-luna requested a review from troian April 1, 2026 18:48
cloud-j-luna and others added 3 commits April 1, 2026 21:02
Signed-off-by: Joao Luna <7607329+cloud-j-luna@users.noreply.github.com>
@cloud-j-luna cloud-j-luna marked this pull request as ready for review April 2, 2026 10:54
@cloud-j-luna cloud-j-luna requested a review from a team as a code owner April 2, 2026 10:54
@cloud-j-luna cloud-j-luna added this to the provider 0.12.0 milestone Apr 2, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cluster/kube/builder/settings.go (1)

57-68: ⚠️ Potential issue | 🟠 Major

Validate APIServerEndpoint when network policies are enabled.

ValidateSettings still ignores this field, so a discovery miss in cmd/provider-services/cmd/run.go passes startup and only fails later when cluster/kube/builder/netpol.go builds the API-server egress rule. Reject empty IPs and ports outside 1..65535 here so misconfiguration fails fast.

Proposed fix
 func ValidateSettings(settings Settings) error {
+	if settings.NetworkPoliciesEnabled {
+		if len(settings.APIServerEndpoint.IP) == 0 ||
+			settings.APIServerEndpoint.Port <= 0 ||
+			settings.APIServerEndpoint.Port > 65535 {
+			return fmt.Errorf("%w: invalid API server endpoint", ErrSettingsValidation)
+		}
+	}
+
 	if settings.DeploymentIngressStaticHosts {
 		if settings.DeploymentIngressDomain == "" {
 			return fmt.Errorf("%w: empty ingress domain", ErrSettingsValidation)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cluster/kube/builder/settings.go` around lines 57 - 68, ValidateSettings
currently ignores Settings.APIServerEndpoint so invalid or empty API server
endpoints slip through; update ValidateSettings to validate APIServerEndpoint on
Settings when network policies are enabled (or whenever
DeploymentNetworkPolicies flag is true) by parsing the endpoint into host:port,
rejecting empty host/IP and ports not in range 1..65535, and returning
fmt.Errorf("%w: ...", ErrSettingsValidation) on failure; use the
Settings.APIServerEndpoint symbol and ErrSettingsValidation in the new checks so
misconfiguration fails fast before netpol.go's API-server egress rule is built.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster/kube/builder/netpol.go`:
- Around line 244-247: The code only checks APIServerEndpoint.Port for range but
allows Port==0 and doesn't validate the IP; update the validation for
b.settings.APIServerEndpoint so that APIServerEndpoint.IP is non-empty (and
optionally a valid IP string) and APIServerEndpoint.Port is in 1..65535 (reject
0); return an error early from the same function instead of proceeding to the
CIDR construction (the code that builds the CIDR from APIServerEndpoint.IP and
the rule-appending logic that uses APIServerEndpoint.Port should only run after
these checks pass).

In `@cmd/provider-services/cmd/run.go`:
- Around line 551-560: The current endpoint discovery only captures the first
backend and assigns it to kubeSettings.APIServerEndpoint (using
kubeAPIServerEndpointName), which breaks HA control planes; modify the discovery
loop to collect all subset.Addresses (each IP from subset.Addresses[*].IP) into
a slice stored on the settings/builder (e.g., add a field like
APIServerEndpoints []net.TCPAddr or []net.IP on kubeSettings/builder.Settings)
instead of overwriting kubeSettings.APIServerEndpoint, and then update netpol.go
(the code that generates egress peers around the lines that create single /32
peers) to iterate that slice and emit one egress peer per backend address (or
emit a broader CIDR if preferred) so all control-plane backends are authorized.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dedf9751-4b4d-4b03-b2fc-5fef04da5586

📥 Commits

Reviewing files that changed from the base of the PR and between 0094f0c and 8e47b0c.

📒 Files selected for processing (3)
  • cluster/kube/builder/netpol.go
  • cluster/kube/builder/settings.go
  • cmd/provider-services/cmd/run.go

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/provider-services/cmd/run.go`:
- Around line 556-559: The code appends net.TCPAddr using net.ParseIP(addr.IP)
which can return nil and later cause a panic when ep.IP.String() is called (see
kubeSettings.APIServerEndpoints and net.TCPAddr usage); update the block that
builds kubeSettings.APIServerEndpoints to defensively validate the parsed IP
(using the result of net.ParseIP) and skip or log any endpoints with invalid IPs
instead of appending a TCPAddr with a nil IP, ensuring downstream code that
calls ep.IP.String() cannot panic.
- Around line 544-569: When deploymentNetworkPoliciesEnabled is true and
endpoint discovery via fromctx.KubeClientFromCtx/Get for the "kubernetes"
Endpoints fails, do not silently continue with an empty
kubeSettings.APIServerEndpoints; instead propagate an error so the process fails
fast. Replace the logger.Error-only branch (the else after
kc.CoreV1().Endpoints(...).Get) with code that returns or wraps the discovery
error (e.g., return fmt.Errorf("discover API server endpoints for network
policies: %w", err)) from the enclosing function so callers see the failure;
keep successful population of kubeSettings.APIServerEndpoints as-is. This
ensures services with Permissions.Read (see netpol.go) won’t be left silently
non-functional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43e24af5-2f0b-475d-bc5c-3855fbe50108

📥 Commits

Reviewing files that changed from the base of the PR and between 8e47b0c and 25f9dd6.

📒 Files selected for processing (3)
  • cluster/kube/builder/netpol.go
  • cluster/kube/builder/settings.go
  • cmd/provider-services/cmd/run.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cluster/kube/builder/netpol.go


Development

Successfully merging this pull request may close these issues.

Permissioned deployments API Server netpol for egress rules after DNAT

2 participants