feat: add event handling strategy changes in kubernetes object monitor #639

tanishagoyal2 · 2025-12-22T15:58:00Z

Summary

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

New Features
- Configurable processing strategy (EXECUTE_REMEDIATION default or STORE_ONLY) via CLI/Helm; controls whether events trigger remediation or are observability-only.
- Health events and exported payloads now include processing strategy metadata.
- Helm/chart and runtime flags added to propagate the configured strategy.
Documentation
- Database schema updated with a processing_strategy column.
Tests
- New unit and e2e tests validating STORE_ONLY vs EXECUTE_REMEDIATION behavior and emitted event contents.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai · 2025-12-22T15:58:15Z

📝 Walkthrough

Walkthrough

Adds a ProcessingStrategy enum and HealthEvent.processingStrategy field; threads a processing-strategy config/flag through object-monitor, publisher, exporter, connector, and store pipelines; filters STORE_ONLY events from remediation/control paths while still exporting them.

Changes

Cohort / File(s)	Summary
Protobuf & generated stubs `data-models/protobufs/health_event.proto`, `health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`, `.../health_event_pb2.pyi`	Add `ProcessingStrategy` enum (`EXECUTE_REMEDIATION`, `STORE_ONLY`) and `HealthEvent.processingStrategy` (field 16); update generated Python descriptors and typings.
Helm chart & manifests `distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml`, `.../templates/configmap.yaml`, `.../templates/deployment.yaml`, `distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml`	Add `processingStrategy` value (default `EXECUTE_REMEDIATION`), conditional ConfigMap entry, `--processing-strategy` container arg, and DB schema addition `processing_strategy VARCHAR(50)` in tilt values.
Object monitor CLI / init / config `health-monitors/kubernetes-object-monitor/main.go`, `pkg/config/types.go`, `pkg/initializer/initializer.go`, `pkg/initializer/init.go`	Add CLI flag/Params field `ProcessingStrategy`; validate/map string→proto enum; pass enum into controller registration, publisher and annotation manager constructors; switch pipeline builder call to processable variant.
Annotation manager `health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go`	Manager gains `processingStrategy pb.ProcessingStrategy`; constructor signature updated; AddMatch/RemoveMatch skip when strategy == STORE_ONLY.
Publisher `health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go`	Publisher gains `processingStrategy pb.ProcessingStrategy`; constructor signature changed to accept it; PublishHealthEvent resolves strategy per-event or default and sets `HealthEvent.ProcessingStrategy` in emitted payload.
Platform connector processing `platform-connectors/pkg/connectors/kubernetes/process_node_events.go`, `.../k8s_platform_connector_test.go`	Add `filterProcessableEvents` to drop STORE_ONLY events; introduce `createK8sEvent`; refactor processing to act only on processable events; add tests covering STORE_ONLY and EXECUTE_REMEDIATION behaviors.
Store client pipelines `store-client/pkg/client/pipeline_builder.go`, `.../mongodb_pipeline_builder.go`, `.../postgresql_pipeline_builder.go`, `.../pipeline_builder_test.go`	Add `BuildProcessableHealthEventInsertsPipeline()` to interface and implementations to match inserts where `processingStrategy=EXECUTE_REMEDIATION`; add unit test.
Event exporter & tests/helpers `event-exporter/pkg/transformer/cloudevents.go`, `.../cloudevents_test.go`, `tests/event_exporter_test.go`, `tests/helpers/event_exporter.go`	Include `processingStrategy` in CloudEvent data; update tests and helper validation to assert expected strategy; add helper to find events by node/check.
Fault-quarantine & initializer tests `fault-quarantine/pkg/evaluator/rule_evaluator_test.go`, `fault-quarantine/pkg/initializer/init.go`, `tests/fault_quarantine_test.go`	Update test expectations to include `processingStrategy`; initializer switched to build processable inserts pipeline; add end-to-end test verifying behavior under both strategies.
Tests, templates, kube helpers `tests/helpers/healthevent.go`, `tests/helpers/kube.go`, `tests/kubernetes_object_monitor_test.go`, `tests/platform-connector_test.go`, `tests/*`	Add `ProcessingStrategy` to test HealthEventTemplate and builder; add kube helpers (deployment arg set/remove, rollout wait, EnsureNodeEventNotPresent); add integration tests exercising STORE_ONLY and EXECUTE_REMEDIATION paths.
Docs / SQL schema `docs/postgresql-schema.sql`	Add `processing_strategy VARCHAR(50)` column to `health_events` schema.

Sequence Diagram

sequenceDiagram
    actor Operator
    participant Helm as Helm Values
    participant Monitor as Object Monitor
    participant Publisher
    participant Exporter as Event Exporter
    participant Connector as Platform Connector
    participant Node as Kubernetes Node

    Operator->>Helm: set processingStrategy (EXECUTE_REMEDIATION|STORE_ONLY)
    Helm->>Monitor: supply flag/value
    Monitor->>Monitor: validate & map -> pb.ProcessingStrategy
    Monitor->>Publisher: New(..., processingStrategy)
    Monitor->>Monitor: create AnnotationMgr(processingStrategy)

    rect rgb(200,255,200)
    Note over Monitor,Connector: EXECUTE_REMEDIATION path
    Monitor->>Publisher: PublishHealthEvent(event)
    Publisher->>Publisher: resolve strategy (event or default)
    Publisher->>Exporter: emit CloudEvent (includes processingStrategy)
    Exporter->>Connector: forward event
    Connector->>Connector: filterProcessableEvents() -> passes
    Connector->>Node: create/update conditions & events
    end

    rect rgb(255,230,230)
    Note over Monitor,Connector: STORE_ONLY path
    Monitor->>Publisher: PublishHealthEvent(event)
    Publisher->>Exporter: emit CloudEvent (processingStrategy=STORE_ONLY)
    Exporter->>Connector: forward event
    Connector->>Connector: filterProcessableEvents() -> drops event
    Connector->>Node: skip condition/annotation updates
    Monitor->>Monitor: AnnotationMgr Add/Remove skip when STORE_ONLY
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

"A rabbit hops with a hopeful grin,
A new strategy tucked neatly in,
STORE_ONLY watches, EXECUTE remediates,
Events flow, the pipeline navigates 🐇
Cheers for the tiny, careful win!"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: add event handling strategy changes in kubernetes object monitor' accurately describes the main change—introducing a ProcessingStrategy feature across multiple components to control event handling behavior (EXECUTE_REMEDIATION vs STORE_ONLY) in the Kubernetes object monitor and related systems.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (8)

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)

107-112: Well-documented configuration option.

The new processingStrategy field is properly documented with valid values and behavioral explanations. The default EXECUTE_REMEDIATION maintains backward compatibility.

Consider adding validation in the Helm templates to fail early if an invalid value is provided (e.g., via .Values.processingStrategy | upper | mustRegexMatch "^(EXECUTE_REMEDIATION|STORE_ONLY)$"), though this is optional if validation happens at the application level.
tests/helpers/kube.go (1)
2341-2366: Missing container-not-found check.

Unlike SetDeploymentArgs (lines 2282-2284), RemoveDeploymentArgs doesn't return an error if the specified containerName is not found in the deployment. This inconsistency could hide configuration errors.
🔎 Proposed fix
 func RemoveDeploymentArgs(
 	ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
 ) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		deployment := &appsv1.Deployment{}
 		if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
 			return err
 		}

 		if len(deployment.Spec.Template.Spec.Containers) == 0 {
 			return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
 		}

+		found := false
+
 		for i := range deployment.Spec.Template.Spec.Containers {
 			container := &deployment.Spec.Template.Spec.Containers[i]

 			if containerName != "" && container.Name != containerName {
 				continue
 			}

+			found = true
+
 			removeArgsFromContainer(container, args)
 		}

+		if containerName != "" && !found {
+			return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+		}
+
 		return c.Resources().Update(ctx, deployment)
 	})
 }
health-monitors/kubernetes-object-monitor/pkg/config/types.go (2)

49-50: Consider adding validation for ProcessingStrategy values.

The ProcessingStrategy field is a string without validation, which could allow invalid values to be silently accepted. Consider validating against allowed values (e.g., "EXECUTE_REMEDIATION", "STORE_ONLY") either at config load time or through a custom unmarshal function.

50-50: Add godoc comment for exported field.

The exported ProcessingStrategy field should have a godoc comment that describes its purpose and allowed values, following Go conventions.

As per coding guidelines: "Function comments required for all exported Go functions" (applies to exported fields as well).
health-monitors/kubernetes-object-monitor/main.go (1)
71-75: Consider validating the processing-strategy flag value.

The flag accepts any string value but only "EXECUTE_REMEDIATION" and "STORE_ONLY" are valid. Consider adding validation in the run() function to fail fast with a clear error message if an invalid value is provided.
🔎 Suggested validation
 func run() error {
 	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
 	defer stop()
 
+	// Validate processing strategy
+	validStrategies := []string{"EXECUTE_REMEDIATION", "STORE_ONLY"}
+	isValid := false
+	for _, valid := range validStrategies {
+		if *processingStrategyFlag == valid {
+			isValid = true
+			break
+		}
+	}
+	if !isValid {
+		return fmt.Errorf("invalid processing-strategy %q, must be one of: %v", *processingStrategyFlag, validStrategies)
+	}
+
 	params := initializer.Params{
 		PolicyConfigPath:        *policyConfigPath,
Also applies to: 101-101
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
50-73: Strategy override logic is well-designed.

The pattern of defaulting to the publisher's strategy while allowing per-policy overrides is flexible. The validation against the generated pb.ProcessingStrategy_value map ensures consistency with the protobuf definition.

One minor observation: consider adding context to the error message (e.g., "policy %s: unexpected processingStrategy...") to help identify which policy has the invalid configuration during debugging.
🔎 Optional: Add policy context to error message
 	if policy.HealthEvent.ProcessingStrategy != "" {
 		value, ok := pb.ProcessingStrategy_value[policy.HealthEvent.ProcessingStrategy]
 		if !ok {
-			return fmt.Errorf("unexpected processingStrategy value: %q", policy.HealthEvent.ProcessingStrategy)
+			return fmt.Errorf("policy %q: unexpected processingStrategy value: %q", policy.Name, policy.HealthEvent.ProcessingStrategy)
 		}
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Consider using Debug log level for filtered events.

Using slog.Info for every skipped STORE_ONLY event could generate high log volume in production. Consider slog.Debug for consistency with similar skip-logging patterns elsewhere in the codebase (e.g., manager.go lines 51, 70).
🔎 Proposed change
 	for _, healthEvent := range healthEvents.Events {
 		if healthEvent.ProcessingStrategy == protos.ProcessingStrategy_STORE_ONLY {
-			slog.Info("Skipping STORE_ONLY health event (no node conditions / node events)",
+			slog.Debug("Skipping STORE_ONLY health event (no node conditions / node events)",
 				"node", healthEvent.NodeName,
 				"checkName", healthEvent.CheckName,
 				"agent", healthEvent.Agent)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (1)
116-116: Minor: duplicate cast of strategyValue.

The pb.ProcessingStrategy(strategyValue) cast is performed twice (once on line 91 for the publisher, and again on line 116 for registerControllers). Consider storing the cast result in a variable to avoid repetition.
🔎 Proposed refactor
+	processingStrategy := pb.ProcessingStrategy(strategyValue)
+
 	slog.Info("Event handling strategy configured", "processingStrategy", params.ProcessingStrategy)

-	pub := publisher.New(pcClient, pb.ProcessingStrategy(strategyValue))
+	pub := publisher.New(pcClient, processingStrategy)

 	// ... later ...

-	if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, pb.ProcessingStrategy(strategyValue)); err != nil {
+	if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, processingStrategy); err != nil {

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82e7180 and 3e2fd8b.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (26)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

health-monitors/kubernetes-object-monitor/pkg/config/types.go
tests/helpers/healthevent.go
fault-quarantine/pkg/initializer/init.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
store-client/pkg/client/pipeline_builder.go
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
store-client/pkg/client/mongodb_pipeline_builder.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
store-client/pkg/client/pipeline_builder_test.go
health-monitors/kubernetes-object-monitor/main.go
tests/helpers/kube.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
tests/helpers/event_exporter.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/event_exporter_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
store-client/pkg/client/pipeline_builder_test.go

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml

data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

data-models/protobufs/health_event.proto

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py

🧠 Learnings (8)

📚 Learning: 2025-11-10T10:25:19.443Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

fault-quarantine/pkg/initializer/init.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/event_exporter_test.go
tests/kubernetes_object_monitor_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
tests/helpers/kube.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-04T06:31:02.147Z

Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory

Applied to files:

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

🧬 Code graph analysis (12)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/event_exporter_test.go (1)

tests/helpers/event_exporter.go (1)

ValidateCloudEvent (257-283)

event-exporter/pkg/transformer/cloudevents_test.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/kubernetes_object_monitor_test.go (4)

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

New (41-46)

health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)

Config (16-18)

tests/helpers/kube.go (9)

SetDeploymentArgs (2255-2288)

NVSentinelNamespace (64-64)

WaitForDeploymentRollout (960-1101)

SetNodeConditionStatus (1709-1770)

GetNodeByName (442-451)

NeverWaitTimeout (62-62)

WaitInterval (63-63)

EventuallyWaitTimeout (61-61)

RemoveDeploymentArgs (2341-2366)

commons/pkg/auditlogger/auditlogger.go (1)

Log (114-134)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

data-models/pkg/protos/health_event.pb.go (16)

HealthEvent (260-280)

HealthEvent (293-293)

HealthEvent (308-310)

Entity (208-214)

Entity (227-227)

Entity (242-244)

RecommendedAction (89-89)

RecommendedAction (139-141)

RecommendedAction (143-145)

RecommendedAction (152-154)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

store-client/pkg/client/mongodb_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)

data-models/pkg/protos/health_event_grpc.pb.go (1)

PlatformConnectorClient (43-45)

health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)

Policy (20-27)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (56-59)

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (56-59)

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

New (41-46)

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)

NewManager (43-45)

store-client/pkg/client/pipeline_builder_test.go (3)

store-client/pkg/client/pipeline_builder.go (1)

PipelineBuilder (26-47)

store-client/pkg/client/mongodb_pipeline_builder.go (1)

NewMongoDBPipelineBuilder (29-31)

store-client/pkg/client/postgresql_pipeline_builder.go (1)

NewPostgreSQLPipelineBuilder (29-31)

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_STORE_ONLY (47-47)

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

data-models/pkg/protos/health_event.pb.go (7)

HealthEvent (260-280)

HealthEvent (293-293)

HealthEvent (308-310)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

🔇 Additional comments (37)

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The test expectation correctly includes the new processingStrategy field with float64(0) (representing EXECUTE_REMEDIATION, the default enum value). This aligns with the proto changes where ProcessingStrategy defaults to 0.

data-models/protobufs/health_event.proto (2)

32-38: Well-designed enum with correct default.

Good proto3 design: EXECUTE_REMEDIATION = 0 ensures backward compatibility since unset fields default to 0 (the normal remediation behavior). The comments clearly document both strategies per coding guidelines.

77-77: Appropriate field placement.

Field number 16 continues the sequential numbering after drainOverrides = 15. This is a non-breaking addition that maintains wire compatibility with existing messages.

tests/helpers/kube.go (2)

2208-2249: LGTM!

Clean implementation of WaitForDaemonSetRollout that follows the same pattern as WaitForDeploymentRollout. The rollout completion checks (desired, updated, ready pods) are correct.

2290-2336: Approve helper implementation.

The setArgsOnContainer helper correctly handles the three argument styles (--flag=value, --flag, --flag value) with appropriate insertion and update logic. The break statements prevent index issues during slice modification.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-17: LGTM!

The ProcessingStrategy enum class and top-level constants are correctly defined, matching the protobuf definition.

Also applies to: 31-32

78-78: LGTM!

The HealthEvent message correctly includes the new processingStrategy field in __slots__, field number constant, attribute declaration, and __init__ signature. The type hint _Optional[_Union[ProcessingStrategy, str]] is appropriate for protobuf enum fields.

Also applies to: 104-104, 120-120, 138-138

tests/event_exporter_test.go (2)

25-25: LGTM!

The import addition aligns with the updated test helper function signature.

85-85: LGTM!

The test correctly validates the expected processing strategy value. The change properly extends the test to verify the new processingStrategy field in CloudEvents.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM!

The addition of processingStrategy to the CloudEvent data payload correctly uses the .String() method to serialize the enum value. The placement is consistent with other health event fields.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)

51-53: LGTM!

The conditional rendering of processingStrategy follows the same pattern as the errorCode field above (lines 48-50) and properly quotes the value for TOML format. The optional nature ensures backward compatibility when the field is not specified.

event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: LGTM!

Setting ProcessingStrategy_STORE_ONLY provides good test coverage for the non-default enum value.

106-108: LGTM!

The validation correctly verifies that the processingStrategy field is properly serialized as "STORE_ONLY" in the CloudEvent data. This confirms the .String() method works as expected.

store-client/pkg/client/mongodb_pipeline_builder.go (1)

19-19: LGTM!

The import is necessary to reference ProcessingStrategy_EXECUTE_REMEDIATION in the new pipeline.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml (1)

57-57: No action needed. The processingStrategy value has a default value of EXECUTE_REMEDIATION already defined in values.yaml at line 112.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM! Filtering STORE_ONLY events is the intended behavior.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly enables fault-quarantine to process only events with ProcessingStrategy = EXECUTE_REMEDIATION, excluding observability-only STORE_ONLY events. This aligns with the PR objectives for conditional event processing.

Producers are properly configured to set ProcessingStrategy based on policy definitions, with EXECUTE_REMEDIATION as the default. Observability-only events must be explicitly marked as STORE_ONLY via policy configuration, so critical events won't be accidentally filtered.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM! Test follows established patterns.

The new test for BuildProcessableHealthEventInsertsPipeline correctly mirrors the structure of existing pipeline tests, using table-driven approach across both MongoDB and PostgreSQL builders.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM! Clear interface addition with good documentation.

The new method is well-documented with clear use case and filtering behavior.

tests/helpers/healthevent.go (1)

48-48: LGTM! Builder pattern follows established conventions.

The ProcessingStrategy field and builder method follow the existing patterns in this test helper. Using int type provides flexibility for testing edge cases beyond the defined enum values.

Also applies to: 153-156

store-client/pkg/client/postgresql_pipeline_builder.go (1)

119-132: LGTM! Implementation correctly filters for EXECUTE_REMEDIATION strategy.

The pipeline matches insert operations where processingStrategy equals EXECUTE_REMEDIATION, consistent with the interface contract and MongoDB implementation.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1391-1589: Test logic validates STORE_ONLY behavior correctly, but consider migration to envtest.

The test effectively validates that STORE_ONLY events don't create node conditions or Kubernetes events, while EXECUTE_REMEDIATION events do. Test cases cover the key scenarios well.

However, the test uses fake.NewSimpleClientset() rather than envtest for Kubernetes controller testing. Based on learnings, envtest is preferred for testing Kubernetes controllers.

Consider migrating this test to use envtest in a future refactor to align with the coding guidelines. The fake client is acceptable for unit tests but envtest provides better integration testing.

tests/helpers/event_exporter.go (2)

220-254: LGTM! Useful helper for finding events by check name.

The new FindEventByNodeAndCheckName function provides a clean way to search for events by multiple criteria including health status.

261-261: LGTM! Validation extended to include processingStrategy.

The ValidateCloudEvent function correctly updated to validate the processingStrategy field in CloudEvent payloads. This ensures tests verify the processing strategy is properly propagated through the event pipeline.

Also applies to: 281-281

tests/kubernetes_object_monitor_test.go (4)

128-159: LGTM! Test setup correctly configures STORE_ONLY strategy.

The setup properly:

Identifies a non-KWOK test node

Applies deployment args to enable STORE_ONLY processing

Waits for deployment rollout before proceeding

161-188: LGTM! Correct use of require.Never for negative assertion.

The test correctly validates that STORE_ONLY strategy does not create node annotations by using require.Never, which asserts the condition never becomes true within the timeout period.

190-217: Clarify the purpose of the "Node Ready recovery clears annotation" assessment.

This assessment expects an annotation to be cleared when the node becomes Ready, but in STORE_ONLY mode, the annotation should never have been created in the first place (as verified by the previous assessment).

Is this assessment intended to verify that:

The annotation remains absent when the condition changes to True? (If so, require.Never would be more appropriate)

Some edge case where an annotation might exist from a previous non-STORE_ONLY run?

Consider whether this assessment should use require.Never instead of require.Eventually to verify the annotation continues to not exist, or add a comment explaining why Eventually is appropriate here.

219-230: LGTM! Teardown properly restores deployment state.

The teardown correctly removes the STORE_ONLY args and waits for the deployment to stabilize, ensuring test isolation.

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)

24-24: LGTM!

Import for the protobuf package is correctly added to support the new ProcessingStrategy type.

38-45: LGTM!

The Manager struct and constructor are correctly extended to accept and store the processingStrategy. The field naming follows Go conventions.

50-53: LGTM!

The early return for STORE_ONLY strategy correctly prevents annotation updates while maintaining debug logging for observability. This aligns with the PR objective to support different event handling strategies.

69-72: LGTM!

Consistent implementation with AddMatch - the STORE_ONLY guard correctly skips annotation removal with appropriate debug logging.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Generated protobuf code - no manual review required.

This file is auto-generated by the protocol buffer compiler as indicated by the header comments. Ensure this file is regenerated from the source .proto file rather than manually edited. The changes correctly reflect the addition of the ProcessingStrategy enum and processingStrategy field in the HealthEvent message.

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

36-45: LGTM!

The Publisher struct and constructor are correctly extended to store and initialize the default processingStrategy. The naming convention follows Go standards.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

372-416: LGTM on the overall flow.

The processHealthEvents function correctly:

Filters out STORE_ONLY events before processing

Updates node conditions only for processable healthy/fatal events

Creates Kubernetes events for non-healthy, non-fatal events

The separation of filtering, condition updates, and event creation is clean and maintainable.

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)

47-55: LGTM!

The Params struct is correctly extended with ProcessingStrategy field. This allows the strategy to be configured externally (likely via command-line flag or environment variable based on the AI summary).

163-171: LGTM!

The registerControllers function signature is correctly updated to accept processingStrategy, and it's properly passed to annotations.NewManager. This ensures the strategy is consistently applied throughout the controller initialization path.

83-91: Default value handling is already provided at the flag level in main.go.

The processingStrategyFlag has a default value of "EXECUTE_REMEDIATION" (line 73), which is passed to the initializer via *processingStrategyFlag. This ensures that under normal execution, params.ProcessingStrategy will never be empty. However, the validation at line 84-86 will still correctly reject invalid or empty values if the initializer is called directly with unvalidated input. Consider adding a comment to document this dependency on the caller providing a valid value.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go

store-client/pkg/client/mongodb_pipeline_builder.go

coderabbitai · 2025-12-22T16:04:33Z

tests/helpers/kube.go

+// RemoveDeploymentArgs removes environment variables from containers in a deployment.
+// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.


⚠️ Potential issue | 🟡 Minor

Fix incorrect docstring.

The docstring says "removes environment variables" but this function removes container arguments. This appears to be copy-paste from RemoveDeploymentEnvVars.

🔎 Proposed fix

-// RemoveDeploymentArgs removes environment variables from containers in a deployment. +// RemoveDeploymentArgs removes arguments from containers in a deployment. // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. // Uses retry.RetryOnConflict for automatic retry handling.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// RemoveDeploymentArgs removes environment variables from containers in a deployment.

// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.

// Uses retry.RetryOnConflict for automatic retry handling.

// RemoveDeploymentArgs removes arguments from containers in a deployment.

// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.

// Uses retry.RetryOnConflict for automatic retry handling.

🤖 Prompt for AI Agents

In tests/helpers/kube.go around lines 2338-2340, the docstring incorrectly states that the function "removes environment variables" while the implementation actually removes container arguments; update the comment to accurately describe that the function removes container arguments from a Deployment (and note behavior for empty containerName applies to all containers), and remove or replace the reference to RemoveDeploymentEnvVars to avoid copy-paste confusion; keep the mention of retry.RetryOnConflict and the behavior for the containerName parameter.

copy-pr-bot · 2025-12-23T04:49:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e2fd8b and f6e73ed.

📒 Files selected for processing (1)

tests/kubernetes_object_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/kubernetes_object_monitor_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/kubernetes_object_monitor_test.go

🧠 Learnings (2)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/kubernetes_object_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/kubernetes_object_monitor_test.go

🧬 Code graph analysis (1)

tests/kubernetes_object_monitor_test.go (4)

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

New (41-46)

tests/helpers/kube.go (7)

SetDeploymentArgs (2255-2288)

NVSentinelNamespace (64-64)

WaitForDeploymentRollout (960-1101)

SetNodeConditionStatus (1709-1770)

GetNodeByName (442-451)

EventuallyWaitTimeout (61-61)

WaitInterval (63-63)

tests/helpers/fault_quarantine.go (2)

AssertQuarantineState (315-382)

QuarantineAssertion (56-60)

tests/helpers/event_exporter.go (3)

GetMockEvents (36-99)

FindEventByNodeAndCheckName (221-254)

ValidateCloudEvent (257-283)

🔇 Additional comments (1)

tests/kubernetes_object_monitor_test.go (1)

24-24: LGTM!

The import is correctly added to support the helper functions used in the new test.

tests/kubernetes_object_monitor_test.go

coderabbitai · 2025-12-23T04:52:30Z

tests/kubernetes_object_monitor_test.go

+		t.Log("Waiting for policy match annotation to be cleared")
+		require.Eventually(t, func() bool {
+			node, err := helpers.GetNodeByName(ctx, client, nodeName)
+			if err != nil {
+				t.Logf("Failed to get node: %v", err)
+				return false
+			}
+
+			annotation, exists := node.Annotations[annotationKey]
+			if exists && annotation != "" {
+				t.Logf("Annotation still exists: %s", annotation)
+				return false
+			}
+
+			return true
+		}, helpers.EventuallyWaitTimeout, helpers.WaitInterval)


⚠️ Potential issue | 🟡 Minor

Unnecessary annotation check in teardown.

The test verifies that the annotation is never set when using STORE_ONLY strategy (lines 171-186). Therefore, waiting for the annotation to be cleared in teardown is unnecessary and adds confusion. Consider removing this check or adding a comment explaining why it's needed.

🔎 Suggested simplification

helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue) - t.Log("Waiting for policy match annotation to be cleared") - require.Eventually(t, func() bool { - node, err := helpers.GetNodeByName(ctx, client, nodeName) - if err != nil { - t.Logf("Failed to get node: %v", err) - return false - } - - annotation, exists := node.Annotations[annotationKey] - if exists && annotation != "" { - t.Logf("Annotation still exists: %s", annotation) - return false - } - - return true - }, helpers.EventuallyWaitTimeout, helpers.WaitInterval) - err = helpers.RemoveDeploymentArgs(ctx, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace, "", map[string]string{

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

t.Log("Waiting for policy match annotation to be cleared")

require.Eventually(t, func() bool {

node, err := helpers.GetNodeByName(ctx, client, nodeName)

if err != nil {

t.Logf("Failed to get node: %v", err)

return false

}

annotation, exists := node.Annotations[annotationKey]

if exists && annotation != "" {

t.Logf("Annotation still exists: %s", annotation)

return false

}

return true

}, helpers.EventuallyWaitTimeout, helpers.WaitInterval)

🤖 Prompt for AI Agents

In tests/kubernetes_object_monitor_test.go around lines 221 to 236, the teardown currently waits for a policy match annotation to be cleared even though earlier assertions ensure the annotation is never set for STORE_ONLY strategy; remove the entire Eventual check (the t.Log + require.Eventually block) to avoid redundant waiting, or if you prefer to keep it for defensive reasons add a short comment explaining why an explicit teardown check is required despite prior assertions (e.g., "defensive check: ensure no stray annotation was ever written during test").

tests/kubernetes_object_monitor_test.go

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

tests/kubernetes_object_monitor_test.go (1)
191-218: Redundant annotation clearing check.

This assess step waits for the annotation to be cleared, but the previous assess (lines 171-187) already verified that the annotation is never set when using the STORE_ONLY strategy. Waiting for an annotation to be cleared when it was never created is unnecessary and adds confusion to the test flow.

Consider removing this entire assess step, or if you want to keep it for defensive reasons, add a comment explaining why this check is necessary despite the earlier assertion that the annotation is never set.
🔎 Suggested simplification
-	feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
-		client, err := c.NewClient()
-		require.NoError(t, err)
-
-		nodeName := ctx.Value(k8sMonitorKeyNodeName).(string)
-		t.Logf("Setting TestCondition to True on node %s", nodeName)
-
-		helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
-
-		t.Log("Waiting for policy match annotation to be cleared")
-		require.Eventually(t, func() bool {
-			node, err := helpers.GetNodeByName(ctx, client, nodeName)
-			if err != nil {
-				t.Logf("Failed to get node: %v", err)
-				return false
-			}
-
-			annotation, exists := node.Annotations[annotationKey]
-			if exists && annotation != "" {
-				t.Logf("Annotation still exists: %s", annotation)
-				return false
-			}
-
-			return true
-		}, helpers.EventuallyWaitTimeout, helpers.WaitInterval)
-
-		return ctx
-	})
-

🧹 Nitpick comments (4)

tests/helpers/kube.go (1)
387-409: LGTM! Function correctly implements negative event assertion.

The logic properly uses require.Never to ensure the specified event type and reason do not appear on the node.
Optional: Consider improving log message clarity

At line 405, the log message could include both eventType and eventReason for consistency with the assertion message at line 408:
-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
tests/platform-connector_test.go (1)
28-32: Remove unused struct fields.

The ConfigMapBackup and TestNamespace fields are declared but never used anywhere in the test. Consider removing them to keep the code clean.
🔎 Proposed cleanup
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (2)
84-87: Consider case-insensitive processing strategy validation.

The validation is case-sensitive and requires an exact match with the proto enum names ("EXECUTE_REMEDIATION" or "STORE_ONLY"). This could lead to user confusion if they provide lowercase values like "store_only".

Consider normalizing the input with strings.ToUpper() before validation, or enhance the error message to list the valid values.
🔎 Suggested improvement
-	strategyValue, ok := pb.ProcessingStrategy_value[params.ProcessingStrategy]
+	strategyValue, ok := pb.ProcessingStrategy_value[strings.ToUpper(params.ProcessingStrategy)]
 	if !ok {
-		return nil, fmt.Errorf("unexpected processingStrategy value: %q", params.ProcessingStrategy)
+		return nil, fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", params.ProcessingStrategy)
 	}
54-54: Document the ProcessingStrategy field.

The new ProcessingStrategy field in the Params struct would benefit from a comment explaining its purpose and valid values, especially since it's part of a public API.
 	MaxConcurrentReconciles int
 	PlatformConnectorSocket string
+	// ProcessingStrategy determines how health events are processed.
+	// Valid values: "EXECUTE_REMEDIATION" (default behavior with remediation actions)
+	//               "STORE_ONLY" (events are stored but no remediation is triggered)
 	ProcessingStrategy      string
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6e73ed and 8fc966f.

📒 Files selected for processing (19)

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
docs/postgresql-schema.sql
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
tests/event_exporter_test.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go

🚧 Files skipped from review as they are similar to previous changes (7)

fault-quarantine/pkg/evaluator/rule_evaluator_test.go
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
health-monitors/kubernetes-object-monitor/main.go
health-monitors/kubernetes-object-monitor/pkg/config/types.go
tests/event_exporter_test.go
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go

🧰 Additional context used

📓 Path-based instructions (3)

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/helpers/healthevent.go
tests/fault_quarantine_test.go
tests/helpers/event_exporter.go
fault-quarantine/pkg/initializer/init.go
tests/kubernetes_object_monitor_test.go
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/helpers/kube.go
tests/platform-connector_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go

🧠 Learnings (8)

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
tests/helpers/event_exporter.go
tests/platform-connector_test.go
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml

📚 Learning: 2025-11-06T16:18:09.952Z

Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/platform-connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/fault_quarantine_test.go
tests/kubernetes_object_monitor_test.go
tests/helpers/kube.go
tests/platform-connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/fault_quarantine_test.go

📚 Learning: 2025-11-10T10:25:19.443Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

fault-quarantine/pkg/initializer/init.go

📚 Learning: 2025-11-04T06:31:02.147Z

Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
tests/platform-connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/platform-connector_test.go

🧬 Code graph analysis (5)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/fault_quarantine_test.go (4)

tests/helpers/fault_quarantine.go (4)

QuarantineTestContext (51-54)

SetupQuarantineTest (107-112)

AssertQuarantineState (315-382)

QuarantineAssertion (56-60)

tests/helpers/kube.go (1)

SetNodeManagedByNVSentinel (1389-1408)

tests/helpers/healthevent.go (3)

NewHealthEvent (60-76)

SendHealthEvent (263-275)

SendHealthyEvent (277-287)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

tests/kubernetes_object_monitor_test.go (1)

tests/helpers/kube.go (9)

SetDeploymentArgs (2279-2312)

NVSentinelNamespace (64-64)

WaitForDeploymentRollout (984-1125)

SetNodeConditionStatus (1733-1794)

GetNodeByName (466-475)

NeverWaitTimeout (62-62)

WaitInterval (63-63)

EventuallyWaitTimeout (61-61)

RemoveDeploymentArgs (2365-2390)

health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_value (56-59)

health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)

New (41-46)

health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)

NewManager (43-45)

tests/platform-connector_test.go (4)

tests/helpers/kube.go (5)

SelectTestNodeFromUnusedPool (413-437)

EnsureNodeConditionNotPresent (1797-1818)

EnsureNodeEventNotPresent (387-409)

CheckNodeConditionExists (1283-1300)

CheckNodeEventExists (1325-1353)

tests/helpers/healthevent.go (3)

NewHealthEvent (60-76)

SendHealthEvent (263-275)

SendHealthyEvent (277-287)

tests/helpers/health_events_analyzer.go (2)

ERRORCODE_79 (35-35)

ERRORCODE_31 (32-32)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

🔇 Additional comments (16)

tests/helpers/kube.go (3)

2232-2273: LGTM! DaemonSet rollout verification is correct.

The function properly checks all relevant status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to verify rollout completion, with appropriate logging at each stage.

2275-2360: LGTM! Deployment args manipulation is well-implemented.

The function correctly uses retry.RetryOnConflict without wrapping errors (as per coding guidelines), and setArgsOnContainer properly handles the three argument formats: --flag=value, --flag value, and --flag (boolean).

2362-2417: LGTM! Argument removal logic is correct.

The docstring correctly states "removes container arguments" and the implementation properly handles removal of arguments in different formats (--flag=value, --flag value, --flag). The function correctly uses retry.RetryOnConflict without wrapping errors.

Note: The past review comment about an incorrect docstring referencing environment variables appears to have been resolved in the current code.

tests/helpers/healthevent.go (2)

48-48: LGTM!

The ProcessingStrategy field is correctly defined with an appropriate type and JSON tag, consistent with other optional fields in the struct.

153-156: LGTM!

The builder method follows the established pattern and enables fluent chaining, consistent with all other With* methods in this helper.

tests/helpers/event_exporter.go (2)

220-254: LGTM!

The FindEventByNodeAndCheckName function correctly implements CloudEvent search by nodeName, checkName, and isHealthy status, following the same defensive pattern as the existing FindEventByNodeAndMessage function.

256-283: LGTM!

The ValidateCloudEvent function is correctly extended to validate the processingStrategy field, following the same assertion pattern as other health event fields.

distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)

219-222: LGTM!

The processing_strategy column is correctly added to the health_events table schema with an appropriate VARCHAR(50) type. The column is nullable, which aligns with the optional nature of this field in the proto definition and JSON serialization.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM!

Correctly switches to BuildProcessableHealthEventInsertsPipeline() to ensure fault-quarantine only processes EXECUTE_REMEDIATION events, filtering out STORE_ONLY observability events.

docs/postgresql-schema.sql (1)

106-109: LGTM!

The processing_strategy column is correctly added to the canonical PostgreSQL schema and matches the corresponding change in values-tilt-postgresql.yaml, maintaining consistency between the two schema definitions as required.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)

51-53: LGTM!

The processingStrategy field is correctly rendered in the ConfigMap template with proper conditional logic, indentation, and quoting, following the same pattern as the errorCode field.

tests/fault_quarantine_test.go (4)

26-26: LGTM!

The import of the protos package is necessary to reference the ProcessingStrategy enum constants used in the test.

233-250: LGTM!

The test setup correctly initializes the test context and marks the node as managed by NVSentinel, which is required for the quarantine behavior being tested.

252-286: LGTM!

The test assessments correctly verify that STORE_ONLY events do not trigger quarantine while EXECUTE_REMEDIATION events do, effectively validating the core behavior of the processing strategy feature.

288-294: LGTM!

The teardown properly cleans up by sending a healthy event and calling the standard teardown helper.

distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)

107-112: LGTM!

The processingStrategy field is well-documented with clear explanations of both modes and sensible defaults. The default value EXECUTE_REMEDIATION maintains backward compatibility while enabling the new observability-only mode when needed.

coderabbitai · 2025-12-25T12:49:09Z

tests/platform-connector_test.go

+		t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName)
+		helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")


⚠️ Potential issue | 🔴 Critical

Missing error handling and assertion on condition check.

CheckNodeConditionExists returns (*v1.NodeCondition, error) but both return values are ignored. This means:

Errors during the check are silently swallowed

The test doesn't actually verify that the condition exists—it just calls the helper and moves on

You should wrap this in a require.Eventually block to wait for the condition to appear, and check the error.

🔎 Proposed fix

t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName) - helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") + require.Eventually(t, func() bool { + condition, err := helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") + if err != nil { + t.Logf("Error checking condition: %v", err) + return false + } + return condition != nil + }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have condition SysLogsXIDError", testCtx.NodeName)

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2025-12-25T12:49:09Z

tests/platform-connector_test.go

+		t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)
+		helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")


⚠️ Potential issue | 🔴 Critical

Missing return value check on event existence verification.

CheckNodeEventExists returns (bool, *v1.Event) but both return values are ignored. The test doesn't verify that the event actually exists—it just calls the helper. You should check the boolean return value to assert the event is present.

🔎 Proposed fix

t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName) - helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") + require.Eventually(t, func() bool { + found, event := helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") + if !found { + t.Logf("Event not found yet") + return false + } + t.Logf("Found event: %v", event.Name) + return true + }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have event SysLogsXIDError", testCtx.NodeName)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)

helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")

t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)

require.Eventually(t, func() bool {

found, event := helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")

if !found {

t.Logf("Event not found yet")

return false

}

t.Logf("Found event: %v", event.Name)

return true

}, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have event SysLogsXIDError", testCtx.NodeName)

🤖 Prompt for AI Agents

In tests/platform-connector_test.go around lines 92-93, the call to helpers.CheckNodeEventExists ignores its (bool, *v1.Event) return values so the test doesn't assert the event actually exists; update the test to capture the boolean (and optionally the Event), and assert the boolean is true (using t.Fatalf or t.Errorf) with a helpful message if false, and only proceed (or log the event) when the returned bool indicates the event exists.

feat: add processing strategy field in health events

a44b5a6

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai bot reviewed Dec 22, 2025

View reviewed changes

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes

tanishagoyal2 added 4 commits December 23, 2025 21:54

feat: add e2e tests

69a01e2

Signed-off-by: Tanisha goyal <[email protected]>

feat: add event handling strategy in kubernetes monitor

f8eb0d5

Signed-off-by: Tanisha goyal <[email protected]>

feat: temp commit

cc2dc87

Signed-off-by: Tanisha goyal <[email protected]>

feat: review change

8fc966f

Signed-off-by: Tanisha goyal <[email protected]>

tanishagoyal2 force-pushed the 390-event-handling-in-kubernetes-monitor branch from f6e73ed to 8fc966f Compare December 25, 2025 12:44

coderabbitai bot reviewed Dec 25, 2025

View reviewed changes

		t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName)
		helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")

		t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)
		helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")

feat: add event handling strategy changes in kubernetes object monitor #639

Are you sure you want to change the base?

feat: add event handling strategy changes in kubernetes object monitor #639

Uh oh!

Conversation

tanishagoyal2 commented Dec 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of Change

Component(s) Affected

Testing

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

copy-pr-bot bot commented Dec 23, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tanishagoyal2 commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 22, 2025 •

edited

Loading