Skip to content

Conversation

@tanishagoyal2
Copy link
Contributor

@tanishagoyal2 tanishagoyal2 commented Dec 23, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

Release Notes

  • New Features
    • Added configurable processing strategy for health events: EXECUTE_REMEDIATION (default) performs reactive operations on health events, while STORE_ONLY only stores and exports events without making cluster changes. Users can now control the response behavior for detected health issues.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 23, 2025

Walkthrough

A new ProcessingStrategy enum is added to HealthEvent protobuf with two values: EXECUTE_REMEDIATION and STORE_ONLY. This strategy is propagated throughout the systemβ€”from health monitor CLI flags through trigger engine, platform connectors, event exporters, and storage pipelinesβ€”to control whether health events trigger actual remediation actions or only store and export events.

Changes

Cohort / File(s) Summary
Protobuf Core Definitions
data-models/protobufs/health_event.proto, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
Added ProcessingStrategy enum with EXECUTE_REMEDIATION and STORE_ONLY values; added processingStrategy field to HealthEvent message (field 16); generated Python stub classes and type hints
CSP Health Monitor Configuration
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml, distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml
Added processingStrategy Helm value (defaults to EXECUTE_REMEDIATION) and passed it as --processing-strategy argument to maintenance-notifier container
CSP Health Monitor CLI & Engine
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go, health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
Added processingStrategy CLI flag parsing and validation; updated NewEngine constructor signature to accept and store ProcessingStrategy; propagates strategy into emitted HealthEvents
Trigger Engine Tests
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
Updated all NewEngine calls to pass pb.ProcessingStrategy_EXECUTE_REMEDIATION argument
Platform Connector Event Processing
platform-connectors/pkg/connectors/kubernetes/process_node_events.go, platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Added filterProcessableEvents to skip STORE_ONLY events; added createK8sEvent helper; refactored processHealthEvents to operate on filtered events only; added TestProcessHealthEvents_StoreOnlyStrategy test validating no conditions/events created for STORE_ONLY strategy
Storage Pipeline Builders
store-client/pkg/client/pipeline_builder.go, store-client/pkg/client/mongodb_pipeline_builder.go, store-client/pkg/client/postgresql_pipeline_builder.go, store-client/pkg/client/pipeline_builder_test.go
Added BuildProcessableHealthEventInsertsPipeline interface method and implementations for MongoDB and PostgreSQL; filters inserts where processingStrategy = EXECUTE_REMEDIATION
Fault Quarantine Initialization
fault-quarantine/pkg/initializer/init.go, fault-quarantine/pkg/evaluator/rule_evaluator_test.go
Changed pipeline from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline(); updated test expectation to include processingStrategy field
Event Exporter
event-exporter/pkg/transformer/cloudevents.go, event-exporter/pkg/transformer/cloudevents_test.go
Added processingStrategy to CloudEvent healthEventData map; updated test to include and validate processingStrategy field
Integration Test Infrastructure
tests/helpers/healthevent.go, tests/helpers/event_exporter.go, tests/helpers/kube.go
Added ProcessingStrategy field and WithProcessingStrategy setter to HealthEventTemplate; added FindEventByNodeAndCheckName helper and extended ValidateCloudEvent to accept and validate expectedProcessingStrategy; added WaitForDaemonSetRollout, SetDeploymentArgs, RemoveDeploymentArgs helpers for deployment argument manipulation
Integration Tests
tests/csp_health_monitor_test.go, tests/event_exporter_test.go
Added TestCSPHealthMonitorStoreOnlyProcessingStrategy validating no cordon/drain under STORE_ONLY strategy; updated TestEventExporterChangeStream to validate EXECUTE_REMEDIATION in CloudEvent

Sequence Diagram(s)

sequenceDiagram
    participant HM as Health Monitor
    participant TE as Trigger Engine
    participant PC as Platform Connector
    participant EE as Event Exporter
    participant Store
    
    HM->>HM: Parse processingStrategy flag
    HM->>TE: NewEngine(processingStrategy)
    TE->>TE: Store processingStrategy
    Note over HM,TE: Event Creation Flow
    HM->>TE: Health check result
    TE->>TE: Create HealthEvent<br/>with processingStrategy
    TE->>PC: Send HealthEvent
    
    alt EXECUTE_REMEDIATION
        rect rgb(200, 230, 200)
            Note over PC: Processable Event Path
            PC->>PC: Filter event (processable)
            PC->>PC: Create node conditions
            PC->>PC: Create Kubernetes events
            PC->>PC: Update node state
        end
    else STORE_ONLY
        rect rgb(255, 200, 200)
            Note over PC: Store Only Path
            PC->>PC: Skip event processing
            PC->>PC: Log skipped event
        end
    end
    
    PC->>EE: HealthEvent
    EE->>EE: Transform to CloudEvent<br/>includeProcessingStrategy
    EE->>Store: Export CloudEvent
    EE->>Store: Persist event
    Store->>Store: Index by processingStrategy
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
βœ… Passed checks (2 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title accurately describes the main change: introducing event handling strategy changes to the csp-health-monitor, which aligns with the ProcessingStrategy field additions and strategy-based event filtering throughout the codebase.
✨ Finishing touches
  • πŸ“ Generate docstrings
πŸ§ͺ Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

πŸ”§ golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/helpers/kube.go (1)

2338-2393: Minor: Docstring mentions environment variables instead of arguments.

The function docstring says "removes environment variables" but this function removes container arguments.

Suggested fix for docstring
-// RemoveDeploymentArgs removes environment variables from containers in a deployment.
+// RemoveDeploymentArgs removes arguments from containers in a deployment.
 // If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
 // Uses retry.RetryOnConflict for automatic retry handling.
tests/csp_health_monitor_test.go (1)

522-535: Consider using the return value of TeardownCSPHealthMonitorTest.

The teardown correctly restores the processing strategy, but TeardownCSPHealthMonitorTest returns a context that is discarded. Other tests in this file return the result of this call. While this may be intentional, it should be consistent with other teardown patterns.

-		helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
-
-		return ctx
+		return helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 82e7180 and 61f34e7.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (23)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/csp_health_monitor_test.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
🧰 Additional context used
πŸ““ Path-based instructions (5)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/healthevent.go
  • store-client/pkg/client/pipeline_builder_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • tests/event_exporter_test.go
  • store-client/pkg/client/pipeline_builder.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/helpers/kube.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • event-exporter/pkg/transformer/cloudevents.go
  • health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
  • tests/csp_health_monitor_test.go
  • fault-quarantine/pkg/initializer/init.go
  • tests/helpers/event_exporter.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • store-client/pkg/client/pipeline_builder_test.go
  • tests/event_exporter_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/csp_health_monitor_test.go
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (7)
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • store-client/pkg/client/pipeline_builder_test.go
  • tests/event_exporter_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • tests/csp_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/event_exporter_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
πŸ“š Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

  • distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
🧬 Code graph analysis (10)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (19)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • Entity (208-214)
  • Entity (227-227)
  • Entity (242-244)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
  • NewK8sConnector (47-58)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
  • NewEngine (70-86)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event_grpc.pb.go (1)
  • PlatformConnectorClient (43-45)
data-models/pkg/protos/health_event.pb.go (8)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
  • NewEngine (70-86)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
πŸ”‡ Additional comments (31)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The addition of processingStrategy with float64(0) correctly reflects the default protobuf enum value (EXECUTE_REMEDIATION = 0) after JSON serialization, which represents numbers as float64.

tests/helpers/healthevent.go (2)

48-48: LGTM!

The new ProcessingStrategy field follows the same pattern as RecommendedAction (also an int with omitempty), maintaining consistency in the test helper struct.


153-156: LGTM!

The fluent setter follows the established builder pattern used throughout this file.

distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)

52-57: Well documented configuration option.

The inline comments clearly explain the valid values and their behavioral implications, following the Helm chart documentation guidelines. The default EXECUTE_REMEDIATION maintains backward compatibility.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)

325-343: LGTM - Clean filtering implementation.

The filterProcessableEvents function correctly filters out STORE_ONLY events with appropriate logging for observability. The function is well-structured and follows Go conventions.


345-370: Good refactor - centralizes K8s event creation.

The createK8sEvent helper consolidates event construction logic that was previously inline, improving maintainability and reducing code duplication.


372-416: LGTM - Correctly updated to use filtered events.

The processHealthEvents function properly filters events before processing and uses the filtered slice (processableEvents) consistently for both node condition updates and K8s event creation.

tests/helpers/kube.go (2)

2207-2249: LGTM - Well-implemented DaemonSet rollout helper.

The function follows the same pattern as WaitForDeploymentRollout, with appropriate status checks and logging. It correctly validates that all pods are scheduled, updated, and ready.


2251-2336: LGTM - Comprehensive argument handling.

The SetDeploymentArgs and setArgsOnContainer functions properly handle multiple argument formats (--flag=value, --flag value, and boolean flags). The retry-on-conflict pattern is correctly applied.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM!

The addition of processingStrategy follows the same pattern as recommendedAction (line 61), using the protobuf enum's .String() method for consistent serialization in the CloudEvent payload.

distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)

161-161: LGTM!

The new --processing-strategy argument correctly references the Helm value defined in values.yaml, maintaining consistency with other configuration options passed to the maintenance-notifier container.

Consider whether invalid values should fail at Helm rendering time rather than runtime. You could add a validation check in the template:

{{- if not (or (eq .Values.processingStrategy "EXECUTE_REMEDIATION") (eq .Values.processingStrategy "STORE_ONLY")) }}
{{- fail "processingStrategy must be either EXECUTE_REMEDIATION or STORE_ONLY" }}
{{- end }}

This would catch typos during helm install/upgrade rather than at container startup.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM - Correct pipeline selection for fault-quarantine.

Switching to BuildProcessableHealthEventInsertsPipeline() ensures the fault-quarantine module only processes health events with EXECUTE_REMEDIATION strategy, filtering out STORE_ONLY events at the data layer. This aligns with the PR objective of allowing observability-only event handling.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM!

The test follows the established table-driven pattern and validates the new pipeline method across both MongoDB and PostgreSQL builders. The assertions are appropriate for verifying pipeline structure.

tests/event_exporter_test.go (1)

85-85: LGTM!

The test correctly validates the new processingStrategy field with the expected value "EXECUTE_REMEDIATION".

tests/helpers/event_exporter.go (2)

220-254: LGTM!

The new helper function follows the established pattern of FindEventByNodeAndMessage and provides a useful search capability for CloudEvents by node name, check name, and health status.


256-283: LGTM!

The extended validation correctly asserts the new processingStrategy field in CloudEvents, maintaining consistency with the protobuf schema changes.

store-client/pkg/client/postgresql_pipeline_builder.go (1)

119-132: LGTM!

The pipeline correctly filters for INSERT operations with processingStrategy=EXECUTE_REMEDIATION. The implementation aligns with the existing pipeline patterns in the file.

event-exporter/pkg/transformer/cloudevents_test.go (1)

69-69: LGTM!

The test correctly validates that the ProcessingStrategy field is properly serialized to CloudEvents. The use of t.Errorf for assertions aligns with the coding guidelines for simple equality checks.

Also applies to: 106-108

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM!

The new interface method is well-documented with clear usage guidance. The godoc comment effectively explains the filtering behavior for processing-strategy aware pipelines.

data-models/protobufs/health_event.proto (2)

32-38: LGTM!

The ProcessingStrategy enum is well-documented with clear semantic definitions. Using EXECUTE_REMEDIATION = 0 as the default value ensures backward compatibilityβ€”existing events without this field will automatically have execution behavior, which is the expected default.


77-77: LGTM!

The new field follows protobuf best practices with sequential field numbering and uses the appropriate enum type.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (1)

193-193: LGTM!

All test cases correctly updated to pass pb.ProcessingStrategy_EXECUTE_REMEDIATION to the NewEngine constructor, maintaining consistency with the new signature.

Also applies to: 207-207, 612-612, 793-793, 839-839

health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)

61-61: LGTM! CLI flag and configuration integration looks correct.

The new --processing-strategy flag with default EXECUTE_REMEDIATION follows the existing flag patterns in this file. The help text clearly documents the valid options.

Also applies to: 75-76


219-227: Validation logic is correct and idiomatic.

Using pb.ProcessingStrategy_value map for validation ensures only valid protobuf enum values are accepted. The error message is clear about expected values. The cast to pb.ProcessingStrategy(value) is safe since value was validated.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

1391-1506: Comprehensive test coverage for STORE_ONLY processing strategy.

The test cases thoroughly cover the expected behavior:

  • STORE_ONLY events should not create node conditions or K8s events
  • EXECUTE_REMEDIATION events should create appropriate conditions
  • Mixed batches correctly process only EXECUTE_REMEDIATION events

The table-driven approach with descriptive names follows Go testing best practices.


1550-1562: Good approach to isolating NVSentinel-specific conditions.

The filtering logic correctly excludes all standard Kubernetes node condition types before counting. This ensures the test validates only conditions created by the health event processing.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Auto-generated protobuf file - no manual review needed.

This file is generated by the protocol buffer compiler (as noted in line 2). The changes reflect the addition of the ProcessingStrategy enum to the proto schema. Ensure this was regenerated from the updated .proto source rather than manually edited.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)

58-67: Engine struct correctly extended with processingStrategy field.

The new field follows the existing struct field patterns. This design means the processing strategy is configured at the engine level (via CLI flag) and applies uniformly to all emitted health events.


69-86: Constructor updated correctly with new parameter.

The processingStrategy parameter is properly added and stored. This is a breaking API change - callers must now provide the strategy. Per the AI summary, tests have been updated accordingly.


348-366: ProcessingStrategy correctly propagated to health events.

The ProcessingStrategy field is populated from the engine's configured value, ensuring all health events from this engine instance carry the same strategy. This aligns with the CLI-based configuration approach.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-18: Auto-generated Python stub file correctly reflects ProcessingStrategy additions.

The .pyi stub file properly exposes:

  • ProcessingStrategy enum class with EXECUTE_REMEDIATION and STORE_ONLY values
  • Module-level enum constants
  • processingStrategy field in HealthEvent with correct typing

This is generated code that mirrors the proto schema changes.

Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138

Comment on lines +87 to +100
// BuildProcessableHealthEventInsertsPipeline creates a pipeline that watches for
// all EXECUTE_REMEDIATION health event inserts.
func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline {
return datastore.ToPipeline(
datastore.D(
datastore.E("$match", datastore.D(
datastore.E("operationType", datastore.D(
datastore.E("$in", datastore.A("insert")),
)),
datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)),
)),
),
)
}
Copy link
Contributor

@coderabbitai coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | πŸ”΄ Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Examine the MongoDB pipeline builder file
cd store-client/pkg/client 2>/dev/null && wc -l mongodb_pipeline_builder.go && head -95 mongodb_pipeline_builder.go | tail -20

Repository: NVIDIA/NVSentinel

Length of output: 710


🏁 Script executed:

#!/bin/bash
# Find the MongoDB pipeline builder file and examine it
find . -name "mongodb_pipeline_builder.go" -type f 2>/dev/null | head -5

Repository: NVIDIA/NVSentinel

Length of output: 114


🏁 Script executed:

#!/bin/bash
# Search for field name patterns (processingstrategy vs processingStrategy)
rg "processingstrategy|processingStrategy" --type=go -i | head -20

Repository: NVIDIA/NVSentinel

Length of output: 1998


🏁 Script executed:

#!/bin/bash
# Look for the file and check for godoc comments on functions
find . -name "mongodb_pipeline_builder.go" 2>/dev/null | xargs cat | head -110 | tail -30

Repository: NVIDIA/NVSentinel

Length of output: 1231


🏁 Script executed:

#!/bin/bash
# Find and examine the proto file for HealthEvent
find . -name "*.proto" -type f 2>/dev/null | xargs grep -l "HealthEvent" | head -3

Repository: NVIDIA/NVSentinel

Length of output: 103


🏁 Script executed:

#!/bin/bash
# Look for processing_strategy or processingStrategy in proto files
rg "processing_strategy|processingStrategy" --type=proto | head -20

Repository: NVIDIA/NVSentinel

Length of output: 90


🏁 Script executed:

#!/bin/bash
# Check how similar fields are used in other pipeline methods
rg "fullDocument\.healthevent\." store-client/pkg/client/mongodb_pipeline_builder.go

Repository: NVIDIA/NVSentinel

Length of output: 350


🏁 Script executed:

#!/bin/bash
# Examine the proto file to see the actual field definitions
cat ./data-models/protobufs/health_event.proto | head -100

Repository: NVIDIA/NVSentinel

Length of output: 2321


🏁 Script executed:

#!/bin/bash
# Look for the HealthEvent message definition specifically
rg "message HealthEvent|processing_strategy|processingStrategy|ishealthy" ./data-models/protobufs/health_event.proto -A 30

Repository: NVIDIA/NVSentinel

Length of output: 1392


Fix field path casing in filter: should be processingStrategy not processingstrategy

The MongoDB pipeline filter uses lowercase fullDocument.healthevent.processingstrategy, but the proto field is defined as processingStrategy (camelCase). MongoDB BSON documents store protobuf fields using their proto field names, so the filter will silently fail to match any documents. Update to fullDocument.healthevent.processingStrategy to match the actual field name.

πŸ€– Prompt for AI Agents
In store-client/pkg/client/mongodb_pipeline_builder.go around lines 87 to 100,
the MongoDB match filter uses the incorrect lowercased field path
"fullDocument.healthevent.processingstrategy"; change that path to the correct
proto field name "fullDocument.healthevent.processingStrategy" so the $match
will actually find documents with ProcessingStrategy_EXECUTE_REMEDIATION,
updating the string literal used in the datastore.E call accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the db schema is following all fields in lowercase. so we are safe here to use processingstrategy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanishagoyal2, thank you for the clarification! You're absolutely correct. I have this documented in my learnings from PR #609 where you explained that protobuf fields in NVSentinel are stored in MongoDB using lowercase naming convention (e.g., processingstrategy, componentclass, checkname), not the standard protobuf JSON camelCase convention.

I apologize for the confusion in my original comment. The lowercase processingstrategy is indeed the correct field name to use in the MongoDB pipeline filter. The code is working as intended.


🧠 Learnings used
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:31.444Z
Learning: In the NVSentinel codebase, protobuf fields are stored in MongoDB using lowercase naming convention (e.g., `processingstrategy`, `componentclass`, `checkname`), not the standard protobuf JSON camelCase convention. Pipeline filters and queries should use lowercase field names when accessing protobuf fields in the database.

Comment on lines +463 to +464
t.Logf("verify if pod has correct strategy configured")
time.Sleep(20 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟑 Minor

Remove debug sleep and log.

The time.Sleep(20 * time.Second) with the preceding log message appears to be leftover debugging code. This adds 20 seconds to every test run without clear purpose.

πŸ”Ž Suggested fix
-		t.Logf("verify if pod has correct strategy configured")
-		time.Sleep(20 * time.Second)
-
 		scheduledStart := time.Now().Add(15 * time.Minute)
πŸ€– Prompt for AI Agents
In tests/csp_health_monitor_test.go around lines 463 to 464, remove the leftover
debug log and the time.Sleep(20 * time.Second) call which unnecessarily delays
every test; replace with either a deterministic wait (e.g., polling with a
timeout) or remove entirely if not needed, ensuring the test instead waits for
the expected condition via existing sync/check utilities or a short
timeout-based loop to avoid adding a fixed 20s delay.

Comment on lines +503 to +517
t.Log("Waiting for event to be exported via changestream")
var receivedEvent map[string]any
require.Eventually(t, func() bool {
events := helpers.GetMockEvents(t, c)
event, found := helpers.FindEventByNodeAndCheckName(events, testCtx.NodeName, "CSPMaintenance", false)
if found {
receivedEvent = event
return true
}

t.Log("Validating received CloudEvent")

helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY")
return false
}, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "event should be exported via changestream")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: CloudEvent validation is unreachable.

ValidateCloudEvent is called after the if found { ... return true } block, so it will never execute when an event is found. The validation logic is in the "not found" code path, which defeats its purpose.

πŸ”Ž Proposed fix
 		t.Log("Waiting for event to be exported via changestream")
 		var receivedEvent map[string]any
 		require.Eventually(t, func() bool {
 			events := helpers.GetMockEvents(t, c)
 			event, found := helpers.FindEventByNodeAndCheckName(events, testCtx.NodeName, "CSPMaintenance", false)
 			if found {
 				receivedEvent = event
+				t.Log("Validating received CloudEvent")
+				helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY")
 				return true
 			}
-
-			t.Log("Validating received CloudEvent")
-
-			helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY")
 			return false
 		}, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "event should be exported via changestream")
πŸ€– Prompt for AI Agents
In tests/csp_health_monitor_test.go around lines 503 to 517, the call to
helpers.ValidateCloudEvent is placed after the branch that returns true when an
event is found, making validation unreachable; to fix, move the
ValidateCloudEvent call into the if found { ... } block right after assigning
receivedEvent (before returning true), so the test validates the CloudEvent when
found (and keep the existing t.Log lines appropriately before/after validation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant