-
Notifications
You must be signed in to change notification settings - Fork 32
feat: add event handling strategy changes in csp-health-monitor #641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in csp-health-monitor #641
Conversation
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
WalkthroughA new Changes
Sequence Diagram(s)sequenceDiagram
participant HM as Health Monitor
participant TE as Trigger Engine
participant PC as Platform Connector
participant EE as Event Exporter
participant Store
HM->>HM: Parse processingStrategy flag
HM->>TE: NewEngine(processingStrategy)
TE->>TE: Store processingStrategy
Note over HM,TE: Event Creation Flow
HM->>TE: Health check result
TE->>TE: Create HealthEvent<br/>with processingStrategy
TE->>PC: Send HealthEvent
alt EXECUTE_REMEDIATION
rect rgb(200, 230, 200)
Note over PC: Processable Event Path
PC->>PC: Filter event (processable)
PC->>PC: Create node conditions
PC->>PC: Create Kubernetes events
PC->>PC: Update node state
end
else STORE_ONLY
rect rgb(255, 200, 200)
Note over PC: Store Only Path
PC->>PC: Skip event processing
PC->>PC: Log skipped event
end
end
PC->>EE: HealthEvent
EE->>EE: Transform to CloudEvent<br/>includeProcessingStrategy
EE->>Store: Export CloudEvent
EE->>Store: Persist event
Store->>Store: Index by processingStrategy
Estimated code review effortπ― 3 (Moderate) | β±οΈ ~20 minutes Pre-merge checks and finishing touchesβ Failed checks (1 warning)
β Passed checks (2 passed)
β¨ Finishing touches
π§ͺ Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π§Ή Nitpick comments (2)
tests/helpers/kube.go (1)
2338-2393: Minor: Docstring mentions environment variables instead of arguments.The function docstring says "removes environment variables" but this function removes container arguments.
Suggested fix for docstring
-// RemoveDeploymentArgs removes environment variables from containers in a deployment. +// RemoveDeploymentArgs removes arguments from containers in a deployment. // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. // Uses retry.RetryOnConflict for automatic retry handling.tests/csp_health_monitor_test.go (1)
522-535: Consider using the return value of TeardownCSPHealthMonitorTest.The teardown correctly restores the processing strategy, but
TeardownCSPHealthMonitorTestreturns a context that is discarded. Other tests in this file return the result of this call. While this may be intentional, it should be consistent with other teardown patterns.- helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx) - - return ctx + return helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (23)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyiplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/csp_health_monitor_test.gotests/event_exporter_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (5)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/healthevent.gostore-client/pkg/client/pipeline_builder_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gotests/event_exporter_test.gostore-client/pkg/client/pipeline_builder.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/helpers/kube.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.goevent-exporter/pkg/transformer/cloudevents.gohealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.gotests/csp_health_monitor_test.gofault-quarantine/pkg/initializer/init.gotests/helpers/event_exporter.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.gotests/event_exporter_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gotests/csp_health_monitor_test.go
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (7)
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
store-client/pkg/client/pipeline_builder_test.gotests/event_exporter_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gotests/csp_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
𧬠Code graph analysis (10)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (19)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)Entity(208-214)Entity(227-227)Entity(242-244)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
NewEngine(70-86)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event_grpc.pb.go (1)
PlatformConnectorClient(43-45)data-models/pkg/protos/health_event.pb.go (8)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
NewEngine(70-86)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
π Additional comments (31)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM!The addition of
processingStrategywithfloat64(0)correctly reflects the default protobuf enum value (EXECUTE_REMEDIATION = 0) after JSON serialization, which represents numbers as float64.tests/helpers/healthevent.go (2)
48-48: LGTM!The new
ProcessingStrategyfield follows the same pattern asRecommendedAction(also an int with omitempty), maintaining consistency in the test helper struct.
153-156: LGTM!The fluent setter follows the established builder pattern used throughout this file.
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)
52-57: Well documented configuration option.The inline comments clearly explain the valid values and their behavioral implications, following the Helm chart documentation guidelines. The default
EXECUTE_REMEDIATIONmaintains backward compatibility.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)
325-343: LGTM - Clean filtering implementation.The
filterProcessableEventsfunction correctly filters outSTORE_ONLYevents with appropriate logging for observability. The function is well-structured and follows Go conventions.
345-370: Good refactor - centralizes K8s event creation.The
createK8sEventhelper consolidates event construction logic that was previously inline, improving maintainability and reducing code duplication.
372-416: LGTM - Correctly updated to use filtered events.The
processHealthEventsfunction properly filters events before processing and uses the filtered slice (processableEvents) consistently for both node condition updates and K8s event creation.tests/helpers/kube.go (2)
2207-2249: LGTM - Well-implemented DaemonSet rollout helper.The function follows the same pattern as
WaitForDeploymentRollout, with appropriate status checks and logging. It correctly validates that all pods are scheduled, updated, and ready.
2251-2336: LGTM - Comprehensive argument handling.The
SetDeploymentArgsandsetArgsOnContainerfunctions properly handle multiple argument formats (--flag=value,--flag value, and boolean flags). The retry-on-conflict pattern is correctly applied.event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM!The addition of
processingStrategyfollows the same pattern asrecommendedAction(line 61), using the protobuf enum's.String()method for consistent serialization in the CloudEvent payload.distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)
161-161: LGTM!The new
--processing-strategyargument correctly references the Helm value defined invalues.yaml, maintaining consistency with other configuration options passed to the maintenance-notifier container.Consider whether invalid values should fail at Helm rendering time rather than runtime. You could add a validation check in the template:
{{- if not (or (eq .Values.processingStrategy "EXECUTE_REMEDIATION") (eq .Values.processingStrategy "STORE_ONLY")) }} {{- fail "processingStrategy must be either EXECUTE_REMEDIATION or STORE_ONLY" }} {{- end }}This would catch typos during
helm install/upgraderather than at container startup.fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM - Correct pipeline selection for fault-quarantine.Switching to
BuildProcessableHealthEventInsertsPipeline()ensures the fault-quarantine module only processes health events withEXECUTE_REMEDIATIONstrategy, filtering outSTORE_ONLYevents at the data layer. This aligns with the PR objective of allowing observability-only event handling.store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM!The test follows the established table-driven pattern and validates the new pipeline method across both MongoDB and PostgreSQL builders. The assertions are appropriate for verifying pipeline structure.
tests/event_exporter_test.go (1)
85-85: LGTM!The test correctly validates the new
processingStrategyfield with the expected value "EXECUTE_REMEDIATION".tests/helpers/event_exporter.go (2)
220-254: LGTM!The new helper function follows the established pattern of
FindEventByNodeAndMessageand provides a useful search capability for CloudEvents by node name, check name, and health status.
256-283: LGTM!The extended validation correctly asserts the new
processingStrategyfield in CloudEvents, maintaining consistency with the protobuf schema changes.store-client/pkg/client/postgresql_pipeline_builder.go (1)
119-132: LGTM!The pipeline correctly filters for INSERT operations with
processingStrategy=EXECUTE_REMEDIATION. The implementation aligns with the existing pipeline patterns in the file.event-exporter/pkg/transformer/cloudevents_test.go (1)
69-69: LGTM!The test correctly validates that the
ProcessingStrategyfield is properly serialized to CloudEvents. The use oft.Errorffor assertions aligns with the coding guidelines for simple equality checks.Also applies to: 106-108
store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM!The new interface method is well-documented with clear usage guidance. The godoc comment effectively explains the filtering behavior for processing-strategy aware pipelines.
data-models/protobufs/health_event.proto (2)
32-38: LGTM!The
ProcessingStrategyenum is well-documented with clear semantic definitions. UsingEXECUTE_REMEDIATION = 0as the default value ensures backward compatibilityβexisting events without this field will automatically have execution behavior, which is the expected default.
77-77: LGTM!The new field follows protobuf best practices with sequential field numbering and uses the appropriate enum type.
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (1)
193-193: LGTM!All test cases correctly updated to pass
pb.ProcessingStrategy_EXECUTE_REMEDIATIONto theNewEngineconstructor, maintaining consistency with the new signature.Also applies to: 207-207, 612-612, 793-793, 839-839
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
61-61: LGTM! CLI flag and configuration integration looks correct.The new
--processing-strategyflag with defaultEXECUTE_REMEDIATIONfollows the existing flag patterns in this file. The help text clearly documents the valid options.Also applies to: 75-76
219-227: Validation logic is correct and idiomatic.Using
pb.ProcessingStrategy_valuemap for validation ensures only valid protobuf enum values are accepted. The error message is clear about expected values. The cast topb.ProcessingStrategy(value)is safe sincevaluewas validated.platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
1391-1506: Comprehensive test coverage for STORE_ONLY processing strategy.The test cases thoroughly cover the expected behavior:
- STORE_ONLY events should not create node conditions or K8s events
- EXECUTE_REMEDIATION events should create appropriate conditions
- Mixed batches correctly process only EXECUTE_REMEDIATION events
The table-driven approach with descriptive names follows Go testing best practices.
1550-1562: Good approach to isolating NVSentinel-specific conditions.The filtering logic correctly excludes all standard Kubernetes node condition types before counting. This ensures the test validates only conditions created by the health event processing.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Auto-generated protobuf file - no manual review needed.This file is generated by the protocol buffer compiler (as noted in line 2). The changes reflect the addition of the
ProcessingStrategyenum to the proto schema. Ensure this was regenerated from the updated.protosource rather than manually edited.health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
58-67: Engine struct correctly extended with processingStrategy field.The new field follows the existing struct field patterns. This design means the processing strategy is configured at the engine level (via CLI flag) and applies uniformly to all emitted health events.
69-86: Constructor updated correctly with new parameter.The
processingStrategyparameter is properly added and stored. This is a breaking API change - callers must now provide the strategy. Per the AI summary, tests have been updated accordingly.
348-366: ProcessingStrategy correctly propagated to health events.The
ProcessingStrategyfield is populated from the engine's configured value, ensuring all health events from this engine instance carry the same strategy. This aligns with the CLI-based configuration approach.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-18: Auto-generated Python stub file correctly reflects ProcessingStrategy additions.The
.pyistub file properly exposes:
ProcessingStrategyenum class withEXECUTE_REMEDIATIONandSTORE_ONLYvalues- Module-level enum constants
processingStrategyfield inHealthEventwith correct typingThis is generated code that mirrors the proto schema changes.
Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138
| // BuildProcessableHealthEventInsertsPipeline creates a pipeline that watches for | ||
| // all EXECUTE_REMEDIATION health event inserts. | ||
| func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline { | ||
| return datastore.ToPipeline( | ||
| datastore.D( | ||
| datastore.E("$match", datastore.D( | ||
| datastore.E("operationType", datastore.D( | ||
| datastore.E("$in", datastore.A("insert")), | ||
| )), | ||
| datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)), | ||
| )), | ||
| ), | ||
| ) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π§© Analysis chain
π Script executed:
#!/bin/bash
# Examine the MongoDB pipeline builder file
cd store-client/pkg/client 2>/dev/null && wc -l mongodb_pipeline_builder.go && head -95 mongodb_pipeline_builder.go | tail -20Repository: NVIDIA/NVSentinel
Length of output: 710
π Script executed:
#!/bin/bash
# Find the MongoDB pipeline builder file and examine it
find . -name "mongodb_pipeline_builder.go" -type f 2>/dev/null | head -5Repository: NVIDIA/NVSentinel
Length of output: 114
π Script executed:
#!/bin/bash
# Search for field name patterns (processingstrategy vs processingStrategy)
rg "processingstrategy|processingStrategy" --type=go -i | head -20Repository: NVIDIA/NVSentinel
Length of output: 1998
π Script executed:
#!/bin/bash
# Look for the file and check for godoc comments on functions
find . -name "mongodb_pipeline_builder.go" 2>/dev/null | xargs cat | head -110 | tail -30Repository: NVIDIA/NVSentinel
Length of output: 1231
π Script executed:
#!/bin/bash
# Find and examine the proto file for HealthEvent
find . -name "*.proto" -type f 2>/dev/null | xargs grep -l "HealthEvent" | head -3Repository: NVIDIA/NVSentinel
Length of output: 103
π Script executed:
#!/bin/bash
# Look for processing_strategy or processingStrategy in proto files
rg "processing_strategy|processingStrategy" --type=proto | head -20Repository: NVIDIA/NVSentinel
Length of output: 90
π Script executed:
#!/bin/bash
# Check how similar fields are used in other pipeline methods
rg "fullDocument\.healthevent\." store-client/pkg/client/mongodb_pipeline_builder.goRepository: NVIDIA/NVSentinel
Length of output: 350
π Script executed:
#!/bin/bash
# Examine the proto file to see the actual field definitions
cat ./data-models/protobufs/health_event.proto | head -100Repository: NVIDIA/NVSentinel
Length of output: 2321
π Script executed:
#!/bin/bash
# Look for the HealthEvent message definition specifically
rg "message HealthEvent|processing_strategy|processingStrategy|ishealthy" ./data-models/protobufs/health_event.proto -A 30Repository: NVIDIA/NVSentinel
Length of output: 1392
Fix field path casing in filter: should be processingStrategy not processingstrategy
The MongoDB pipeline filter uses lowercase fullDocument.healthevent.processingstrategy, but the proto field is defined as processingStrategy (camelCase). MongoDB BSON documents store protobuf fields using their proto field names, so the filter will silently fail to match any documents. Update to fullDocument.healthevent.processingStrategy to match the actual field name.
π€ Prompt for AI Agents
In store-client/pkg/client/mongodb_pipeline_builder.go around lines 87 to 100,
the MongoDB match filter uses the incorrect lowercased field path
"fullDocument.healthevent.processingstrategy"; change that path to the correct
proto field name "fullDocument.healthevent.processingStrategy" so the $match
will actually find documents with ProcessingStrategy_EXECUTE_REMEDIATION,
updating the string literal used in the datastore.E call accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the db schema is following all fields in lowercase. so we are safe here to use processingstrategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tanishagoyal2, thank you for the clarification! You're absolutely correct. I have this documented in my learnings from PR #609 where you explained that protobuf fields in NVSentinel are stored in MongoDB using lowercase naming convention (e.g., processingstrategy, componentclass, checkname), not the standard protobuf JSON camelCase convention.
I apologize for the confusion in my original comment. The lowercase processingstrategy is indeed the correct field name to use in the MongoDB pipeline filter. The code is working as intended.
π§ Learnings used
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:31.444Z
Learning: In the NVSentinel codebase, protobuf fields are stored in MongoDB using lowercase naming convention (e.g., `processingstrategy`, `componentclass`, `checkname`), not the standard protobuf JSON camelCase convention. Pipeline filters and queries should use lowercase field names when accessing protobuf fields in the database.
| t.Logf("verify if pod has correct strategy configured") | ||
| time.Sleep(20 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove debug sleep and log.
The time.Sleep(20 * time.Second) with the preceding log message appears to be leftover debugging code. This adds 20 seconds to every test run without clear purpose.
π Suggested fix
- t.Logf("verify if pod has correct strategy configured")
- time.Sleep(20 * time.Second)
-
scheduledStart := time.Now().Add(15 * time.Minute)π€ Prompt for AI Agents
In tests/csp_health_monitor_test.go around lines 463 to 464, remove the leftover
debug log and the time.Sleep(20 * time.Second) call which unnecessarily delays
every test; replace with either a deterministic wait (e.g., polling with a
timeout) or remove entirely if not needed, ensuring the test instead waits for
the expected condition via existing sync/check utilities or a short
timeout-based loop to avoid adding a fixed 20s delay.
| t.Log("Waiting for event to be exported via changestream") | ||
| var receivedEvent map[string]any | ||
| require.Eventually(t, func() bool { | ||
| events := helpers.GetMockEvents(t, c) | ||
| event, found := helpers.FindEventByNodeAndCheckName(events, testCtx.NodeName, "CSPMaintenance", false) | ||
| if found { | ||
| receivedEvent = event | ||
| return true | ||
| } | ||
|
|
||
| t.Log("Validating received CloudEvent") | ||
|
|
||
| helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY") | ||
| return false | ||
| }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "event should be exported via changestream") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: CloudEvent validation is unreachable.
ValidateCloudEvent is called after the if found { ... return true } block, so it will never execute when an event is found. The validation logic is in the "not found" code path, which defeats its purpose.
π Proposed fix
t.Log("Waiting for event to be exported via changestream")
var receivedEvent map[string]any
require.Eventually(t, func() bool {
events := helpers.GetMockEvents(t, c)
event, found := helpers.FindEventByNodeAndCheckName(events, testCtx.NodeName, "CSPMaintenance", false)
if found {
receivedEvent = event
+ t.Log("Validating received CloudEvent")
+ helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY")
return true
}
-
- t.Log("Validating received CloudEvent")
-
- helpers.ValidateCloudEvent(t, receivedEvent, testCtx.NodeName, "CSP maintenance scheduled", "CSPMaintenance", "", "STORE_ONLY")
return false
}, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "event should be exported via changestream")π€ Prompt for AI Agents
In tests/csp_health_monitor_test.go around lines 503 to 517, the call to
helpers.ValidateCloudEvent is placed after the branch that returns true when an
event is found, making validation unreachable; to fix, move the
ValidateCloudEvent call into the if found { ... } block right after assigning
receivedEvent (before returning true), so the test validates the CloudEvent when
found (and keep the existing t.Log lines appropriately before/after validation).
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
Release Notes
βοΈ Tip: You can customize this high-level summary in your review settings.