-
Notifications
You must be signed in to change notification settings - Fork 32
feat: add event handling strategy changes in kubernetes object monitor #639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in kubernetes object monitor #639
Conversation
Signed-off-by: Tanisha goyal <[email protected]>
π WalkthroughWalkthroughAdds a ProcessingStrategy enum and HealthEvent.processingStrategy field; threads a processing-strategy config/flag through object-monitor, publisher, exporter, connector, and store pipelines; filters STORE_ONLY events from remediation/control paths while still exporting them. Changes
Sequence DiagramsequenceDiagram
actor Operator
participant Helm as Helm Values
participant Monitor as Object Monitor
participant Publisher
participant Exporter as Event Exporter
participant Connector as Platform Connector
participant Node as Kubernetes Node
Operator->>Helm: set processingStrategy (EXECUTE_REMEDIATION|STORE_ONLY)
Helm->>Monitor: supply flag/value
Monitor->>Monitor: validate & map -> pb.ProcessingStrategy
Monitor->>Publisher: New(..., processingStrategy)
Monitor->>Monitor: create AnnotationMgr(processingStrategy)
rect rgb(200,255,200)
Note over Monitor,Connector: EXECUTE_REMEDIATION path
Monitor->>Publisher: PublishHealthEvent(event)
Publisher->>Publisher: resolve strategy (event or default)
Publisher->>Exporter: emit CloudEvent (includes processingStrategy)
Exporter->>Connector: forward event
Connector->>Connector: filterProcessableEvents() -> passes
Connector->>Node: create/update conditions & events
end
rect rgb(255,230,230)
Note over Monitor,Connector: STORE_ONLY path
Monitor->>Publisher: PublishHealthEvent(event)
Publisher->>Exporter: emit CloudEvent (processingStrategy=STORE_ONLY)
Exporter->>Connector: forward event
Connector->>Connector: filterProcessableEvents() -> drops event
Connector->>Node: skip condition/annotation updates
Monitor->>Monitor: AnnotationMgr Add/Remove skip when STORE_ONLY
end
Estimated code review effortπ― 4 (Complex) | β±οΈ ~50 minutes
Pre-merge checks and finishing touchesβ Failed checks (1 warning)
β Passed checks (2 passed)
β¨ Finishing touches
π§ͺ Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π§Ή Nitpick comments (8)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)
107-112: Well-documented configuration option.The new
processingStrategyfield is properly documented with valid values and behavioral explanations. The defaultEXECUTE_REMEDIATIONmaintains backward compatibility.Consider adding validation in the Helm templates to fail early if an invalid value is provided (e.g., via
.Values.processingStrategy | upper | mustRegexMatch "^(EXECUTE_REMEDIATION|STORE_ONLY)$"), though this is optional if validation happens at the application level.tests/helpers/kube.go (1)
2341-2366: Missing container-not-found check.Unlike
SetDeploymentArgs(lines 2282-2284),RemoveDeploymentArgsdoesn't return an error if the specifiedcontainerNameis not found in the deployment. This inconsistency could hide configuration errors.π Proposed fix
func RemoveDeploymentArgs( ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string, ) error { return retry.RetryOnConflict(retry.DefaultRetry, func() error { deployment := &appsv1.Deployment{} if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil { return err } if len(deployment.Spec.Template.Spec.Containers) == 0 { return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName) } + found := false + for i := range deployment.Spec.Template.Spec.Containers { container := &deployment.Spec.Template.Spec.Containers[i] if containerName != "" && container.Name != containerName { continue } + found = true + removeArgsFromContainer(container, args) } + if containerName != "" && !found { + return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName) + } + return c.Resources().Update(ctx, deployment) }) }health-monitors/kubernetes-object-monitor/pkg/config/types.go (2)
49-50: Consider adding validation for ProcessingStrategy values.The
ProcessingStrategyfield is a string without validation, which could allow invalid values to be silently accepted. Consider validating against allowed values (e.g., "EXECUTE_REMEDIATION", "STORE_ONLY") either at config load time or through a custom unmarshal function.
50-50: Add godoc comment for exported field.The exported
ProcessingStrategyfield should have a godoc comment that describes its purpose and allowed values, following Go conventions.As per coding guidelines: "Function comments required for all exported Go functions" (applies to exported fields as well).
health-monitors/kubernetes-object-monitor/main.go (1)
71-75: Consider validating the processing-strategy flag value.The flag accepts any string value but only "EXECUTE_REMEDIATION" and "STORE_ONLY" are valid. Consider adding validation in the
run()function to fail fast with a clear error message if an invalid value is provided.π Suggested validation
func run() error { ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) defer stop() + // Validate processing strategy + validStrategies := []string{"EXECUTE_REMEDIATION", "STORE_ONLY"} + isValid := false + for _, valid := range validStrategies { + if *processingStrategyFlag == valid { + isValid = true + break + } + } + if !isValid { + return fmt.Errorf("invalid processing-strategy %q, must be one of: %v", *processingStrategyFlag, validStrategies) + } + params := initializer.Params{ PolicyConfigPath: *policyConfigPath,Also applies to: 101-101
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
50-73: Strategy override logic is well-designed.The pattern of defaulting to the publisher's strategy while allowing per-policy overrides is flexible. The validation against the generated
pb.ProcessingStrategy_valuemap ensures consistency with the protobuf definition.One minor observation: consider adding context to the error message (e.g.,
"policy %s: unexpected processingStrategy...") to help identify which policy has the invalid configuration during debugging.π Optional: Add policy context to error message
if policy.HealthEvent.ProcessingStrategy != "" { value, ok := pb.ProcessingStrategy_value[policy.HealthEvent.ProcessingStrategy] if !ok { - return fmt.Errorf("unexpected processingStrategy value: %q", policy.HealthEvent.ProcessingStrategy) + return fmt.Errorf("policy %q: unexpected processingStrategy value: %q", policy.Name, policy.HealthEvent.ProcessingStrategy) }platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Consider using Debug log level for filtered events.Using
slog.Infofor every skippedSTORE_ONLYevent could generate high log volume in production. Considerslog.Debugfor consistency with similar skip-logging patterns elsewhere in the codebase (e.g.,manager.golines 51, 70).π Proposed change
for _, healthEvent := range healthEvents.Events { if healthEvent.ProcessingStrategy == protos.ProcessingStrategy_STORE_ONLY { - slog.Info("Skipping STORE_ONLY health event (no node conditions / node events)", + slog.Debug("Skipping STORE_ONLY health event (no node conditions / node events)", "node", healthEvent.NodeName, "checkName", healthEvent.CheckName, "agent", healthEvent.Agent)health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (1)
116-116: Minor: duplicate cast of strategyValue.The
pb.ProcessingStrategy(strategyValue)cast is performed twice (once on line 91 for the publisher, and again on line 116 for registerControllers). Consider storing the cast result in a variable to avoid repetition.π Proposed refactor
+ processingStrategy := pb.ProcessingStrategy(strategyValue) + slog.Info("Event handling strategy configured", "processingStrategy", params.ProcessingStrategy) - pub := publisher.New(pcClient, pb.ProcessingStrategy(strategyValue)) + pub := publisher.New(pcClient, processingStrategy) // ... later ... - if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, pb.ProcessingStrategy(strategyValue)); err != nil { + if err := registerControllers(mgr, evaluator, pub, cfg.Policies, params.MaxConcurrentReconciles, processingStrategy); err != nil {
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (26)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/kubernetes-object-monitor/main.gohealth-monitors/kubernetes-object-monitor/pkg/annotations/manager.gohealth-monitors/kubernetes-object-monitor/pkg/config/types.gohealth-monitors/kubernetes-object-monitor/pkg/initializer/initializer.gohealth-monitors/kubernetes-object-monitor/pkg/publisher/publisher.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/event_exporter_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/kubernetes_object_monitor_test.go
π§° Additional context used
π Path-based instructions (5)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/kubernetes-object-monitor/pkg/config/types.gotests/helpers/healthevent.gofault-quarantine/pkg/initializer/init.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/event_exporter_test.gostore-client/pkg/client/pipeline_builder.goevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gotests/kubernetes_object_monitor_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gostore-client/pkg/client/mongodb_pipeline_builder.gohealth-monitors/kubernetes-object-monitor/pkg/publisher/publisher.gohealth-monitors/kubernetes-object-monitor/pkg/initializer/initializer.gostore-client/pkg/client/pipeline_builder_test.gohealth-monitors/kubernetes-object-monitor/main.gotests/helpers/kube.gohealth-monitors/kubernetes-object-monitor/pkg/annotations/manager.gotests/helpers/event_exporter.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/event_exporter_test.goevent-exporter/pkg/transformer/cloudevents_test.gotests/kubernetes_object_monitor_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gostore-client/pkg/client/pipeline_builder_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (8)
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.gotests/kubernetes_object_monitor_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/helpers/kube.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
𧬠Code graph analysis (12)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/kubernetes_object_monitor_test.go (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
Config(16-18)tests/helpers/kube.go (9)
SetDeploymentArgs(2255-2288)NVSentinelNamespace(64-64)WaitForDeploymentRollout(960-1101)SetNodeConditionStatus(1709-1770)GetNodeByName(442-451)NeverWaitTimeout(62-62)WaitInterval(63-63)EventuallyWaitTimeout(61-61)RemoveDeploymentArgs(2341-2366)commons/pkg/auditlogger/auditlogger.go (1)
Log(114-134)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
data-models/pkg/protos/health_event.pb.go (16)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)Entity(208-214)Entity(227-227)Entity(242-244)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (3)
data-models/pkg/protos/health_event_grpc.pb.go (1)
PlatformConnectorClient(43-45)health-monitors/kubernetes-object-monitor/pkg/config/types.go (1)
Policy(20-27)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(43-45)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_STORE_ONLY(47-47)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (7)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
π Additional comments (37)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM!The test expectation correctly includes the new
processingStrategyfield withfloat64(0)(representingEXECUTE_REMEDIATION, the default enum value). This aligns with the proto changes whereProcessingStrategydefaults to 0.data-models/protobufs/health_event.proto (2)
32-38: Well-designed enum with correct default.Good proto3 design:
EXECUTE_REMEDIATION = 0ensures backward compatibility since unset fields default to 0 (the normal remediation behavior). The comments clearly document both strategies per coding guidelines.
77-77: Appropriate field placement.Field number 16 continues the sequential numbering after
drainOverrides = 15. This is a non-breaking addition that maintains wire compatibility with existing messages.tests/helpers/kube.go (2)
2208-2249: LGTM!Clean implementation of
WaitForDaemonSetRolloutthat follows the same pattern asWaitForDeploymentRollout. The rollout completion checks (desired, updated, ready pods) are correct.
2290-2336: Approve helper implementation.The
setArgsOnContainerhelper correctly handles the three argument styles (--flag=value,--flag,--flag value) with appropriate insertion and update logic. Thebreakstatements prevent index issues during slice modification.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-17: LGTM!The
ProcessingStrategyenum class and top-level constants are correctly defined, matching the protobuf definition.Also applies to: 31-32
78-78: LGTM!The
HealthEventmessage correctly includes the newprocessingStrategyfield in__slots__, field number constant, attribute declaration, and__init__signature. The type hint_Optional[_Union[ProcessingStrategy, str]]is appropriate for protobuf enum fields.Also applies to: 104-104, 120-120, 138-138
tests/event_exporter_test.go (2)
25-25: LGTM!The import addition aligns with the updated test helper function signature.
85-85: LGTM!The test correctly validates the expected processing strategy value. The change properly extends the test to verify the new processingStrategy field in CloudEvents.
event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM!The addition of
processingStrategyto the CloudEvent data payload correctly uses the.String()method to serialize the enum value. The placement is consistent with other health event fields.distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)
51-53: LGTM!The conditional rendering of
processingStrategyfollows the same pattern as theerrorCodefield above (lines 48-50) and properly quotes the value for TOML format. The optional nature ensures backward compatibility when the field is not specified.event-exporter/pkg/transformer/cloudevents_test.go (2)
69-69: LGTM!Setting
ProcessingStrategy_STORE_ONLYprovides good test coverage for the non-default enum value.
106-108: LGTM!The validation correctly verifies that the
processingStrategyfield is properly serialized as "STORE_ONLY" in the CloudEvent data. This confirms the.String()method works as expected.store-client/pkg/client/mongodb_pipeline_builder.go (1)
19-19: LGTM!The import is necessary to reference
ProcessingStrategy_EXECUTE_REMEDIATIONin the new pipeline.distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml (1)
57-57: No action needed. TheprocessingStrategyvalue has a default value ofEXECUTE_REMEDIATIONalready defined in values.yaml at line 112.fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM! Filtering STORE_ONLY events is the intended behavior.The change from
BuildAllHealthEventInsertsPipeline()toBuildProcessableHealthEventInsertsPipeline()correctly enables fault-quarantine to process only events withProcessingStrategy = EXECUTE_REMEDIATION, excluding observability-onlySTORE_ONLYevents. This aligns with the PR objectives for conditional event processing.Producers are properly configured to set
ProcessingStrategybased on policy definitions, withEXECUTE_REMEDIATIONas the default. Observability-only events must be explicitly marked asSTORE_ONLYvia policy configuration, so critical events won't be accidentally filtered.store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM! Test follows established patterns.The new test for
BuildProcessableHealthEventInsertsPipelinecorrectly mirrors the structure of existing pipeline tests, using table-driven approach across both MongoDB and PostgreSQL builders.store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM! Clear interface addition with good documentation.The new method is well-documented with clear use case and filtering behavior.
tests/helpers/healthevent.go (1)
48-48: LGTM! Builder pattern follows established conventions.The
ProcessingStrategyfield and builder method follow the existing patterns in this test helper. Usinginttype provides flexibility for testing edge cases beyond the defined enum values.Also applies to: 153-156
store-client/pkg/client/postgresql_pipeline_builder.go (1)
119-132: LGTM! Implementation correctly filters for EXECUTE_REMEDIATION strategy.The pipeline matches insert operations where
processingStrategyequalsEXECUTE_REMEDIATION, consistent with the interface contract and MongoDB implementation.platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1391-1589: Test logic validates STORE_ONLY behavior correctly, but consider migration to envtest.The test effectively validates that
STORE_ONLYevents don't create node conditions or Kubernetes events, whileEXECUTE_REMEDIATIONevents do. Test cases cover the key scenarios well.However, the test uses
fake.NewSimpleClientset()rather thanenvtestfor Kubernetes controller testing. Based on learnings, envtest is preferred for testing Kubernetes controllers.Consider migrating this test to use
envtestin a future refactor to align with the coding guidelines. The fake client is acceptable for unit tests but envtest provides better integration testing.tests/helpers/event_exporter.go (2)
220-254: LGTM! Useful helper for finding events by check name.The new
FindEventByNodeAndCheckNamefunction provides a clean way to search for events by multiple criteria including health status.
261-261: LGTM! Validation extended to include processingStrategy.The
ValidateCloudEventfunction correctly updated to validate theprocessingStrategyfield in CloudEvent payloads. This ensures tests verify the processing strategy is properly propagated through the event pipeline.Also applies to: 281-281
tests/kubernetes_object_monitor_test.go (4)
128-159: LGTM! Test setup correctly configures STORE_ONLY strategy.The setup properly:
- Identifies a non-KWOK test node
- Applies deployment args to enable STORE_ONLY processing
- Waits for deployment rollout before proceeding
161-188: LGTM! Correct use of require.Never for negative assertion.The test correctly validates that
STORE_ONLYstrategy does not create node annotations by usingrequire.Never, which asserts the condition never becomes true within the timeout period.
190-217: Clarify the purpose of the "Node Ready recovery clears annotation" assessment.This assessment expects an annotation to be cleared when the node becomes Ready, but in
STORE_ONLYmode, the annotation should never have been created in the first place (as verified by the previous assessment).Is this assessment intended to verify that:
- The annotation remains absent when the condition changes to True? (If so,
require.Neverwould be more appropriate)- Some edge case where an annotation might exist from a previous non-STORE_ONLY run?
Consider whether this assessment should use
require.Neverinstead ofrequire.Eventuallyto verify the annotation continues to not exist, or add a comment explaining whyEventuallyis appropriate here.
219-230: LGTM! Teardown properly restores deployment state.The teardown correctly removes the STORE_ONLY args and waits for the deployment to stabilize, ensuring test isolation.
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (4)
24-24: LGTM!Import for the protobuf package is correctly added to support the new
ProcessingStrategytype.
38-45: LGTM!The
Managerstruct and constructor are correctly extended to accept and store theprocessingStrategy. The field naming follows Go conventions.
50-53: LGTM!The early return for
STORE_ONLYstrategy correctly prevents annotation updates while maintaining debug logging for observability. This aligns with the PR objective to support different event handling strategies.
69-72: LGTM!Consistent implementation with
AddMatch- theSTORE_ONLYguard correctly skips annotation removal with appropriate debug logging.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Generated protobuf code - no manual review required.This file is auto-generated by the protocol buffer compiler as indicated by the header comments. Ensure this file is regenerated from the source
.protofile rather than manually edited. The changes correctly reflect the addition of theProcessingStrategyenum andprocessingStrategyfield in theHealthEventmessage.health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
36-45: LGTM!The
Publisherstruct and constructor are correctly extended to store and initialize the defaultprocessingStrategy. The naming convention follows Go standards.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
372-416: LGTM on the overall flow.The
processHealthEventsfunction correctly:
- Filters out
STORE_ONLYevents before processing- Updates node conditions only for processable healthy/fatal events
- Creates Kubernetes events for non-healthy, non-fatal events
The separation of filtering, condition updates, and event creation is clean and maintainable.
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
47-55: LGTM!The
Paramsstruct is correctly extended withProcessingStrategyfield. This allows the strategy to be configured externally (likely via command-line flag or environment variable based on the AI summary).
163-171: LGTM!The
registerControllersfunction signature is correctly updated to acceptprocessingStrategy, and it's properly passed toannotations.NewManager. This ensures the strategy is consistently applied throughout the controller initialization path.
83-91: Default value handling is already provided at the flag level inmain.go.The
processingStrategyFlaghas a default value of"EXECUTE_REMEDIATION"(line 73), which is passed to the initializer via*processingStrategyFlag. This ensures that under normal execution,params.ProcessingStrategywill never be empty. However, the validation at line 84-86 will still correctly reject invalid or empty values if the initializer is called directly with unvalidated input. Consider adding a comment to document this dependency on the caller providing a valid value.
tests/helpers/kube.go
Outdated
| // RemoveDeploymentArgs removes environment variables from containers in a deployment. | ||
| // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. | ||
| // Uses retry.RetryOnConflict for automatic retry handling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix incorrect docstring.
The docstring says "removes environment variables" but this function removes container arguments. This appears to be copy-paste from RemoveDeploymentEnvVars.
π Proposed fix
-// RemoveDeploymentArgs removes environment variables from containers in a deployment.
+// RemoveDeploymentArgs removes arguments from containers in a deployment.
// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
// Uses retry.RetryOnConflict for automatic retry handling.π Committable suggestion
βΌοΈ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // RemoveDeploymentArgs removes environment variables from containers in a deployment. | |
| // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. | |
| // Uses retry.RetryOnConflict for automatic retry handling. | |
| // RemoveDeploymentArgs removes arguments from containers in a deployment. | |
| // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. | |
| // Uses retry.RetryOnConflict for automatic retry handling. |
π€ Prompt for AI Agents
In tests/helpers/kube.go around lines 2338-2340, the docstring incorrectly
states that the function "removes environment variables" while the
implementation actually removes container arguments; update the comment to
accurately describe that the function removes container arguments from a
Deployment (and note behavior for empty containerName applies to all
containers), and remove or replace the reference to RemoveDeploymentEnvVars to
avoid copy-paste confusion; keep the mention of retry.RetryOnConflict and the
behavior for the containerName parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (1)
tests/kubernetes_object_monitor_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/kubernetes_object_monitor_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/kubernetes_object_monitor_test.go
π§ Learnings (2)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/kubernetes_object_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/kubernetes_object_monitor_test.go
𧬠Code graph analysis (1)
tests/kubernetes_object_monitor_test.go (4)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)tests/helpers/kube.go (7)
SetDeploymentArgs(2255-2288)NVSentinelNamespace(64-64)WaitForDeploymentRollout(960-1101)SetNodeConditionStatus(1709-1770)GetNodeByName(442-451)EventuallyWaitTimeout(61-61)WaitInterval(63-63)tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382)QuarantineAssertion(56-60)tests/helpers/event_exporter.go (3)
GetMockEvents(36-99)FindEventByNodeAndCheckName(221-254)ValidateCloudEvent(257-283)
π Additional comments (1)
tests/kubernetes_object_monitor_test.go (1)
24-24: LGTM!The import is correctly added to support the helper functions used in the new test.
| t.Log("Waiting for policy match annotation to be cleared") | ||
| require.Eventually(t, func() bool { | ||
| node, err := helpers.GetNodeByName(ctx, client, nodeName) | ||
| if err != nil { | ||
| t.Logf("Failed to get node: %v", err) | ||
| return false | ||
| } | ||
|
|
||
| annotation, exists := node.Annotations[annotationKey] | ||
| if exists && annotation != "" { | ||
| t.Logf("Annotation still exists: %s", annotation) | ||
| return false | ||
| } | ||
|
|
||
| return true | ||
| }, helpers.EventuallyWaitTimeout, helpers.WaitInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary annotation check in teardown.
The test verifies that the annotation is never set when using STORE_ONLY strategy (lines 171-186). Therefore, waiting for the annotation to be cleared in teardown is unnecessary and adds confusion. Consider removing this check or adding a comment explaining why it's needed.
π Suggested simplification
helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue)
- t.Log("Waiting for policy match annotation to be cleared")
- require.Eventually(t, func() bool {
- node, err := helpers.GetNodeByName(ctx, client, nodeName)
- if err != nil {
- t.Logf("Failed to get node: %v", err)
- return false
- }
-
- annotation, exists := node.Annotations[annotationKey]
- if exists && annotation != "" {
- t.Logf("Annotation still exists: %s", annotation)
- return false
- }
-
- return true
- }, helpers.EventuallyWaitTimeout, helpers.WaitInterval)
-
err = helpers.RemoveDeploymentArgs(ctx, client, "kubernetes-object-monitor", helpers.NVSentinelNamespace, "", map[string]string{π Committable suggestion
βΌοΈ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| t.Log("Waiting for policy match annotation to be cleared") | |
| require.Eventually(t, func() bool { | |
| node, err := helpers.GetNodeByName(ctx, client, nodeName) | |
| if err != nil { | |
| t.Logf("Failed to get node: %v", err) | |
| return false | |
| } | |
| annotation, exists := node.Annotations[annotationKey] | |
| if exists && annotation != "" { | |
| t.Logf("Annotation still exists: %s", annotation) | |
| return false | |
| } | |
| return true | |
| }, helpers.EventuallyWaitTimeout, helpers.WaitInterval) |
π€ Prompt for AI Agents
In tests/kubernetes_object_monitor_test.go around lines 221 to 236, the teardown
currently waits for a policy match annotation to be cleared even though earlier
assertions ensure the annotation is never set for STORE_ONLY strategy; remove
the entire Eventual check (the t.Log + require.Eventually block) to avoid
redundant waiting, or if you prefer to keep it for defensive reasons add a short
comment explaining why an explicit teardown check is required despite prior
assertions (e.g., "defensive check: ensure no stray annotation was ever written
during test").
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
f6e73ed to
8fc966f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
β»οΈ Duplicate comments (1)
tests/kubernetes_object_monitor_test.go (1)
191-218: Redundant annotation clearing check.This assess step waits for the annotation to be cleared, but the previous assess (lines 171-187) already verified that the annotation is never set when using the STORE_ONLY strategy. Waiting for an annotation to be cleared when it was never created is unnecessary and adds confusion to the test flow.
Consider removing this entire assess step, or if you want to keep it for defensive reasons, add a comment explaining why this check is necessary despite the earlier assertion that the annotation is never set.
π Suggested simplification
- feature.Assess("Node Ready recovery clears annotation", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { - client, err := c.NewClient() - require.NoError(t, err) - - nodeName := ctx.Value(k8sMonitorKeyNodeName).(string) - t.Logf("Setting TestCondition to True on node %s", nodeName) - - helpers.SetNodeConditionStatus(ctx, t, client, nodeName, v1.NodeConditionType(testConditionType), v1.ConditionTrue) - - t.Log("Waiting for policy match annotation to be cleared") - require.Eventually(t, func() bool { - node, err := helpers.GetNodeByName(ctx, client, nodeName) - if err != nil { - t.Logf("Failed to get node: %v", err) - return false - } - - annotation, exists := node.Annotations[annotationKey] - if exists && annotation != "" { - t.Logf("Annotation still exists: %s", annotation) - return false - } - - return true - }, helpers.EventuallyWaitTimeout, helpers.WaitInterval) - - return ctx - }) -
π§Ή Nitpick comments (4)
tests/helpers/kube.go (1)
387-409: LGTM! Function correctly implements negative event assertion.The logic properly uses
require.Neverto ensure the specified event type and reason do not appear on the node.Optional: Consider improving log message clarity
At line 405, the log message could include both
eventTypeandeventReasonfor consistency with the assertion message at line 408:- t.Logf("node %s does not have event %v", nodeName, eventType) + t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)tests/platform-connector_test.go (1)
28-32: Remove unused struct fields.The
ConfigMapBackupandTestNamespacefields are declared but never used anywhere in the test. Consider removing them to keep the code clean.π Proposed cleanup
type PlatformConnectorTestContext struct { NodeName string - ConfigMapBackup []byte - TestNamespace string }health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (2)
84-87: Consider case-insensitive processing strategy validation.The validation is case-sensitive and requires an exact match with the proto enum names ("EXECUTE_REMEDIATION" or "STORE_ONLY"). This could lead to user confusion if they provide lowercase values like "store_only".
Consider normalizing the input with
strings.ToUpper()before validation, or enhance the error message to list the valid values.π Suggested improvement
- strategyValue, ok := pb.ProcessingStrategy_value[params.ProcessingStrategy] + strategyValue, ok := pb.ProcessingStrategy_value[strings.ToUpper(params.ProcessingStrategy)] if !ok { - return nil, fmt.Errorf("unexpected processingStrategy value: %q", params.ProcessingStrategy) + return nil, fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", params.ProcessingStrategy) }
54-54: Document the ProcessingStrategy field.The new
ProcessingStrategyfield in theParamsstruct would benefit from a comment explaining its purpose and valid values, especially since it's part of a public API.MaxConcurrentReconciles int PlatformConnectorSocket string + // ProcessingStrategy determines how health events are processed. + // Valid values: "EXECUTE_REMEDIATION" (default behavior with remediation actions) + // "STORE_ONLY" (events are stored but no remediation is triggered) ProcessingStrategy string }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (19)
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yamldistros/kubernetes/nvsentinel/values-tilt-postgresql.yamldocs/postgresql-schema.sqlfault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/kubernetes-object-monitor/main.gohealth-monitors/kubernetes-object-monitor/pkg/annotations/manager.gohealth-monitors/kubernetes-object-monitor/pkg/config/types.gohealth-monitors/kubernetes-object-monitor/pkg/initializer/initializer.gohealth-monitors/kubernetes-object-monitor/pkg/publisher/publisher.gotests/event_exporter_test.gotests/fault_quarantine_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/kubernetes_object_monitor_test.gotests/platform-connector_test.go
π§ Files skipped from review as they are similar to previous changes (7)
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
- distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/deployment.yaml
- health-monitors/kubernetes-object-monitor/main.go
- health-monitors/kubernetes-object-monitor/pkg/config/types.go
- tests/event_exporter_test.go
- health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go
π§° Additional context used
π Path-based instructions (3)
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/healthevent.gotests/fault_quarantine_test.gotests/helpers/event_exporter.gofault-quarantine/pkg/initializer/init.gotests/kubernetes_object_monitor_test.gohealth-monitors/kubernetes-object-monitor/pkg/initializer/initializer.gotests/helpers/kube.gotests/platform-connector_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/fault_quarantine_test.gotests/kubernetes_object_monitor_test.gotests/platform-connector_test.go
π§ Learnings (8)
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yamltests/helpers/event_exporter.gotests/platform-connector_test.godistros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.gotests/kubernetes_object_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/fault_quarantine_test.gotests/kubernetes_object_monitor_test.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/fault_quarantine_test.gotests/kubernetes_object_monitor_test.gotests/helpers/kube.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/platform-connector_test.go
𧬠Code graph analysis (5)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
QuarantineTestContext(51-54)SetupQuarantineTest(107-112)AssertQuarantineState(315-382)QuarantineAssertion(56-60)tests/helpers/kube.go (1)
SetNodeManagedByNVSentinel(1389-1408)tests/helpers/healthevent.go (3)
NewHealthEvent(60-76)SendHealthEvent(263-275)SendHealthyEvent(277-287)data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
tests/kubernetes_object_monitor_test.go (1)
tests/helpers/kube.go (9)
SetDeploymentArgs(2279-2312)NVSentinelNamespace(64-64)WaitForDeploymentRollout(984-1125)SetNodeConditionStatus(1733-1794)GetNodeByName(466-475)NeverWaitTimeout(62-62)WaitInterval(63-63)EventuallyWaitTimeout(61-61)RemoveDeploymentArgs(2365-2390)
health-monitors/kubernetes-object-monitor/pkg/initializer/initializer.go (3)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
New(41-46)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(43-45)
tests/platform-connector_test.go (4)
tests/helpers/kube.go (5)
SelectTestNodeFromUnusedPool(413-437)EnsureNodeConditionNotPresent(1797-1818)EnsureNodeEventNotPresent(387-409)CheckNodeConditionExists(1283-1300)CheckNodeEventExists(1325-1353)tests/helpers/healthevent.go (3)
NewHealthEvent(60-76)SendHealthEvent(263-275)SendHealthyEvent(277-287)tests/helpers/health_events_analyzer.go (2)
ERRORCODE_79(35-35)ERRORCODE_31(32-32)data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
π Additional comments (16)
tests/helpers/kube.go (3)
2232-2273: LGTM! DaemonSet rollout verification is correct.The function properly checks all relevant status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to verify rollout completion, with appropriate logging at each stage.
2275-2360: LGTM! Deployment args manipulation is well-implemented.The function correctly uses
retry.RetryOnConflictwithout wrapping errors (as per coding guidelines), andsetArgsOnContainerproperly handles the three argument formats:--flag=value,--flag value, and--flag(boolean).
2362-2417: LGTM! Argument removal logic is correct.The docstring correctly states "removes container arguments" and the implementation properly handles removal of arguments in different formats (--flag=value, --flag value, --flag). The function correctly uses
retry.RetryOnConflictwithout wrapping errors.Note: The past review comment about an incorrect docstring referencing environment variables appears to have been resolved in the current code.
tests/helpers/healthevent.go (2)
48-48: LGTM!The
ProcessingStrategyfield is correctly defined with an appropriate type and JSON tag, consistent with other optional fields in the struct.
153-156: LGTM!The builder method follows the established pattern and enables fluent chaining, consistent with all other
With*methods in this helper.tests/helpers/event_exporter.go (2)
220-254: LGTM!The
FindEventByNodeAndCheckNamefunction correctly implements CloudEvent search by nodeName, checkName, and isHealthy status, following the same defensive pattern as the existingFindEventByNodeAndMessagefunction.
256-283: LGTM!The
ValidateCloudEventfunction is correctly extended to validate theprocessingStrategyfield, following the same assertion pattern as other health event fields.distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)
219-222: LGTM!The
processing_strategycolumn is correctly added to thehealth_eventstable schema with an appropriate VARCHAR(50) type. The column is nullable, which aligns with the optional nature of this field in the proto definition and JSON serialization.fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM!Correctly switches to
BuildProcessableHealthEventInsertsPipeline()to ensure fault-quarantine only processesEXECUTE_REMEDIATIONevents, filtering outSTORE_ONLYobservability events.docs/postgresql-schema.sql (1)
106-109: LGTM!The
processing_strategycolumn is correctly added to the canonical PostgreSQL schema and matches the corresponding change invalues-tilt-postgresql.yaml, maintaining consistency between the two schema definitions as required.distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/templates/configmap.yaml (1)
51-53: LGTM!The
processingStrategyfield is correctly rendered in the ConfigMap template with proper conditional logic, indentation, and quoting, following the same pattern as theerrorCodefield.tests/fault_quarantine_test.go (4)
26-26: LGTM!The import of the
protospackage is necessary to reference theProcessingStrategyenum constants used in the test.
233-250: LGTM!The test setup correctly initializes the test context and marks the node as managed by NVSentinel, which is required for the quarantine behavior being tested.
252-286: LGTM!The test assessments correctly verify that
STORE_ONLYevents do not trigger quarantine whileEXECUTE_REMEDIATIONevents do, effectively validating the core behavior of the processing strategy feature.
288-294: LGTM!The teardown properly cleans up by sending a healthy event and calling the standard teardown helper.
distros/kubernetes/nvsentinel/charts/kubernetes-object-monitor/values.yaml (1)
107-112: LGTM!The
processingStrategyfield is well-documented with clear explanations of both modes and sensible defaults. The default valueEXECUTE_REMEDIATIONmaintains backward compatibility while enabling the new observability-only mode when needed.
| t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName) | ||
| helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling and assertion on condition check.
CheckNodeConditionExists returns (*v1.NodeCondition, error) but both return values are ignored. This means:
- Errors during the check are silently swallowed
- The test doesn't actually verify that the condition existsβit just calls the helper and moves on
You should wrap this in a require.Eventually block to wait for the condition to appear, and check the error.
π Proposed fix
t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName)
- helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")
+ require.Eventually(t, func() bool {
+ condition, err := helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")
+ if err != nil {
+ t.Logf("Error checking condition: %v", err)
+ return false
+ }
+ return condition != nil
+ }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have condition SysLogsXIDError", testCtx.NodeName)Committable suggestion skipped: line range outside the PR's diff.
| t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName) | ||
| helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing return value check on event existence verification.
CheckNodeEventExists returns (bool, *v1.Event) but both return values are ignored. The test doesn't verify that the event actually existsβit just calls the helper. You should check the boolean return value to assert the event is present.
π Proposed fix
t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)
- helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")
+ require.Eventually(t, func() bool {
+ found, event := helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")
+ if !found {
+ t.Logf("Event not found yet")
+ return false
+ }
+ t.Logf("Found event: %v", event.Name)
+ return true
+ }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have event SysLogsXIDError", testCtx.NodeName)π Committable suggestion
βΌοΈ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName) | |
| helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") | |
| t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName) | |
| require.Eventually(t, func() bool { | |
| found, event := helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") | |
| if !found { | |
| t.Logf("Event not found yet") | |
| return false | |
| } | |
| t.Logf("Found event: %v", event.Name) | |
| return true | |
| }, helpers.EventuallyWaitTimeout, helpers.WaitInterval, "node %s should have event SysLogsXIDError", testCtx.NodeName) |
π€ Prompt for AI Agents
In tests/platform-connector_test.go around lines 92-93, the call to
helpers.CheckNodeEventExists ignores its (bool, *v1.Event) return values so the
test doesn't assert the event actually exists; update the test to capture the
boolean (and optionally the Event), and assert the boolean is true (using
t.Fatalf or t.Errorf) with a helpful message if false, and only proceed (or log
the event) when the returned bool indicates the event exists.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Documentation
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.