Skip to content

Cae 633 add functional tests notifications #3357

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: feature/maintenance-events
Choose a base branch
from

Conversation

kiryazovi-redis
Copy link
Collaborator

@kiryazovi-redis kiryazovi-redis commented Jul 11, 2025

Make sure that:

  • You have read the contribution guidelines.
  • You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
  • You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
  • You submit test cases (unit or integration tests) that back your changes.

CAE-633: Add Functional Tests for Redis Enterprise Push Notifications

Overview

This PR adds comprehensive functional tests to validate Redis Enterprise push notifications during maintenance events, ensuring the Lettuce client correctly captures and processes RESP3 push messages for cluster state changes.

What's Added

  • Complete test suite for all Redis Enterprise push notification types:

    • MOVING - Endpoint migration notifications
    • MIGRATING - Data migration in progress
    • MIGRATED - Data migration completed
    • FAILING_OVER - Failover initiation
    • FAILED_OVER - Failover completion
  • Dynamic cluster discovery with real-time adaptation to cluster state changes

  • Proper RESP3 push message capture using Lettuce's PushListener API

  • Robust test isolation with state recording and restoration between tests

Technical Implementation

Push Notification Capture:

  • Replaced flawed ping-based approach with proper PushListener API integration
  • Added ByteBuffer decoding to extract actual IP addresses from push messages
  • Successfully hooks into Lettuce's protocol layer to capture RESP3 push messages

Cluster State Management:

  • Implemented RedisEnterpriseConfigDiscovery with cache clearing for fresh master/slave lists
  • Added dynamic node selection for failover operations
  • Real-time cluster configuration discovery prevents stale state issues

Fault Injection:

  • Enhanced FaultInjectionClient with proper validation and retry logic
  • Added support for migration operations with appropriate timeouts (300s)
  • Fixed infinite retry loops in validation

Critical Bug Fixes Applied

  1. Double-Prefixing Bug: Fixed malformed shard IDs (redis:redis:1redis:1) in state recording
  2. Migration Timeouts: Increased timeout from 120s to 300s for data migration operations
  3. Restoration Logic: Fixed cluster state restoration to only failover master shards (Redis Enterprise constraint)

Test Results

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
Total time: 09:05 min

All notification types successfully validated with proper push message capture and cluster state management.

Files Changed

  • MaintenanceNotificationTest.java - Main test suite with 5 comprehensive tests
  • FaultInjectionClient.java - Enhanced fault injection with proper timeouts
  • RedisEnterpriseConfigDiscovery.java - Dynamic cluster discovery implementation
  • RedisEnterpriseConfig.java - Cluster configuration models
  • Additional utility classes for test infrastructure

Benefits

  • Production readiness: Validates real-world Redis Enterprise maintenance scenarios
  • Proactive monitoring: Ensures applications can react to cluster state changes
  • Regression prevention: Comprehensive test coverage prevents future push notification issues
  • Dynamic adaptation: Tests work with any Redis Enterprise cluster configuration

Testing

Tests validated against live Redis Enterprise clusters with real maintenance operations. The test suite adapts dynamically to different cluster configurations and provides comprehensive coverage of all push notification scenarios.


This PR establishes a solid foundation for Redis Enterprise push notification testing, ensuring the Lettuce client maintains robust support for maintenance event handling in production environments.

@kiryazovi-redis kiryazovi-redis self-assigned this Jul 11, 2025
Copilot

This comment was marked as outdated.

@tishun tishun added the type: task A general task label Jul 14, 2025
@tishun tishun added this to the Async milestone Jul 14, 2025
Copy link
Collaborator

@tishun tishun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, but I think we can refactor the code a bit to make it more readable and easier to maintain

- Fixed receiveMovingPushNotificationTest to properly capture MOVING messages
- Switched from flawed ping-based approach to proper PushListener API
- Added ByteBuffer decoding to extract actual IP addresses from push messages
- Enhanced two-step rebind process (migrate + bind) working correctly
- Successfully capturing RESP3 push messages: MOVING, MIGRATING, MIGRATED, etc.
- Test now passes and captures real MOVING notifications during endpoint rebind

The test now properly hooks into Lettuce's protocol layer to capture
RESP3 push messages that come before command responses, solving the
core issue where we were only seeing PONG responses.
…h notification tests

- Fixed stale cluster configuration issue in RedisEnterpriseConfigDiscovery
- Added proper cache clearing in parseShards() method to get fresh master/slave lists
- Added getNodeWithMasterShards() method for proper node selection in failover tests
- Fixed infinite retry loops in FaultInjectionClient validation
- Updated MaintenanceNotificationTest to use dynamic node selection for both FAILING_OVER and FAILED_OVER tests
- Added CONTEXT_MOVING_PUSH_NOTIFICATIONS.md to .gitignore
- All push notification tests now working with real-time dynamic cluster discovery
- Fix double-prefixing bug in shard ID state recording
  * originalShardRoles.put(masterShard, 'master') instead of 'redis:' + masterShard
  * Prevents malformed redis:redis:1 IDs that broke state restoration

- Increase migration timeout from 120s to 300s in FaultInjectionClient
  * Migration operations move actual data and need more time than failovers
  * Fixes receiveMigratedPushNotificationTest timeout failures

- Fix cluster state restoration logic to only failover master shards
  * Redis Enterprise constraint: only master shards can be failed over
  * Prevents 'ERROR: No such master shard id' when trying to failover slaves
  * Only targets current masters that should become slaves

- Complete test suite now passing: 5/5 tests successful
  * MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER notifications
  * Dynamic cluster discovery with proper state isolation
  * Real-time adaptation to cluster state changes between test runs

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
…numbers

- Remove all legacy methods with hardcoded defaults from FaultInjectionClient:
  * triggerShardMigration(String, String) - 2-parameter version
  * triggerShardFailover(String, String) - 2-parameter version
  * triggerShardFailover(String, String, String) - 3-parameter version
  * triggerMovingNotification(String, String, String) - 3-parameter version
  * validateRladminCommandCompletion() - useless method

- Replace magic numbers with meaningful constants across scenario folder:
  * Use Duration.ofMinutes() for long timeouts (300s->5min, 180s->3min, etc.)
  * Add descriptive constants for all timeout values
  * Improve code readability and maintainability

- Update tests to use 4-parameter dynamic discovery methods
- Fix Jackson import to use proper import instead of full package name
- Update maintenance operation system to handle missing legacy methods

All tests pass and compilation successful.
@kiryazovi-redis kiryazovi-redis force-pushed the CAE-633-add-functional-tests-notifications branch from 505ca71 to 4d87ba7 Compare July 22, 2025 11:53
@kiryazovi-redis kiryazovi-redis requested a review from Copilot July 22, 2025 11:54
Copilot

This comment was marked as outdated.

… tracking

- Increase StepVerifier timeout for long-running operations (migrations/failovers)
- Change from .verifyComplete() to .expectComplete().verify(LONG_OPERATION_TIMEOUT)
- All tests now complete successfully without timing out
- Remove .vscode/settings.json from git tracking
@kiryazovi-redis kiryazovi-redis requested a review from Copilot July 22, 2025 15:47
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive functional tests for Redis Enterprise push notifications during maintenance events, enabling the Lettuce client to properly capture and validate RESP3 push messages for cluster state changes. The implementation includes full test coverage for all notification types (MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER) with dynamic cluster discovery and proper state management.

  • Complete test suite validating all Redis Enterprise push notification types during maintenance operations
  • Dynamic cluster configuration discovery with real-time adaptation to cluster state changes
  • Enhanced fault injection client with proper timeout management and validation for Redis Enterprise operations

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File Description
MaintenanceNotificationTest.java Main test suite with 5 comprehensive tests for all push notification types
RedisEnterpriseConfig.java Dynamic cluster discovery and configuration management with validation
FaultInjectionClient.java Enhanced fault injection with proper timeouts, validation, and Redis Enterprise operation support
FaultInjectionClientUnitTest.java Unit tests for fault injection client validation and basic functionality
Comments suppressed due to low confidence (2)

Comment on lines +288 to +310
* Extract the numeric part of shard ID from full format "redis:X" -> "X"
*/
private String extractNumericShardId(String fullShardId) {
if (fullShardId == null) {
return null;
}
if (fullShardId.contains(":")) {
return fullShardId.split(":")[1];
}
return fullShardId;
}

/**
* Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"
*/
private String extractEndpointId(String fullEndpointId) {
if (fullEndpointId == null) {
return null;
}
if (fullEndpointId.startsWith("endpoint:")) {
return fullEndpointId.substring("endpoint:".length());
}
return fullEndpointId;
Copy link
Preview

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method and extractNumericNodeId() have similar logic but handle null differently. Consider creating a generic extract method or ensuring consistent null handling patterns.

Suggested change
* Extract the numeric part of shard ID from full format "redis:X" -> "X"
*/
private String extractNumericShardId(String fullShardId) {
if (fullShardId == null) {
return null;
}
if (fullShardId.contains(":")) {
return fullShardId.split(":")[1];
}
return fullShardId;
}
/**
* Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"
*/
private String extractEndpointId(String fullEndpointId) {
if (fullEndpointId == null) {
return null;
}
if (fullEndpointId.startsWith("endpoint:")) {
return fullEndpointId.substring("endpoint:".length());
}
return fullEndpointId;
* Extract a specific part of a string based on a prefix and optional delimiter.
*/
private String extractPart(String input, String prefix, String delimiter) {
if (input == null) {
return null;
}
if (prefix != null && input.startsWith(prefix)) {
input = input.substring(prefix.length());
}
if (delimiter != null && input.contains(delimiter)) {
return input.split(delimiter)[1];
}
return input;
}
/**
* Extract the numeric part of shard ID from full format "redis:X" -> "X"
*/
private String extractNumericShardId(String fullShardId) {
return extractPart(fullShardId, "redis:", ":");
}
/**
* Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"
*/
private String extractEndpointId(String fullEndpointId) {
return extractPart(fullEndpointId, "endpoint:", null);

Copilot uses AI. Check for mistakes.

Comment on lines +176 to +179
List<String> currentShards = new ArrayList<>();

// Get current shards (already in "redis:X" format)
currentShards.addAll(currentConfig.getShardsForNode(nodeId));
Copy link
Preview

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary ArrayList creation when currentShards is immediately populated with addAll(). Consider direct assignment: List currentShards = currentConfig.getShardsForNode(nodeId);

Suggested change
List<String> currentShards = new ArrayList<>();
// Get current shards (already in "redis:X" format)
currentShards.addAll(currentConfig.getShardsForNode(nodeId));
// Get current shards (already in "redis:X" format)
List<String> currentShards = currentConfig.getShardsForNode(nodeId);

Copilot uses AI. Check for mistakes.

// Only failover shards that are currently masters but should be slaves
if ("slave".equals(originalRole) && currentConfig.getMasterShardIds().contains(shardId)) {
// Should be slave but is currently master - failover this master
mastersToFailover.add(shardId.replace("redis:", ""));
Copy link
Preview

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String replacement logic is duplicated from RedisEnterpriseConfig.extractNumericShardId(). Consider using the existing utility method or creating a shared utility.

Suggested change
mastersToFailover.add(shardId.replace("redis:", ""));
mastersToFailover.add(RedisEnterpriseConfig.extractNumericShardId(shardId));

Copilot uses AI. Check for mistakes.

@@ -146,8 +173,8 @@ public Mono<Boolean> waitForCompletion(String actionId, Duration checkInterval,
return Mono.just(true);
}
return Mono.just(false);
}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval).timeout(timeout)
.doOnError(e -> log.error("Timeout waiting for action to complete", e)));
}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval)).timeout(timeout)
Copy link
Preview

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout is applied after repeatWhenEmpty, which means the timeout applies to the entire repeat sequence rather than individual checks. This could cause premature timeouts. Consider moving timeout inside the repeat or using takeUntil with a timer.

Suggested change
}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval)).timeout(timeout)
}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval))
.takeUntil(Mono.delay(timeout))

Copilot uses AI. Check for mistakes.

@kiryazovi-redis kiryazovi-redis requested a review from tishun July 22, 2025 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: task A general task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants