CASSANDRA-20741: Add rebootstrap to Accord to enable a lagging node or a node that has been down for a longer period to rejoin the cluster #4227

Open
wants to merge 1 commit into
base: trunk
from

ifesdjeen
Contributor

No description provided.

… been down for a longer period to rejoin the cluster
@ifesdjeen ifesdjeen requested a review from belliottsmith July 4, 2025 08:35
- replayJournal(as);
+ // If we hit an error during journal replay, we need to mark ourselves unsafe to read, perform full data repair,
+ // and trigger RX before we can continue serving traffic
+ if (replayJournal(as) && ClusterMetadata.current().directory.allJoinedEndpoints().size() > 1)
Contributor

Is allJoinedEndpoints().size() > 1 enough? Don't we need a quorum (or some stronger user command insisting we proceed with a minority)?

Contributor Author

Oh, we definitely need that in order to succeed. Here, I am only checking whether we are in a single-node cluster, in which case rebootstrap simply cannot work.
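
To make the distinction in this thread concrete, here is a minimal sketch contrasting the single-node guard with a quorum check. `RebootstrapGuard` and its methods are hypothetical names and not part of the patch, and a simple majority of joined endpoints is an assumption that may not match how Accord actually computes its quorums.

```java
// Hypothetical helper for illustration only; none of these names exist in the patch.
final class RebootstrapGuard
{
    /** True when at least one other joined node exists to rebootstrap from. */
    static boolean hasPeers(int joinedEndpoints)
    {
        return joinedEndpoints > 1;
    }

    /** Size of a simple majority of the joined endpoints (assumed quorum definition). */
    static int quorumSize(int joinedEndpoints)
    {
        return joinedEndpoints / 2 + 1;
    }

    /** True when the reachable nodes, counting this one, form a simple majority. */
    static boolean hasQuorum(int reachableEndpoints, int joinedEndpoints)
    {
        return reachableEndpoints >= quorumSize(joinedEndpoints);
    }
}
```

In a three-node cluster with only this node reachable, `hasPeers(3)` is true while `hasQuorum(1, 3)` is false, which is the gap the review comment points at: the guard only rules out the single-node case and says nothing about whether rebootstrap can succeed.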

}

@VisibleForTesting
- public static void replayJournal(AccordService as)
+ public static boolean replayJournal(AccordService as)
Contributor

I think it is surprising for true to mean that replay failed. We should either use false to signal failure, or introduce an enum.

Contributor Author

Sure, good point. I inverted this and named the variable success instead.
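
For comparison, the enum alternative raised in this thread could look roughly like the sketch below. `ReplayResult`, `ReplayOutcomeSketch`, and `needsRebootstrap` are hypothetical names chosen for illustration; the patch itself went with an inverted boolean, so this is not the merged code.

```java
// Hypothetical sketch of the enum option; the patch uses a boolean named success instead.
final class ReplayOutcomeSketch
{
    /** Explicit outcome of journal replay, so call sites do not have to guess what true means. */
    enum ReplayResult { SUCCESS, FAILED }

    /**
     * Mirrors the condition under review: rebootstrap is only attempted when replay
     * failed and there is at least one other joined node to recover from.
     */
    static boolean needsRebootstrap(ReplayResult result, int joinedEndpoints)
    {
        return result == ReplayResult.FAILED && joinedEndpoints > 1;
    }
}
```

The enum costs a few extra lines but leaves room for further outcomes later without changing the method signature again.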

@belliottsmith
Contributor

Do we want to add a "clean shutdown" marker here too, for periodic mode, so we know we may need to rebootstrap via simple RX? Or leave it for a later patch?
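
One possible shape for such a marker, as a sketch only: a file written at the end of an orderly shutdown and consumed at startup. The class name, the `clean_shutdown` file name, and where it would hook into the journal lifecycle are all assumptions; nothing like this is in the patch as reviewed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; the file name and call sites are assumptions, not part of the patch.
final class CleanShutdownMarker
{
    private final Path marker;

    CleanShutdownMarker(Path directory)
    {
        this.marker = directory.resolve("clean_shutdown"); // hypothetical file name
    }

    /** Intended to be called at the end of an orderly shutdown, once the journal is flushed. */
    void markClean()
    {
        try
        {
            Files.write(marker, new byte[0]);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    /** Called at startup: reports whether the previous shutdown was clean and clears the marker. */
    boolean consume()
    {
        try
        {
            return Files.deleteIfExists(marker);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

If `consume()` returns false at startup, the previous run did not shut down cleanly, and in periodic mode that could be the signal to consider a rebootstrap. Clearing the marker on read means a later crash is not mistaken for a clean shutdown.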

@belliottsmith belliottsmith force-pushed the trunk branch 2 times, most recently from df3eb40 to 54e39a9 Compare July 23, 2025 11:19
2 participants