CASSANDRA-20741: Add rebootstrap to Accord to enable a lagging node or a node that has been down for a longer period to rejoin the cluster #4227

ifesdjeen · 2025-07-04T08:35:49Z

No description provided.


belliottsmith · 2025-07-04T08:58:32Z

src/java/org/apache/cassandra/service/accord/AccordService.java

-        replayJournal(as);
+        // If we hit an error during journal replay, we need to mark ourselves unsafe to read, perform full data repair,
+        // and trigger RX before we can continue serving traffic
+        if (replayJournal(as) && ClusterMetadata.current().directory.allJoinedEndpoints().size() > 1)


Is allJoinedEndpoints().size() > 1 enough? Don't we need a quorum (or some stronger user command insisting we proceed with a minority)?

Oh, we definitely need that in order to succeed. Here, I am just checking whether we are in a single-node cluster, in which case rebootstrap simply will not work.
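To make the distinction in this thread concrete, here is a minimal sketch of the two checks being discussed: "is there any peer at all" versus "is a quorum available". The class and method names are hypothetical and are not part of the patch, and the quorum rule is assumed to be a simple majority of the replication factor, which may well differ from what Accord actually requires.

```java
import java.util.Set;

// Hypothetical illustration only; none of these names exist in the patch.
final class RebootstrapPrecheckSketch {

    // The check in the diff: rebootstrap cannot work in a single-node cluster,
    // so require at least one other joined endpoint.
    static boolean hasPeers(Set<String> joinedEndpoints) {
        return joinedEndpoints.size() > 1;
    }

    // The stronger condition raised in review, assumed here to be a simple
    // majority of the replication factor (an assumption, not Accord's rule).
    static boolean hasQuorum(int liveReplicas, int replicationFactor) {
        return liveReplicas >= replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        Set<String> joined = Set.of("10.0.0.1", "10.0.0.2", "10.0.0.3");
        System.out.println("has peers: " + hasPeers(joined));
        System.out.println("quorum with 1 of 3 live: " + hasQuorum(1, 3));
    }
}
```

In this sketch a three-node cluster with only one live replica passes the peer check but fails the quorum check, which is the gap the question points at.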

belliottsmith · 2025-07-04T09:00:09Z

src/java/org/apache/cassandra/service/accord/AccordService.java

    }

    @VisibleForTesting
-    public static void replayJournal(AccordService as)
+    public static boolean replayJournal(AccordService as)


I think it is surprising for true to mean replay failed. We should either use false or introduce an enum.

Sure, good point. I inverted this and called the variable success instead.
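For reference, the enum alternative suggested above might look roughly like the following. This is a hedged sketch, not what the patch does (the patch kept a boolean and inverted it): ReplayOutcome, replay and needsRebootstrap are hypothetical names, and a Runnable stands in for the real journal replay.

```java
// Hypothetical illustration only; the patch itself returns a boolean "success".
final class ReplayOutcomeSketch {

    enum ReplayOutcome { SUCCESS, FAILED }

    // Stand-in for replayJournal: any runtime failure during replay is
    // reported as FAILED instead of an ambiguous true/false.
    static ReplayOutcome replay(Runnable journalReplay) {
        try {
            journalReplay.run();
            return ReplayOutcome.SUCCESS;
        } catch (RuntimeException e) {
            return ReplayOutcome.FAILED;
        }
    }

    // Caller-side decision from the first hunk: rebootstrap only after a failed
    // replay, and only when the cluster has more than one joined endpoint.
    static boolean needsRebootstrap(ReplayOutcome outcome, int joinedEndpoints) {
        return outcome == ReplayOutcome.FAILED && joinedEndpoints > 1;
    }

    public static void main(String[] args) {
        ReplayOutcome outcome = replay(() -> { throw new IllegalStateException("corrupt segment"); });
        System.out.println(outcome + ", rebootstrap: " + needsRebootstrap(outcome, 3));
    }
}
```

With an enum the call site reads as outcome == FAILED, so there is no question of which boolean value means failure.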

belliottsmith · 2025-07-04T09:00:53Z

Do we want to add a "clean shutdown" marker here too, for periodic mode, so we know we may need to rebootstrap via simple RX? Or leave it for a later patch?
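One possible shape for such a marker, purely as a sketch: write an empty file on orderly shutdown and consume it on the next startup, treating its absence as "the previous shutdown may have been unclean". The class name, file name and directory layout below are assumptions for illustration and do not come from the patch or from Cassandra.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; file name and location are assumptions.
final class CleanShutdownMarkerSketch {

    private final Path marker;

    CleanShutdownMarkerSketch(Path directory) {
        this.marker = directory.resolve("clean_shutdown.marker");
    }

    // Called at the very end of an orderly shutdown. A real implementation
    // would also need to sync the file and its directory to disk.
    void markCleanShutdown() throws IOException {
        Files.write(marker, new byte[0]);
    }

    // Called once on startup: returns true if the previous run shut down
    // cleanly, and removes the marker so a later crash is detected.
    boolean consumeMarkerOnStartup() throws IOException {
        return Files.deleteIfExists(marker);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("accord-marker-sketch");
        CleanShutdownMarkerSketch sketch = new CleanShutdownMarkerSketch(dir);
        sketch.markCleanShutdown();
        System.out.println("previous shutdown clean: " + sketch.consumeMarkerOnStartup());
    }
}
```

A startup that finds no marker would then be the signal, in periodic mode, that recent writes may be missing and a rebootstrap could be needed.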

Add rebootstrap to Accord to enable a lagging node or a node that has been down for a longer period to rejoin the cluster

a5f444a

ifesdjeen requested a review from belliottsmith July 4, 2025 08:35

belliottsmith reviewed Jul 4, 2025


belliottsmith force-pushed the trunk branch 2 times, most recently from df3eb40 to 54e39a9 July 23, 2025 11:19
