KAFKA-19571: Race condition between log segment flush and file deletion causing log dir to go offline #20289
+49
−0
JIRA ticket: https://issues.apache.org/jira/browse/KAFKA-19571

A race condition can occur during replica rebalancing: a log segment's file is deleted after an asynchronous flush has been scheduled but before that flush executes. Previously this caused an unhandled `ClosedChannelException`, which led the `ReplicaManager` to mark the entire log directory as offline.

The fix catches the `ClosedChannelException` within the `LogSegment.flush()` method and suppresses it only if the underlying log file no longer exists, which is the specific symptom of this race condition. Legitimate I/O errors on existing files are still thrown.

A unit test has been added to `LogSegmentTest` to verify both the fix and the case where the exception should still be thrown.
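The suppression rule described above can be sketched in isolation. This is a minimal illustration and not the code in this PR: the class `FlushRaceSketch` and the helper `flushUnlessDeleted` are hypothetical names, and the sketch works on a plain `FileChannel` and `Path` rather than on Kafka's `LogSegment`. It assumes the race shows up as a flush attempted on a channel that has already been closed.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushRaceSketch {

    // Hypothetical helper (not Kafka's LogSegment.flush()).
    // Returns true if the flush ran, false if it was skipped because
    // the channel was closed and the file is already gone.
    static boolean flushUnlessDeleted(FileChannel channel, Path file) throws IOException {
        try {
            channel.force(true); // throws ClosedChannelException on a closed channel
            return true;
        } catch (ClosedChannelException e) {
            if (Files.notExists(file)) {
                // File was deleted before the scheduled flush ran: nothing to persist.
                return false;
            }
            // File still exists, so this is not the deletion race: propagate.
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Case 1: channel closed and file deleted, so the exception is suppressed.
        Path deleted = Files.createTempFile("segment", ".log");
        FileChannel closedAndDeleted = FileChannel.open(deleted, StandardOpenOption.WRITE);
        closedAndDeleted.close();
        Files.delete(deleted);
        System.out.println("flushed deleted file: " + flushUnlessDeleted(closedAndDeleted, deleted));

        // Case 2: channel closed but file still present, so the exception is rethrown.
        Path existing = Files.createTempFile("segment", ".log");
        FileChannel closedOnly = FileChannel.open(existing, StandardOpenOption.WRITE);
        closedOnly.close();
        try {
            flushUnlessDeleted(closedOnly, existing);
        } catch (ClosedChannelException e) {
            System.out.println("existing file: ClosedChannelException rethrown");
        } finally {
            Files.delete(existing);
        }
    }
}
```

The two cases in `main` mirror the two behaviours the description says the new unit test covers: suppression when the file is gone, and propagation when it is not.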