KAFKA-19571: Race condition between log segment flush and file deletion causing log dir to go offline #20289

Open: wants to merge 1 commit into base: trunk

itoumlilt

Following JIRA Ticket: https://issues.apache.org/jira/browse/KAFKA-19571

A race condition can occur during replica rebalancing where a log segment's file is deleted after an asynchronous flush has been scheduled but before it executes.

Previously, this caused an unhandled ClosedChannelException, which led the ReplicaManager to mark the entire log directory as offline.

The fix catches the ClosedChannelException within the LogSegment.flush() method and suppresses it only if the underlying log file no longer exists, which is the specific symptom of this race condition. Legitimate I/O errors on existing files are still thrown.

A unit test has been added to LogSegmentTest to verify both the fix and the case where the exception should still be thrown.
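
For readers skimming the description, here is a minimal standalone sketch of the idea, not the actual diff in this PR. It uses a plain `FileChannel` rather than Kafka's `LogSegment`, and the names `FlushRaceSketch` and `flushTolerantOfDeletion` are hypothetical, introduced only for illustration. It assumes the race shows up as a flush on an already closed channel whose file has been deleted.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; this is not Kafka's LogSegment.
final class FlushRaceSketch {

    /**
     * Flushes the channel to disk.
     * Returns true if the flush ran, false if it was skipped because the
     * channel was closed AND the backing file no longer exists (the race).
     * A closed channel whose file still exists is rethrown as a real error.
     */
    static boolean flushTolerantOfDeletion(FileChannel channel, Path file) throws IOException {
        try {
            channel.force(true);
            return true;
        } catch (ClosedChannelException e) {
            if (Files.notExists(file)) {
                // File was deleted before the scheduled flush ran: nothing to flush.
                return false;
            }
            throw e; // File still exists, so this is not the deletion race.
        }
    }

    public static void main(String[] args) throws IOException {
        // Case 1: channel closed and file deleted, so the exception is suppressed.
        Path deleted = Files.createTempFile("segment", ".log");
        FileChannel ch1 = FileChannel.open(deleted, StandardOpenOption.WRITE);
        ch1.close();
        Files.delete(deleted);
        System.out.println("flushed (deleted file): " + flushTolerantOfDeletion(ch1, deleted));

        // Case 2: channel closed but file still present, so the exception propagates.
        Path present = Files.createTempFile("segment", ".log");
        FileChannel ch2 = FileChannel.open(present, StandardOpenOption.WRITE);
        ch2.close();
        try {
            flushTolerantOfDeletion(ch2, present);
            System.out.println("unexpected: no exception");
        } catch (ClosedChannelException e) {
            System.out.println("rethrown for existing file");
        } finally {
            Files.deleteIfExists(present);
        }
    }
}
```

The two cases in `main` mirror the two behaviors the unit test is described as covering: suppression when the file is gone, and propagation when it is not.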
github-actions bot added the triage (PRs from the community), storage (Pull requests that target the storage module), and small (Small PRs) labels on Aug 1, 2025
@itoumlilt
Author

#11438 swallowed the first NoSuchFileException WARN in the above stacktrace, but not the underlying exception.
#14280 is similar but not the same: it swallows NoSuchFileException for the race condition on log directory move/delete, but not at the segment file level.

github-actions bot commented Aug 9, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

Labels: needs-attention, small (Small PRs), storage (Pull requests that target the storage module), triage (PRs from the community)