fix(inputs.tail): Prevent panic when closing tailers under load#19093
Draft
skartikey wants to merge 1 commit into
Draft
fix(inputs.tail): Prevent panic when closing tailers under load#19093skartikey wants to merge 1 commit into
skartikey wants to merge 1 commit into
Conversation
With multiple files, all tailers share a single semaphore that is sized to max_undelivered_lines and kept in lockstep with the tracking accumulator's delivery channel: a tailer must acquire a slot before adding a metric, and a slot is freed only when a metric is delivered. When a tailer closed while blocked adding a line, the receiver released a semaphore slot without a matching delivery. With other tailers still running, that freed slot let one of them add beyond the budget and overflow the delivery channel, which panics with "channel is full" and crashes the agent before plugin state (read offsets) is persisted. Drop the line on close without releasing the semaphore. The receiver still exits via the dying branch and keeps draining the tailer so it can stop, so the close-time deadlock guarded against previously stays fixed. resolves influxdata#19073
Contributor
|
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
All tailers share one semaphore (
t.sem) sized tomax_undelivered_lines, kept in lockstep with the tracking accumulator's delivery channel of the same size: a receiver must acquire a slot beforeAddTrackingMetricGroup, and a slot is freed only when the drain goroutine sees a delivery onDelivered(). This keeps the in-flight count at or below the delivery channel capacity.The
<-tailer.Dying()branch inreceiver(added in #15649 to avoid a close-time deadlock) released a semaphore slot without a corresponding delivery. With more than one tailer, another receiver that was blocked adding a line then took the freed slot and pushed the in-flight count past the budget. The next delivery overflowed the channel and hit thepanic("channel is full")guard in the tracking accumulator. A single-file setup cannot trigger it, which is why it only shows up on busy multi-file hosts.Fix
Drop the current line on close without releasing the semaphore. The receiver still exits through the dying branch and keeps draining
tailer.Linessotailer.Stop()can complete, so the deadlock #15649 fixed does not return. Data-loss behavior is unchanged: that branch already dropped the blocked line; it only additionally and incorrectly freed a budget slot.The
panicguard in the accumulator is left as is. It is an intentional programming-error signal; the defect was tail over-subscribing the budget.Checklist
Related issues
resolves #19073