
fix(inputs.tail): Prevent panic when closing tailers under load #19093

Draft
skartikey wants to merge 1 commit into influxdata:master from skartikey:fix/tail-channel-full-panic-19073



@skartikey

Contributor

Summary

All tailers share one semaphore (t.sem) sized to max_undelivered_lines, kept in lockstep with the tracking accumulator's delivery channel of the same size: a receiver must acquire a slot before AddTrackingMetricGroup, and a slot is freed only when the drain goroutine sees a delivery on Delivered(). This keeps the in-flight count at or below the delivery channel capacity.

The <-tailer.Dying() branch in receiver (added in #15649 to avoid a close-time deadlock) released a semaphore slot without a corresponding delivery. With more than one tailer, another receiver that was blocked adding a line then took the freed slot and pushed the in-flight count past the budget. The next delivery overflowed the channel and hit the panic("channel is full") guard in the tracking accumulator. A single-file setup cannot trigger it, which is why it only shows up on busy multi-file hosts.
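The accounting above can be checked with a small single-threaded model. This is only a sketch, not the plugin's code: `budget`, `tryAdd`, `deliver`, and `simulate` are hypothetical names, the budget of two lines is arbitrary, and the `false` return from `deliver` stands in for the accumulator's `panic("channel is full")` guard.

```go
package main

import "fmt"

// budget is a hypothetical stand-in for the shared semaphore and the
// accumulator's delivery channel; both have the same capacity.
type budget struct {
	sem       chan struct{} // one slot per undelivered line
	delivered chan struct{} // delivery notifications waiting for the drain loop
	inFlight  int           // lines added but not yet delivered
}

func newBudget(n int) *budget {
	return &budget{
		sem:       make(chan struct{}, n),
		delivered: make(chan struct{}, n),
	}
}

// tryAdd models a receiver taking a slot and adding a line.
// It returns false where a real receiver would block.
func (b *budget) tryAdd() bool {
	select {
	case b.sem <- struct{}{}:
		b.inFlight++
		return true
	default:
		return false
	}
}

// deliver models the output acknowledging one line. It returns false
// where the real guard would panic because the channel is full.
func (b *budget) deliver() bool {
	select {
	case b.delivered <- struct{}{}:
		b.inFlight--
		return true
	default:
		return false
	}
}

// simulate replays the close-under-load scenario with a budget of two.
func simulate(releaseOnClose bool) (peak int, overflow bool) {
	b := newBudget(2)
	b.tryAdd()
	b.tryAdd() // budget exhausted; further adds would block
	if releaseOnClose {
		<-b.sem // old close path: frees a slot although nothing was delivered
	}
	b.tryAdd() // another receiver retries; succeeds only if a slot was freed
	peak = b.inFlight
	// Every in-flight line is delivered before the drain loop gets to run.
	for b.inFlight > 0 {
		if !b.deliver() {
			return peak, true
		}
	}
	return peak, false
}

func main() {
	for _, release := range []bool{true, false} {
		peak, overflow := simulate(release)
		fmt.Printf("releaseOnClose=%v peak=%d overflow=%v\n", release, peak, overflow)
	}
}
```

With `releaseOnClose` set, the model reaches three in-flight lines against a capacity of two and the third delivery has nowhere to go. Without it, the count stays at two and every delivery fits.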

Fix

Drop the current line on close without releasing the semaphore. The receiver still exits through the dying branch and keeps draining tailer.Lines so tailer.Stop() can complete, so the deadlock that #15649 fixed does not return. Data-loss behavior is unchanged: that branch already dropped the blocked line; its only other effect was to free a budget slot it did not own.
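The control flow of that fix might look roughly like the loop below. It is a simplified sketch under stated assumptions, not the plugin's receiver: `receive`, `runClosedUnderLoad`, and the plain channels standing in for tailer.Lines, tailer.Dying() and t.sem are all hypothetical, and the real receiver also handles parsing, context cancellation and errors.

```go
package main

import "fmt"

// receive is a hypothetical receiver loop. While the tailer is dying and the
// budget is exhausted, it drops the current line and leaves sem untouched,
// but keeps reading lines so the producer can finish shutting down.
func receive(lines <-chan string, dying <-chan struct{}, sem chan struct{}, add func(string)) (dropped int) {
	for line := range lines {
		select {
		case <-dying:
			// Drop the line. No slot was acquired for it, so none is released.
			dropped++
		case sem <- struct{}{}:
			// A slot was acquired; it is freed later, on delivery.
			add(line)
		}
	}
	return dropped
}

// runClosedUnderLoad closes a tailer while another tailer's undelivered line
// holds the only budget slot.
func runClosedUnderLoad() (dropped, added, held int) {
	sem := make(chan struct{}, 1)
	sem <- struct{}{} // slot held by a line that has not been delivered yet

	lines := make(chan string, 3)
	for _, l := range []string{"a", "b", "c"} {
		lines <- l
	}
	close(lines)

	dying := make(chan struct{})
	close(dying) // the tailer is shutting down

	dropped = receive(lines, dying, sem, func(string) { added++ })
	return dropped, added, len(sem)
}

func main() {
	dropped, added, held := runClosedUnderLoad()
	fmt.Println(dropped, added, held) // prints "3 0 1"
}
```

All three pending lines are dropped, nothing is added, and the slot owned by the other tailer's line is still held afterwards, so the in-flight count cannot exceed the budget.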

The panic guard in the accumulator is left as is. It is an intentional programming-error signal; the defect was the tail plugin over-subscribing the budget.

Checklist

Related issues

resolves #19073

With multiple files, all tailers share a single semaphore that is sized
to max_undelivered_lines and kept in lockstep with the tracking
accumulator's delivery channel: a tailer must acquire a slot before
adding a metric, and a slot is freed only when a metric is delivered.

When a tailer closed while blocked adding a line, the receiver released
a semaphore slot without a matching delivery. With other tailers still
running, that freed slot let one of them add beyond the budget and
overflow the delivery channel, which panics with "channel is full" and
crashes the agent before plugin state (read offsets) is persisted.

Drop the line on close without releasing the semaphore. The receiver
still exits via the dying branch and keeps draining the tailer so it can
stop, so the close-time deadlock guarded against previously stays fixed.

resolves influxdata#19073
@telegraf-tiger telegraf-tiger Bot added the area/tail, fix, and plugin/input labels Jun 12, 2026
@skartikey skartikey marked this pull request as draft June 12, 2026 13:39

Labels

- area/tail
- fix: pr to fix corresponding bug
- plugin/input: 1. Request for new input plugins 2. Issues/PRs that are related to input plugins

Development

Successfully merging this pull request may close these issues.

Telegraf shutdown/stops with panic: channel is full
