Degradation under load after update to 2.12.2 (from 2.11.9) #7633

@MiloszKrajewski

Description

Observed behavior

We recently updated from NATS server 2.11.9 to 2.12.2 and are definitely seeing service degradation.

[Image: service degradation observed after the upgrade]

The most worrying symptom is the apparent multiplication of messages (see below): under load, extra copies seem to get generated.

It might be the same root cause as reported here: https://jepsen.io/analyses/nats-2.12.1

Background

We have an HTTP endpoint that receives high load at 5am when devices upload their logs. The upload handler is designed for maximum throughput:

app.Map("/upload", () => {
    var guid = Guid.NewGuid();
    var stream = request.Body;
    bucket.Put(guid, stream);          // store the raw log in the object bucket
    js.Publish("log.uploaded", guid);  // then publish an event for later processing
});

The handler accepts requests as quickly as possible, stores logs in an object bucket, and publishes an event for later processing.
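
For reference, in terms of the NATS.Net client this corresponds to roughly the following (a simplified sketch, not the exact production code; the "logs" bucket name, the minimal-API wiring and the missing error handling are all assumptions):

// Simplified sketch of the upload endpoint, assuming the NATS.Net 2.x client
// (NatsConnection / NatsJSContext / NatsObjContext). The "logs" bucket name
// and the minimal-API wiring are illustrative; error handling is omitted.
var nats = new NatsConnection();
var js = new NatsJSContext(nats);
var obj = new NatsObjContext(js);
var bucket = await obj.GetObjectStoreAsync("logs");

app.MapPost("/upload", async (HttpRequest request) =>
{
    var guid = Guid.NewGuid().ToString();

    // store the raw request body in the object bucket first...
    await bucket.PutAsync(guid, request.Body);

    // ...then publish the key so the log processor can pick it up later
    await js.PublishAsync("log.uploaded", guid);

    return Results.Ok(guid);
});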

On the consumer side, however, we take it slow: a log processor handles the log.uploaded events:

js.Subscribe("log.uploaded", message => {
    var guid = message.Data;   // object bucket key published by the upload handler
    var stream = new MemoryStream();
    // keep extending the ack deadline while the slow processing is in progress
    var loop = StartTimer(15 * 1000, () => message.AckProgress());
    bucket.Get(guid, stream);
    var log = ParseLog(stream);
    db.Put(log);
    bucket.Delete(guid);
    loop.Stop();
    js.Ack(message);
});

Key implementation details (see the sketch after this list):

  • A periodic AckProgress timer prevents message redelivery during processing
  • We limit the number of concurrent handlers to avoid overloading the database
  • Processing includes fetching from bucket, parsing, database insertion, and cleanup
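
In NATS.Net terms the processing loop looks roughly like this (again a simplified sketch rather than the production code; js and bucket are set up as in the upload sketch, ParseLog and db are the same placeholders as in the pseudocode above, and the concurrency limit is omitted for brevity):

// Simplified sketch of the log processor, assuming the NATS.Net 2.x client.
// Stream/consumer names are illustrative; the 15 s AckProgress interval and
// the delete-then-ack ordering mirror the pseudocode above.
var consumer = await js.GetConsumerAsync("LOGS", "log-processor");

await foreach (var msg in consumer.ConsumeAsync<string>())
{
    var guid = msg.Data!;   // object bucket key published by the upload handler

    // keep extending the ack deadline while the slow work is in progress
    using var progress = new Timer(
        _ => msg.AckProgressAsync().AsTask().Wait(),
        null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));

    var stream = new MemoryStream();
    await bucket.GetAsync(guid, stream);

    var log = ParseLog(stream);   // application-specific parsing
    await db.PutAsync(log);       // application-specific database insert

    await bucket.DeleteAsync(guid);
    await msg.AckAsync();
}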

Log Event Types

There is a lot of data to analyze, but I tried to extract as much as I could from the logs to identify patterns.

Throughout this report, we use the following event names (illustrated right after this list):

  • attempt - Processing started, first attempt or legitimate retry
  • redelivery - like attempt, but should not have happened (the message was already acked)
  • duplicate - like redelivery, but worse, indicates message duplication in the stream
  • saved - database insert completed successfully
  • success - processing completed and message acked
  • ignored - acked like success, but the object had already been deleted, so no work was actually done
  • error - processing failed with exception, next attempt is (rightfully) expected
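
To make the redelivery vs duplicate distinction concrete: roughly speaking, a redelivery re-presents the same stream sequence after we already acked it, while a duplicate is the same file showing up under a different stream sequence. A minimal illustration of that distinction (assuming the NATS.Net client and simple in-memory bookkeeping; this is not our actual tracing code):

// Illustration only (not our tracing code): one way to tell "redelivery"
// from "duplicate" using JetStream metadata, assuming the NATS.Net client.
// A redelivery re-presents the same stream sequence; a duplicate is the same
// file id arriving under a different stream sequence.
static string Classify(
    NatsJSMsg<string> msg,
    ISet<string> ackedFiles,                // file ids already acked
    ISet<(string File, ulong Seq)> seen)    // (file id, stream sequence) pairs seen so far
{
    var fileId = msg.Data ?? "";
    var streamSeq = msg.Metadata?.Sequence.Stream ?? 0;

    if (!ackedFiles.Contains(fileId))
        return "attempt";        // first attempt or legitimate retry
    if (seen.Contains((fileId, streamSeq)))
        return "redelivery";     // same message again, ack apparently lost
    return "duplicate";          // same file, new copy in the stream
}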

Problem

After upgrading to NATS 2.12.2, we started experiencing anomalies under load. While initial timeouts are expected, the situation escalates to quite uncomfortable levels.

Issue 1: Lost Acknowledgments

Reference file: 3edabbe9fa65fc01fe8d0c2a1c1092d8

span                host    timestamp                           event
--------------------------------------------------------------------------------
e4174614f7d210c8	zcv9m	2025-12-09 05:05:12.781561 +0000	attempt
e4174614f7d210c8	zcv9m	2025-12-09 05:05:16.464711 +0000	saved
e4174614f7d210c8	zcv9m	2025-12-09 05:05:19.295516 +0000	success
d4ff2f1eb7576b79	zcv9m	2025-12-09 05:09:22.128579 +0000	redelivery
d4ff2f1eb7576b79	zcv9m	2025-12-09 05:09:22.214666 +0000	ignored

The file was successfully processed and acknowledged (the success event at 05:05:19), yet the message was redelivered ~4 minutes later. The redelivery was correctly ignored, preventing further retries. While not ideal, occasional lost acknowledgments under load might be acceptable.

Issue 2: Message Duplication

Reference file: af1bb801d1ef4348be3b4b6a28aa0327

span                host    timestamp                           event
--------------------------------------------------------------------------------
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:43.346032 +0000	attempt
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:48.278157 +0000	saved
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:53.577229 +0000	success
68fcf85ab965195d	tdc4b	2025-12-09 05:09:20.559444 +0000	duplicate
77d96109380d527b	ckxj9	2025-12-09 05:09:20.735112 +0000	duplicate
68fcf85ab965195d	tdc4b	2025-12-09 05:09:20.735627 +0000	ignored
77d96109380d527b	ckxj9	2025-12-09 05:09:20.753071 +0000	ignored

After successful processing on j5h8j, two different hosts (tdc4b and ckxj9) began processing the same message simultaneously at 05:09:20. This is not a simple redelivery - both hosts started at nearly the same time, suggesting the message was duplicated in the stream.

Note that for this stream we use a single named/durable JetStream consumer.
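
For context, the durable consumer is created once with something like the following (a hedged sketch in NATS.Net terms; the stream name, consumer name and limits shown are illustrative, not the exact production values):

// Sketch of the single named/durable consumer definition, assuming the
// NATS.Net 2.x client. Names and limits are illustrative.
var consumer = await js.CreateOrUpdateConsumerAsync("LOGS", new ConsumerConfig
{
    DurableName = "log-processor",
    AckPolicy = ConsumerConfigAckPolicy.Explicit,
    AckWait = TimeSpan.FromSeconds(30),    // AckProgress keeps extending this during slow processing
    MaxDeliver = 5,
    MaxAckPending = 64,                    // caps in-flight messages across all processor hosts
});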

Extended Example

Reference file: fb3c00f6aa88ab3c93497dcf2f933a9c

span                host    timestamp                           event
--------------------------------------------------------------------------------
04216daaf8477973	ckxj9	2025-12-09 05:06:41.741895 +0000	attempt
04216daaf8477973	ckxj9	2025-12-09 05:06:48.642661 +0000	saved
04216daaf8477973	ckxj9	2025-12-09 05:06:53.535538 +0000	success
e575ac17244a71d1	f9v9h	2025-12-09 05:09:01.099885 +0000	redelivery
e575ac17244a71d1	f9v9h	2025-12-09 05:09:01.128765 +0000	ignored
4fcfd73341193f13	6rh45	2025-12-09 05:09:27.285313 +0000	duplicate
4fcfd73341193f13	6rh45	2025-12-09 05:09:27.286785 +0000	ignored
3bdf9b9890ea1a58	tdc4b	2025-12-09 05:09:27.354946 +0000	duplicate
3bdf9b9890ea1a58	tdc4b	2025-12-09 05:09:27.357696 +0000	ignored
86f3f13be192f787	txwx2	2025-12-09 05:09:27.563944 +0000	duplicate
86f3f13be192f787	txwx2	2025-12-09 05:09:27.579053 +0000	ignored

This shows escalation:

  1. Initial success at 05:06:53, whose acknowledgment was apparently lost
  2. A redelivery at 05:09:01, correctly ignored, whose acknowledgment was apparently lost again
  3. Three more hosts simultaneously attempting to process duplicated messages at 05:09:27

Issue 3: Message Multiplication After Errors

Reference file: ffc962e90d726f5ca524ae2f763a500f

span                host    timestamp                           event
--------------------------------------------------------------------------------
6ac1eead21567b87	9zdrt	2025-12-09 05:05:45.692145 +0000	attempt
6ac1eead21567b87	9zdrt	2025-12-09 05:05:48.259197 +0000	saved
6ac1eead21567b87	9zdrt	2025-12-09 05:05:54.088455 +0000	error
4ee2f2fa4fc65f37	wb7gg	2025-12-09 05:09:17.488001 +0000	attempt
4ee2f2fa4fc65f37	wb7gg	2025-12-09 05:09:18.512093 +0000	ignored
5d57854920255467	6zlnk	2025-12-09 05:09:26.398787 +0000	duplicate
5d57854920255467	6zlnk	2025-12-09 05:09:26.432748 +0000	ignored
b30e3585d184043a	f9v9h	2025-12-09 05:09:27.314732 +0000	duplicate
b30e3585d184043a	f9v9h	2025-12-09 05:09:27.316256 +0000	ignored
13215a392b8ab9df	txwx2	2025-12-09 05:09:27.327671 +0000	duplicate
13215a392b8ab9df	txwx2	2025-12-09 05:09:27.329408 +0000	ignored
82ea14a4b101fbc7	sbgvh	2025-12-09 05:09:28.249803 +0000	duplicate
82ea14a4b101fbc7	sbgvh	2025-12-09 05:09:28.252123 +0000	ignored
959565dc2bebd3ad	sbgvh	2025-12-09 05:09:28.319449 +0000	duplicate
959565dc2bebd3ad	sbgvh	2025-12-09 05:09:28.321834 +0000	ignored
8cdd38936e3a3407	j5h8j	2025-12-09 05:09:28.467998 +0000	duplicate
8cdd38936e3a3407	j5h8j	2025-12-09 05:09:28.469124 +0000	ignored
cdc6800d5dec46ee	sbgvh	2025-12-09 05:09:28.470661 +0000	duplicate
cdc6800d5dec46ee	sbgvh	2025-12-09 05:09:28.472738 +0000	ignored
e6059d7fb3e0ea42	zcv9m	2025-12-09 05:09:28.673221 +0000	duplicate
e6059d7fb3e0ea42	zcv9m	2025-12-09 05:09:28.675075 +0000	ignored

The initial processing at 05:05:54 encountered an error: the database insert and object deletion completed, but the acknowledgment timed out. One more attempt (05:09:17) is absolutely reasonable, but we also see eight simultaneous duplicates between 05:09:26 and 05:09:28.

Messages whose acknowledgment is lost appear to multiply: here a single failed ack turned into one legitimate retry plus eight duplicates.

Visualization

gantt
    title Message Processing Timeline (ffc962e90d726f5ca524ae2f763a500f)
    dateFormat HH:mm:ss.SSS
    axisFormat %H:%M:%S

    section 9zdrt
    6ac1 attempt            :6ac1_start, 05:05:45.692, 8396ms
    6ac1 saved              :milestone, 05:05:48.259, 0ms
    6ac1 ERROR              :crit, milestone, 05:05:54.088, 0ms

    section wb7gg
    4ee2 redelivery         :done, 4ee2_start, 05:09:17.488, 1024ms
    4ee2 ignored            :done, milestone, 05:09:18.512, 0ms

    section 6zlnk
    5d57 duplicate          :active, 5d57_start, 05:09:26.398, 34ms
    5d57 ignored            :active, milestone, 05:09:26.432, 0ms

    section f9v9h
    b30e duplicate          :active, b30e_start, 05:09:27.314, 2ms
    b30e ignored            :active, milestone, 05:09:27.316, 0ms

    section txwx2
    1321 duplicate          :active, 1321_start, 05:09:27.327, 2ms
    1321 ignored            :active, milestone, 05:09:27.329, 0ms

    section sbgvh
    82ea duplicate          :active, 82ea_start, 05:09:28.249, 2ms
    82ea ignored            :active, milestone, 05:09:28.252, 0ms
    9595 duplicate          :active, 9595_start, 05:09:28.319, 2ms
    9595 ignored            :active, milestone, 05:09:28.321, 0ms
    cdc6 duplicate          :active, cdc6_start, 05:09:28.470, 2ms
    cdc6 ignored            :active, milestone, 05:09:28.472, 0ms

    section j5h8j
    8cdd duplicate          :active, 8cdd_start, 05:09:28.467, 1ms
    8cdd ignored            :active, milestone, 05:09:28.469, 0ms

    section zcv9m
    e605 duplicate          :active, e605_start, 05:09:28.673, 2ms
    e605 ignored            :active, milestone, 05:09:28.675, 0ms

Above:

  • host 9zdrt (span 6ac1) nearly completes processing, but times out on the confirmation (ack)
  • host wb7gg (span 4ee2) tries again, only to find that everything has already been done

So far... acceptable, but then:

  • the message gets delivered to multiple hosts at the same time

Expected behavior

A lost ACK is explainable, but the spurious multiplication of messages is worrying.

Server and client version

NATS server: 2.12.2
NATS client: .NET 2.6.11

Host environment

K8S / AKS / ARM64

Steps to reproduce

That's a problem: we only see it in a system under very heavy load. So far I have not managed to reproduce it on a laptop. Possibly it needs multiple nodes with a simulated partition? I don't know.
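
If it helps anyone attempting a repro, the load shape is essentially "many small objects plus one event each, published as fast as possible". A hypothetical generator (bucket/subject names, payload size and message count are made up; in our case the anomalies only show up on a multi-node cluster under real load):

// Hypothetical load generator for reproduction attempts, assuming the
// NATS.Net 2.x client. Names, sizes and counts are made up; we only see the
// anomalies on a multi-node cluster under real 5am-scale load.
await using var nats = new NatsConnection(new NatsOpts { Url = "nats://localhost:4222" });
var js = new NatsJSContext(nats);
var obj = new NatsObjContext(js);
var bucket = await obj.GetObjectStoreAsync("logs");

var publishes = Enumerable.Range(0, 50_000).Select(async _ =>
{
    var guid = Guid.NewGuid().ToString();
    using var payload = new MemoryStream(new byte[64 * 1024]);   // fake 64 KiB log
    await bucket.PutAsync(guid, payload);
    await js.PublishAsync("log.uploaded", guid);
});

// deliberately unbounded concurrency to stress the server
await Task.WhenAll(publishes);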

Metadata

    Labels

    defect: Suspected defect such as a bug or regression
