Degradation under load after update to 2.12.2 (from 2.11.9) #7633

@MiloszKrajewski

Description

Observed behavior

We recently updated from NATS server 2.11.9 to 2.12.2 and are definitely seeing service degradation.

[Image: service degradation observed after the upgrade]

The most worrying symptom is the apparent multiplication of messages (see below): under load, extra copies seem to get generated.

It might be the same root cause as reported here: https://jepsen.io/analyses/nats-2.12.1

Background

We have an HTTP endpoint that receives high load at 5am when devices upload their logs. The upload handler is designed for maximum throughput:

app.Map("/upload", () => {
    var guid = Guid.NewGuid();
    var stream = request.Body;
    bucket.Put(guid, stream);          // store the raw log in the object bucket
    js.Publish("log.uploaded", guid);  // then publish an event for later processing
});

The handler accepts requests as quickly as possible, stores logs in an object bucket, and publishes an event for later processing.
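
For reference, in terms of the NATS.Net client this corresponds to roughly the following (a simplified sketch, not the exact production code; the "logs" bucket name, the minimal-API wiring and the missing error handling are all assumptions):

// Simplified sketch of the upload endpoint, assuming the NATS.Net 2.x client
// (NatsConnection / NatsJSContext / NatsObjContext). The "logs" bucket name
// and the minimal-API wiring are illustrative; error handling is omitted.
var nats = new NatsConnection();
var js = new NatsJSContext(nats);
var obj = new NatsObjContext(js);
var bucket = await obj.GetObjectStoreAsync("logs");

app.MapPost("/upload", async (HttpRequest request) =>
{
    var guid = Guid.NewGuid().ToString();

    // store the raw request body in the object bucket first...
    await bucket.PutAsync(guid, request.Body);

    // ...then publish the key so the log processor can pick it up later
    await js.PublishAsync("log.uploaded", guid);

    return Results.Ok(guid);
});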

On the consumer side, however, we take it slow: a log processor handles the log.uploaded events:

js.Subscribe("log.uploaded", message => {
    var guid = message.Data;   // object bucket key published by the upload handler
    var stream = new MemoryStream();
    // keep extending the ack deadline while the slow processing is in progress
    var loop = StartTimer(15 * 1000, () => message.AckProgress());
    bucket.Get(guid, stream);
    var log = ParseLog(stream);
    db.Put(log);
    bucket.Delete(guid);
    loop.Stop();
    js.Ack(message);
});

Key implementation details (see the sketch after this list):

  • A periodic AckProgress timer prevents message redelivery during processing
  • We limit the number of concurrent handlers to avoid overloading the database
  • Processing includes fetching from bucket, parsing, database insertion, and cleanup
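
In NATS.Net terms the processing loop looks roughly like this (again a simplified sketch rather than the production code; js and bucket are set up as in the upload sketch, ParseLog and db are the same placeholders as in the pseudocode above, and the concurrency limit is omitted for brevity):

// Simplified sketch of the log processor, assuming the NATS.Net 2.x client.
// Stream/consumer names are illustrative; the 15 s AckProgress interval and
// the delete-then-ack ordering mirror the pseudocode above.
var consumer = await js.GetConsumerAsync("LOGS", "log-processor");

await foreach (var msg in consumer.ConsumeAsync<string>())
{
    var guid = msg.Data!;   // object bucket key published by the upload handler

    // keep extending the ack deadline while the slow work is in progress
    using var progress = new Timer(
        _ => msg.AckProgressAsync().AsTask().Wait(),
        null, TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));

    var stream = new MemoryStream();
    await bucket.GetAsync(guid, stream);

    var log = ParseLog(stream);   // application-specific parsing
    await db.PutAsync(log);       // application-specific database insert

    await bucket.DeleteAsync(guid);
    await msg.AckAsync();
}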

Log Event Types

There is a lot of data to analyze, but I tried to extract as much as I could from the logs to identify patterns.

Throughout this report, we use the following event names (illustrated right after this list):

  • attempt - Processing started, first attempt or legitimate retry
  • redelivery - like attempt, but should not have happened (the message was already acked)
  • duplicate - like redelivery, but worse, indicates message duplication in the stream
  • saved - database insert completed successfully
  • success - processing completed and message acked
  • ignored - acked like success, but the object had already been deleted, so no work was actually done
  • error - processing failed with exception, next attempt is (rightfully) expected
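
To make the redelivery vs duplicate distinction concrete: roughly speaking, a redelivery re-presents the same stream sequence after we already acked it, while a duplicate is the same file showing up under a different stream sequence. A minimal illustration of that distinction (assuming the NATS.Net client and simple in-memory bookkeeping; this is not our actual tracing code):

// Illustration only (not our tracing code): one way to tell "redelivery"
// from "duplicate" using JetStream metadata, assuming the NATS.Net client.
// A redelivery re-presents the same stream sequence; a duplicate is the same
// file id arriving under a different stream sequence.
static string Classify(
    NatsJSMsg<string> msg,
    ISet<string> ackedFiles,                // file ids already acked
    ISet<(string File, ulong Seq)> seen)    // (file id, stream sequence) pairs seen so far
{
    var fileId = msg.Data ?? "";
    var streamSeq = msg.Metadata?.Sequence.Stream ?? 0;

    if (!ackedFiles.Contains(fileId))
        return "attempt";        // first attempt or legitimate retry
    if (seen.Contains((fileId, streamSeq)))
        return "redelivery";     // same message again, ack apparently lost
    return "duplicate";          // same file, new copy in the stream
}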

Problem

After upgrading to NATS 2.12.2, we started experiencing anomalies under load. While initial timeouts are expected, the situation escalates to quite uncomfortable levels.

Issue 1: Lost Acknowledgments

Reference file: 3edabbe9fa65fc01fe8d0c2a1c1092d8

span                host    timestamp                           event
--------------------------------------------------------------------------------
e4174614f7d210c8	zcv9m	2025-12-09 05:05:12.781561 +0000	attempt
e4174614f7d210c8	zcv9m	2025-12-09 05:05:16.464711 +0000	saved
e4174614f7d210c8	zcv9m	2025-12-09 05:05:19.295516 +0000	success
d4ff2f1eb7576b79	zcv9m	2025-12-09 05:09:22.128579 +0000	redelivery
d4ff2f1eb7576b79	zcv9m	2025-12-09 05:09:22.214666 +0000	ignored

The file was successfully processed and acknowledged (the success event at 05:05:19), yet the message was redelivered ~4 minutes later. The redelivery was correctly ignored, preventing further retries. While not ideal, occasional lost acknowledgments under load might be acceptable.

Issue 2: Message Duplication

Reference file: af1bb801d1ef4348be3b4b6a28aa0327

span                host    timestamp                           event
--------------------------------------------------------------------------------
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:43.346032 +0000	attempt
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:48.278157 +0000	saved
62b955eb1c49a18e	j5h8j	2025-12-09 05:05:53.577229 +0000	success
68fcf85ab965195d	tdc4b	2025-12-09 05:09:20.559444 +0000	duplicate
77d96109380d527b	ckxj9	2025-12-09 05:09:20.735112 +0000	duplicate
68fcf85ab965195d	tdc4b	2025-12-09 05:09:20.735627 +0000	ignored
77d96109380d527b	ckxj9	2025-12-09 05:09:20.753071 +0000	ignored

After successful processing on j5h8j, two different hosts (tdc4b and ckxj9) began processing the same message simultaneously at 05:09:20. This is not a simple redelivery - both hosts started at nearly the same time, suggesting the message was duplicated in the stream.

Note that for this stream we use a single named/durable JetStream consumer.
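
For context, the durable consumer is created once with something like the following (a hedged sketch in NATS.Net terms; the stream name, consumer name and limits shown are illustrative, not the exact production values):

// Sketch of the single named/durable consumer definition, assuming the
// NATS.Net 2.x client. Names and limits are illustrative.
var consumer = await js.CreateOrUpdateConsumerAsync("LOGS", new ConsumerConfig
{
    DurableName = "log-processor",
    AckPolicy = ConsumerConfigAckPolicy.Explicit,
    AckWait = TimeSpan.FromSeconds(30),    // AckProgress keeps extending this during slow processing
    MaxDeliver = 5,
    MaxAckPending = 64,                    // caps in-flight messages across all processor hosts
});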

Extended Example

Reference file: fb3c00f6aa88ab3c93497dcf2f933a9c

span                host    timestamp                           event
--------------------------------------------------------------------------------
04216daaf8477973	ckxj9	2025-12-09 05:06:41.741895 +0000	attempt
04216daaf8477973	ckxj9	2025-12-09 05:06:48.642661 +0000	saved
04216daaf8477973	ckxj9	2025-12-09 05:06:53.535538 +0000	success
e575ac17244a71d1	f9v9h	2025-12-09 05:09:01.099885 +0000	redelivery
e575ac17244a71d1	f9v9h	2025-12-09 05:09:01.128765 +0000	ignored
4fcfd73341193f13	6rh45	2025-12-09 05:09:27.285313 +0000	duplicate
4fcfd73341193f13	6rh45	2025-12-09 05:09:27.286785 +0000	ignored
3bdf9b9890ea1a58	tdc4b	2025-12-09 05:09:27.354946 +0000	duplicate
3bdf9b9890ea1a58	tdc4b	2025-12-09 05:09:27.357696 +0000	ignored
86f3f13be192f787	txwx2	2025-12-09 05:09:27.563944 +0000	duplicate
86f3f13be192f787	txwx2	2025-12-09 05:09:27.579053 +0000	ignored

This shows escalation:

  1. Initial success at 05:06:53, whose acknowledgment was apparently lost
  2. A redelivery at 05:09:01, correctly ignored, whose acknowledgment was apparently lost again
  3. Three more hosts simultaneously attempting to process duplicated messages at 05:09:27

Issue 3: Message Multiplication After Errors

Reference file: ffc962e90d726f5ca524ae2f763a500f

span                host    timestamp                           event
--------------------------------------------------------------------------------
6ac1eead21567b87	9zdrt	2025-12-09 05:05:45.692145 +0000	attempt
6ac1eead21567b87	9zdrt	2025-12-09 05:05:48.259197 +0000	saved
6ac1eead21567b87	9zdrt	2025-12-09 05:05:54.088455 +0000	error
4ee2f2fa4fc65f37	wb7gg	2025-12-09 05:09:17.488001 +0000	attempt
4ee2f2fa4fc65f37	wb7gg	2025-12-09 05:09:18.512093 +0000	ignored
5d57854920255467	6zlnk	2025-12-09 05:09:26.398787 +0000	duplicate
5d57854920255467	6zlnk	2025-12-09 05:09:26.432748 +0000	ignored
b30e3585d184043a	f9v9h	2025-12-09 05:09:27.314732 +0000	duplicate
b30e3585d184043a	f9v9h	2025-12-09 05:09:27.316256 +0000	ignored
13215a392b8ab9df	txwx2	2025-12-09 05:09:27.327671 +0000	duplicate
13215a392b8ab9df	txwx2	2025-12-09 05:09:27.329408 +0000	ignored
82ea14a4b101fbc7	sbgvh	2025-12-09 05:09:28.249803 +0000	duplicate
82ea14a4b101fbc7	sbgvh	2025-12-09 05:09:28.252123 +0000	ignored
959565dc2bebd3ad	sbgvh	2025-12-09 05:09:28.319449 +0000	duplicate
959565dc2bebd3ad	sbgvh	2025-12-09 05:09:28.321834 +0000	ignored
8cdd38936e3a3407	j5h8j	2025-12-09 05:09:28.467998 +0000	duplicate
8cdd38936e3a3407	j5h8j	2025-12-09 05:09:28.469124 +0000	ignored
cdc6800d5dec46ee	sbgvh	2025-12-09 05:09:28.470661 +0000	duplicate
cdc6800d5dec46ee	sbgvh	2025-12-09 05:09:28.472738 +0000	ignored
e6059d7fb3e0ea42	zcv9m	2025-12-09 05:09:28.673221 +0000	duplicate
e6059d7fb3e0ea42	zcv9m	2025-12-09 05:09:28.675075 +0000	ignored

The initial processing at 05:05:54 encountered an error: the database insert and object deletion completed, but the acknowledgment timed out. One more attempt (05:09:17) is absolutely reasonable, but we also see eight simultaneous duplicates between 05:09:26 and 05:09:28.

Messages whose acknowledgment is lost appear to multiply: here a single failed ack turned into one legitimate retry plus eight duplicates.

Visualization

gantt
    title Message Processing Timeline (ffc962e90d726f5ca524ae2f763a500f)
    dateFormat HH:mm:ss.SSS
    axisFormat %H:%M:%S

    section 9zdrt
    6ac1 attempt            :6ac1_start, 05:05:45.692, 8396ms
    6ac1 saved              :milestone, 05:05:48.259, 0ms
    6ac1 ERROR              :crit, milestone, 05:05:54.088, 0ms

    section wb7gg
    4ee2 redelivery         :done, 4ee2_start, 05:09:17.488, 1024ms
    4ee2 ignored            :done, milestone, 05:09:18.512, 0ms

    section 6zlnk
    5d57 duplicate          :active, 5d57_start, 05:09:26.398, 34ms
    5d57 ignored            :active, milestone, 05:09:26.432, 0ms

    section f9v9h
    b30e duplicate          :active, b30e_start, 05:09:27.314, 2ms
    b30e ignored            :active, milestone, 05:09:27.316, 0ms

    section txwx2
    1321 duplicate          :active, 1321_start, 05:09:27.327, 2ms
    1321 ignored            :active, milestone, 05:09:27.329, 0ms

    section sbgvh
    82ea duplicate          :active, 82ea_start, 05:09:28.249, 2ms
    82ea ignored            :active, milestone, 05:09:28.252, 0ms
    9595 duplicate          :active, 9595_start, 05:09:28.319, 2ms
    9595 ignored            :active, milestone, 05:09:28.321, 0ms
    cdc6 duplicate          :active, cdc6_start, 05:09:28.470, 2ms
    cdc6 ignored            :active, milestone, 05:09:28.472, 0ms

    section j5h8j
    8cdd duplicate          :active, 8cdd_start, 05:09:28.467, 1ms
    8cdd ignored            :active, milestone, 05:09:28.469, 0ms

    section zcv9m
    e605 duplicate          :active, e605_start, 05:09:28.673, 2ms
    e605 ignored            :active, milestone, 05:09:28.675, 0ms

Above:

  • host 9zdrt (span 6ac1) nearly completes processing, but times out on the confirmation (ack)
  • host wb7gg (span 4ee2) tries again, only to find that everything has already been done

So far... acceptable, but then:

  • the message gets delivered to multiple hosts at the same time

Expected behavior

A lost ACK is explainable, but the spurious multiplication of messages is worrying.

Server and client version

NATS server: 2.12.2
NATS client: .NET 2.6.11

Host environment

K8S / AKS / ARM64

Steps to reproduce

That's a problem: we only see it in a system under very heavy load. So far I have not managed to reproduce it on a laptop. Possibly it needs multiple nodes with a simulated partition? I don't know.
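
If it helps anyone attempting a repro, the load shape is essentially "many small objects plus one event each, published as fast as possible". A hypothetical generator (bucket/subject names, payload size and message count are made up; in our case the anomalies only show up on a multi-node cluster under real load):

// Hypothetical load generator for reproduction attempts, assuming the
// NATS.Net 2.x client. Names, sizes and counts are made up; we only see the
// anomalies on a multi-node cluster under real 5am-scale load.
await using var nats = new NatsConnection(new NatsOpts { Url = "nats://localhost:4222" });
var js = new NatsJSContext(nats);
var obj = new NatsObjContext(js);
var bucket = await obj.GetObjectStoreAsync("logs");

var publishes = Enumerable.Range(0, 50_000).Select(async _ =>
{
    var guid = Guid.NewGuid().ToString();
    using var payload = new MemoryStream(new byte[64 * 1024]);   // fake 64 KiB log
    await bucket.PutAsync(guid, payload);
    await js.PublishAsync("log.uploaded", guid);
});

// deliberately unbounded concurrency to stress the server
await Task.WhenAll(publishes);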

Metadata

    Labels

    defect: Suspected defect such as a bug or regression
