Description
Observed behavior
We recently updated from NATS server 2.11.9 to 2.12.2, and we definitely see service degradation.
The most worrying part is the seemingly miraculous multiplication of messages (see below): under load, extra copies appear to get generated.
It might have the same root cause as reported here: https://jepsen.io/analyses/nats-2.12.1
Background
We have an HTTP endpoint that receives high load at 5am when devices upload their logs. The upload handler is designed for maximum throughput:
```csharp
// Pseudocode of the upload endpoint (simplified)
app.MapPost("/upload", async request =>
{
    var guid = Guid.NewGuid().ToString();
    var stream = request.Body;
    await bucket.PutAsync(guid, stream);          // store the raw log in the object bucket
    await js.PublishAsync("log.uploaded", guid);  // enqueue for later processing
});
```

The handler accepts requests as quickly as possible, stores logs in an object bucket, and publishes an event for later processing.
On the consumer side, however, we take it slow: a log processor handles the log.uploaded events:
```csharp
// Pseudocode of the log processor (simplified)
js.Subscribe("log.uploaded", message =>
{
    var guid = message.Data;           // payload is the object-store key
    var stream = new MemoryStream();
    // Extend the ack deadline every 15s while the message is being processed
    var loop = StartTimer(15 * 1000, () => message.AckProgress());
    bucket.Get(guid, stream);
    var log = ParseLog(stream);
    db.Put(log);
    bucket.Delete(guid);
    loop.Stop();
    js.Ack(message);
});
```

Key implementation details:
- A periodic `AckProgress` timer prevents message redelivery during processing (see the sketch after this list)
- We limit the number of concurrent handlers to avoid overloading the database
- Processing includes fetching from the bucket, parsing, database insertion, and cleanup
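For concreteness, here is a minimal sketch of how the first two points could be wired up with the NATS .NET v2 client. The actual code is more involved; `ProcessAsync`, the semaphore size, and the heartbeat interval are illustrative assumptions, not our exact implementation.

```csharp
using NATS.Client.JetStream;

// Sketch only: a semaphore caps concurrent handlers, and a PeriodicTimer
// sends AckProgress while a message is being processed.
var gate = new SemaphoreSlim(8); // illustrative concurrency limit

await foreach (var msg in consumer.ConsumeAsync<string>())
{
    await gate.WaitAsync();
    _ = Task.Run(async () =>
    {
        using var cts = new CancellationTokenSource();
        var heartbeat = Task.Run(async () =>
        {
            try
            {
                using var timer = new PeriodicTimer(TimeSpan.FromSeconds(15));
                while (await timer.WaitForNextTickAsync(cts.Token))
                    await msg.AckProgressAsync(); // extend the ack deadline
            }
            catch (OperationCanceledException) { /* processing finished */ }
        });
        try
        {
            await ProcessAsync(msg.Data!); // fetch, parse, insert, delete (hypothetical helper)
            await msg.AckAsync();
        }
        finally
        {
            cts.Cancel();
            await heartbeat;
            gate.Release();
        }
    });
}
```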
Log Event Types
There is a lot of data to analyze, but I tried to extract as much as possible from the logs to identify certain patterns.
Throughout this report, we use the following event names:
- `attempt` - processing started; first attempt or legitimate retry
- `redelivery` - like `attempt`, but should not happen (the message was already `ack`ed)
- `duplicate` - like `redelivery`, but worse; indicates message duplication in the stream
- `saved` - database insert completed successfully
- `success` - processing completed and message `ack`ed
- `ignored` - `ack`ed like `success`, but technically the object was already deleted
- `error` - processing failed with an exception; a further `attempt` is (rightfully) expected
Problem
After upgrading to NATS 2.12.2, we started experiencing anomalies under load. While initial timeouts are expected, the situation escalates to quite uncomfortable levels.
Issue 1: Lost Acknowledgments
Reference file: 3edabbe9fa65fc01fe8d0c2a1c1092d8
```
span             host  timestamp                        event
--------------------------------------------------------------------------------
e4174614f7d210c8 zcv9m 2025-12-09 05:05:12.781561 +0000 attempt
e4174614f7d210c8 zcv9m 2025-12-09 05:05:16.464711 +0000 saved
e4174614f7d210c8 zcv9m 2025-12-09 05:05:19.295516 +0000 success
d4ff2f1eb7576b79 zcv9m 2025-12-09 05:09:22.128579 +0000 redelivery
d4ff2f1eb7576b79 zcv9m 2025-12-09 05:09:22.214666 +0000 ignored
```
The file was successfully processed and acknowledged at success, yet the message was redelivered ~4 minutes later. The redelivery was correctly ignored, preventing further retries. While not ideal, occasional lost acknowledgments under load might be acceptable.
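For context on how an `ignored` event is produced: the handler treats a missing object as proof that an earlier attempt already completed, so it just acknowledges and returns. A rough sketch (treat the exact exception type as an assumption about the object store API, and `Log` as a hypothetical helper):

```csharp
// Guard at the top of the handler: if the object is already gone,
// a previous attempt finished the work, so ack and skip reprocessing.
try
{
    await bucket.GetAsync(guid, stream);
}
catch (NatsObjNotFoundException) // assumed not-found exception type
{
    Log("ignored", guid);        // Log(...) is a hypothetical helper
    await message.AckAsync();    // safe: the work was already done
    return;
}
```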
Issue 2: Message Duplication
Reference file: af1bb801d1ef4348be3b4b6a28aa0327
```
span             host  timestamp                        event
--------------------------------------------------------------------------------
62b955eb1c49a18e j5h8j 2025-12-09 05:05:43.346032 +0000 attempt
62b955eb1c49a18e j5h8j 2025-12-09 05:05:48.278157 +0000 saved
62b955eb1c49a18e j5h8j 2025-12-09 05:05:53.577229 +0000 success
68fcf85ab965195d tdc4b 2025-12-09 05:09:20.559444 +0000 duplicate
77d96109380d527b ckxj9 2025-12-09 05:09:20.735112 +0000 duplicate
68fcf85ab965195d tdc4b 2025-12-09 05:09:20.735627 +0000 ignored
77d96109380d527b ckxj9 2025-12-09 05:09:20.753071 +0000 ignored
```
After successful processing on j5h8j, two different hosts (tdc4b and ckxj9) began processing the same message simultaneously at 05:09:20. This is not a simple redelivery - both hosts started at nearly the same time, suggesting the message was duplicated in the stream.
Note that for this stream we use a single named/durable JetStream consumer.
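For reference, a single durable consumer like ours is created roughly as below. This is a sketch with the NATS .NET v2 client; the stream name, durable name, and limits are illustrative, not our exact settings.

```csharp
using NATS.Client.JetStream;
using NATS.Client.JetStream.Models;

// Sketch: one durable, explicit-ack consumer shared by all processor hosts.
var consumer = await js.CreateOrUpdateConsumerAsync("LOGS", new ConsumerConfig
{
    DurableName   = "log-processor",
    FilterSubject = "log.uploaded",
    AckPolicy     = ConsumerConfigAckPolicy.Explicit,
    AckWait       = TimeSpan.FromSeconds(30), // kept alive via AckProgress
    MaxDeliver    = 5,                        // illustrative limit
});
```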
Extended Example
Reference file: fb3c00f6aa88ab3c93497dcf2f933a9c
```
span             host  timestamp                        event
--------------------------------------------------------------------------------
04216daaf8477973 ckxj9 2025-12-09 05:06:41.741895 +0000 attempt
04216daaf8477973 ckxj9 2025-12-09 05:06:48.642661 +0000 saved
04216daaf8477973 ckxj9 2025-12-09 05:06:53.535538 +0000 success
e575ac17244a71d1 f9v9h 2025-12-09 05:09:01.099885 +0000 redelivery
e575ac17244a71d1 f9v9h 2025-12-09 05:09:01.128765 +0000 ignored
4fcfd73341193f13 6rh45 2025-12-09 05:09:27.285313 +0000 duplicate
4fcfd73341193f13 6rh45 2025-12-09 05:09:27.286785 +0000 ignored
3bdf9b9890ea1a58 tdc4b 2025-12-09 05:09:27.354946 +0000 duplicate
3bdf9b9890ea1a58 tdc4b 2025-12-09 05:09:27.357696 +0000 ignored
86f3f13be192f787 txwx2 2025-12-09 05:09:27.563944 +0000 duplicate
86f3f13be192f787 txwx2 2025-12-09 05:09:27.579053 +0000 ignored
```
This shows escalation:
- initial `success` with lost acknowledgment (05:06:53)
- one `redelivery`, correctly `ignored`, with the acknowledgment lost again (05:09:01)
- three more hosts simultaneously attempting to process duplicated messages at 05:09:27
Issue 3: Message Multiplication After Errors
Reference file: ffc962e90d726f5ca524ae2f763a500f
```
span             host  timestamp                        event
--------------------------------------------------------------------------------
6ac1eead21567b87 9zdrt 2025-12-09 05:05:45.692145 +0000 attempt
6ac1eead21567b87 9zdrt 2025-12-09 05:05:48.259197 +0000 saved
6ac1eead21567b87 9zdrt 2025-12-09 05:05:54.088455 +0000 error
4ee2f2fa4fc65f37 wb7gg 2025-12-09 05:09:17.488001 +0000 attempt
4ee2f2fa4fc65f37 wb7gg 2025-12-09 05:09:18.512093 +0000 ignored
5d57854920255467 6zlnk 2025-12-09 05:09:26.398787 +0000 duplicate
5d57854920255467 6zlnk 2025-12-09 05:09:26.432748 +0000 ignored
b30e3585d184043a f9v9h 2025-12-09 05:09:27.314732 +0000 duplicate
b30e3585d184043a f9v9h 2025-12-09 05:09:27.316256 +0000 ignored
13215a392b8ab9df txwx2 2025-12-09 05:09:27.327671 +0000 duplicate
13215a392b8ab9df txwx2 2025-12-09 05:09:27.329408 +0000 ignored
82ea14a4b101fbc7 sbgvh 2025-12-09 05:09:28.249803 +0000 duplicate
82ea14a4b101fbc7 sbgvh 2025-12-09 05:09:28.252123 +0000 ignored
959565dc2bebd3ad sbgvh 2025-12-09 05:09:28.319449 +0000 duplicate
959565dc2bebd3ad sbgvh 2025-12-09 05:09:28.321834 +0000 ignored
8cdd38936e3a3407 j5h8j 2025-12-09 05:09:28.467998 +0000 duplicate
8cdd38936e3a3407 j5h8j 2025-12-09 05:09:28.469124 +0000 ignored
cdc6800d5dec46ee sbgvh 2025-12-09 05:09:28.470661 +0000 duplicate
cdc6800d5dec46ee sbgvh 2025-12-09 05:09:28.472738 +0000 ignored
e6059d7fb3e0ea42 zcv9m 2025-12-09 05:09:28.673221 +0000 duplicate
e6059d7fb3e0ea42 zcv9m 2025-12-09 05:09:28.675075 +0000 ignored
```
The initial processing at 05:05:54 encountered an error: the database insert and object deletion completed, but the acknowledgment timed out. While one more attempt (05:09:17) is absolutely reasonable, we also see eight simultaneous duplicates between 05:09:26 and 05:09:28.
Messages that fail to acknowledge appear to multiply exponentially.
Visualization
```mermaid
gantt
title Message Processing Timeline (ffc962e90d726f5ca524ae2f763a500f)
dateFormat HH:mm:ss.SSS
axisFormat %H:%M:%S
section 9zdrt
6ac1 attempt :6ac1_start, 05:05:45.692, 8396ms
6ac1 saved :milestone, 05:05:48.259, 0ms
6ac1 ERROR :crit, milestone, 05:05:54.088, 0ms
section wb7gg
4ee2 redelivery :done, 4ee2_start, 05:09:17.488, 1024ms
4ee2 ignored :done, milestone, 05:09:18.512, 0ms
section 6zlnk
5d57 duplicate :active, 5d57_start, 05:09:26.398, 34ms
5d57 ignored :active, milestone, 05:09:26.432, 0ms
section f9v9h
b30e duplicate :active, b30e_start, 05:09:27.314, 2ms
b30e ignored :active, milestone, 05:09:27.316, 0ms
section txwx2
1321 duplicate :active, 1321_start, 05:09:27.327, 2ms
1321 ignored :active, milestone, 05:09:27.329, 0ms
section sbgvh
82ea duplicate :active, 82ea_start, 05:09:28.249, 2ms
82ea ignored :active, milestone, 05:09:28.252, 0ms
9595 duplicate :active, 9595_start, 05:09:28.319, 2ms
9595 ignored :active, milestone, 05:09:28.321, 0ms
cdc6 duplicate :active, cdc6_start, 05:09:28.470, 2ms
cdc6 ignored :active, milestone, 05:09:28.472, 0ms
section j5h8j
8cdd duplicate :active, 8cdd_start, 05:09:28.467, 1ms
8cdd ignored :active, milestone, 05:09:28.469, 0ms
section zcv9m
e605 duplicate :active, e605_start, 05:09:28.673, 2ms
e605 ignored :active, milestone, 05:09:28.675, 0ms
```
Above:
- span `6ac1` on machine `9zdrt` almost processes the event, however it times out on confirmation
- span `4ee2` on machine `wb7gg` tries again, only to find that everything has already been done

So far... acceptable, but then:
- the message gets delivered to multiple machines at the same time
Expected behavior
A lost ACK is explainable, but the spurious multiplication of messages feels worrying.
Server and client version
NATS server: 2.12.2
NATS client: .NET 2.6.11
Host environment
K8S / AKS / ARM64
Steps to reproduce
That's the problem: it only occurs in a system under very heavy load. So far I have not managed to reproduce it on a laptop. Possibly it needs multiple nodes with a simulated partition? I don't know.
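In case it helps anyone attempting a reproduction, the load shape is roughly: a burst of publishes at 5am, then consumers whose per-message processing time sits near the ack deadline. A throwaway sketch (subject, stream and consumer names, counts, and delays are arbitrary assumptions):

```csharp
using NATS.Client.Core;
using NATS.Client.JetStream;

await using var nats = new NatsConnection();
var js = new NatsJSContext(nats);

// Burst publish (count is arbitrary)
for (var i = 0; i < 100_000; i++)
    await js.PublishAsync("log.uploaded", Guid.NewGuid().ToString());

// Slow consume: delay chosen to hover near the consumer's AckWait
var consumer = await js.GetConsumerAsync("LOGS", "log-processor");
await foreach (var msg in consumer.ConsumeAsync<string>())
{
    await Task.Delay(TimeSpan.FromSeconds(25)); // near the ack deadline
    await msg.AckAsync();
}
```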