Track more snapshot-related node-level stats #130301

nicktindall · 2025-06-30T01:05:05Z

Adds additional snapshot metrics and publishes them via APM

Apologies for the size of this change, but most of it is plumbing. The change itself is quite small.

Relates: ES-12055, ES-11927

DaveCTurner

Looks good (after wading through all the plumbing changes!). I left some comments inline.

DaveCTurner · 2025-07-08T07:19:40Z

server/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

@@ -191,6 +191,10 @@ public Stage getStage() {
        return stage.get();
    }

+    public long getTotalTime() {


I know this is the name of the field but could we include the unit (millis?) in the name of this getter at least?

Done in 4a8fb9d

DaveCTurner · 2025-07-08T07:23:25Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesStats.java

+                    in.readLong(),
+                    in.readLong(),
+                    in.readLong(),
+                    totalReadThrottledNanos,
+                    totalWriteThrottledNanos,
+                    in.readLong(),
+                    in.readLong(),
+                    in.readLong(),
+                    in.readLong()


I'd expect these to be represented using VLong rather than a bare long, since mostly they're going to be quite close to zero. Though that means we can't just use -1 to mean "missing". Maybe we should also include a boolean up front to indicate that we only include the legacy throttling stats?

Fixed in e344693, I just put zeroes for the BWC as I imagine it's fairly intermittent.
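
To illustrate the trade-off discussed above, here is a minimal standalone sketch of a generic 7-bits-per-byte variable-length encoding for non-negative longs. It is not the Elasticsearch `StreamOutput` code; the class and method names (`VarLongSketch`, `encode`, `decode`) are hypothetical. It shows why near-zero counters encode compactly and why a negative sentinel such as -1 does not fit this kind of scheme.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: a generic variable-length encoding for non-negative longs. */
final class VarLongSketch {

    /** Encodes low 7 bits first; the high bit of each byte means "more bytes follow". */
    static byte[] encode(long value) {
        if (value < 0) {
            // This is why -1 cannot serve as a "missing" marker in such a scheme.
            throw new IllegalArgumentException("negative values are not supported: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(long). */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (long v : new long[] { 0L, 100L, 200L, 1_000_000L }) {
            System.out.println(v + " -> " + encode(v).length + " byte(s), versus 8 for a fixed-width long");
        }
    }
}
```

With this encoding a zero written for backwards compatibility costs one byte, whereas a fixed-width long costs eight regardless of its value.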

DaveCTurner · 2025-07-08T07:28:47Z

server/src/main/java/org/elasticsearch/repositories/Repository.java

+    default LongWithAttributes getShardSnapshotsInProgress() {
+        return null;


I think I'd rather not have a default implementation here and instead just require all the non-blobstore repository implementations to return null explicitly. But also all the non-blobstore repository implementations are read-only, so they can reasonably return 0 here.

Fixed in ff82158
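
A rough sketch of the design point above, using hypothetical stand-in types rather than the real `Repository` interface or `LongWithAttributes`: with no default method, the compiler forces each implementation to state its own answer, and a read-only implementation can truthfully return zero.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the repository interface; not the real Elasticsearch type. */
interface RepositorySketch {
    // No default body: every implementation has to decide what to report.
    long shardSnapshotsInProgress();
}

/** A read-only repository never takes snapshots, so zero is an accurate answer. */
final class ReadOnlyRepositorySketch implements RepositorySketch {
    @Override
    public long shardSnapshotsInProgress() {
        return 0L;
    }
}

/** A writable repository reports a count that it maintains itself. */
final class WritableRepositorySketch implements RepositorySketch {
    private final AtomicLong inProgress = new AtomicLong();

    void shardSnapshotStarted() {
        inProgress.incrementAndGet();
    }

    void shardSnapshotFinished() {
        inProgress.decrementAndGet();
    }

    @Override
    public long shardSnapshotsInProgress() {
        return inProgress.get();
    }
}
```

The cost is a few extra lines per implementation; the benefit is that a new implementation cannot silently inherit a placeholder value.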

DaveCTurner · 2025-07-08T07:29:55Z

server/src/main/java/org/elasticsearch/repositories/Repository.java

+    default RepositoriesStats.SnapshotStats getSnapshotStats() {
+        return new RepositoriesStats.SnapshotStats(getRestoreThrottleTimeInNanos(), getSnapshotThrottleTimeInNanos());


Likewise here, can we not have a default implementation and instead push this down to the subclasses? We should be able to drop getRestoreThrottleTimeInNanos() and getSnapshotThrottleTimeInNanos() from the Repository interface with this change.

Done in 0d9a62d

server/src/main/java/org/elasticsearch/repositories/SnapshotShardContext.java

server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreSnapshotMetrics.java

DaveCTurner · 2025-07-08T07:45:24Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+     * {@link SnapshotInProgressAllocationDecider}, or states that might delay
+     * a snapshot's completion.
+     */
+    private static final List<ShardState> TRACKED_SHARD_STATES = List.of(


Is there a good reason for not just tracking all the states? I can see some value in knowing about completed shard snapshots too (e.g. to investigate delays in snapshot finalization)

Yeah not really, I was going for a minimal "interesting" set, but happy to enable them all.

Enabled in e9f32a0

DaveCTurner · 2025-07-08T07:49:09Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+                    for (SnapshotsInProgress.Entry snapshot : snapshotsInProgress.forRepo(projectId, repository.name())) {
+                        for (ShardSnapshotStatus shardSnapshotStatus : snapshot.shards().values()) {
+                            if (shardCounts.containsKey(shardSnapshotStatus.state())) {
+                                shardCounts.put(shardSnapshotStatus.state(), shardCounts.get(shardSnapshotStatus.state()) + 1);


suggest using com.carrotsearch.hppc.ObjectIntMap#addTo (saves looking up the entry twice)

Fixed in ba62ff6
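
For readers without hppc on the classpath, the same idea can be sketched with the JDK's `Map.merge`, which likewise replaces the separate `containsKey`/`get`/`put` calls with a single call per element. This is an analogue under assumed names (`ShardStateCountSketch` and its `State` enum are hypothetical), not the code from the commit, and the first method is a simplified version of the original loop.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class ShardStateCountSketch {

    /** Hypothetical subset of states, for illustration only. */
    enum State { INIT, SUCCESS, FAILED }

    /** Lookup-heavy shape: containsKey, then get, then put for each element. */
    static Map<State, Integer> countWithGetAndPut(List<State> states) {
        Map<State, Integer> counts = new EnumMap<>(State.class);
        for (State s : states) {
            if (counts.containsKey(s)) {
                counts.put(s, counts.get(s) + 1);
            } else {
                counts.put(s, 1);
            }
        }
        return counts;
    }

    /** Single-call increment: merge inserts 1, or adds 1 to the existing count. */
    static Map<State, Integer> countWithMerge(List<State> states) {
        Map<State, Integer> counts = new EnumMap<>(State.class);
        for (State s : states) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }
}
```

Both methods produce the same counts; the second simply does less work per element. A primitive-valued map such as the one suggested in the review additionally avoids boxing the counters.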

DaveCTurner · 2025-07-08T07:52:30Z

server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

+                    for (SnapshotsInProgress.Entry snapshot : snapshotsInProgress.forRepo(projectId, repository.name())) {
+                        for (ShardSnapshotStatus shardSnapshotStatus : snapshot.shards().values()) {


I worry about these potentially-deeply-nested loops happening on each metric collection cycle. We're already keeping track of these things in applyClusterState (the only place they can change) - could we compute these counts there instead?

Done in e90ff65

DaveCTurner · 2025-07-08T07:53:04Z

server/src/test/java/org/elasticsearch/action/admin/cluster/node/stats/NodeStatsTests.java

@@ -1069,7 +1069,7 @@ public static NodeStats createNodeStats() {
            );
        }
        RepositoriesStats repositoriesStats = new RepositoriesStats(
-            Map.of("test-repository", new RepositoriesStats.ThrottlingStats(100, 200))
+            Map.of("test-repository", new RepositoriesStats.SnapshotStats(100, 200))


Could we use the full SnapshotStats here rather than the legacy throttling-only one? And assert its contents?

Done in 1dec741

DaveCTurner · 2025-07-14T07:32:57Z

server/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

            + ", startTime="
-            + startTime
+            + startTimeMillis
            + ", totalTime="
-            + totalTime
+            + totalTimeMillis


nit: could rename the labels here too

DaveCTurner · 2025-07-14T07:47:27Z

server/src/main/java/org/elasticsearch/cluster/SnapshotsInProgress.java

+            this(entries, stateSummaries.v1(), stateSummaries.v2());
+        }
+
+        private static Tuple<Map<State, Integer>, Map<ShardState, Integer>> calculateStateSummaries(List<Entry> entries) {


Hmm I think this means we do this computation on every node now which seems wasteful. Could we do it in SnapshotsService still, just on the master?

When I suggested doing this in applyClusterState I meant just updating the existing stats according to the new cluster state, not computing everything from scratch. If we have to do it from scratch every time then I guess it'd be better to happen on the stats-collection thread rather than the cluster applier. At least we could cache the results assuming they won't change before the next stats collection?
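
One possible shape for the caching idea in the comment above, as a hedged sketch: compute the summary lazily on the stats-collection side and reuse it until a different state instance is observed. `CachedSummarySketch` and its type parameters are hypothetical, and the sketch assumes the source object is immutable, so an identity check is enough to detect change.

```java
import java.util.function.Function;

/** Hypothetical cache: recomputes a summary only when the observed source instance changes. */
final class CachedSummarySketch<S, R> {

    /** Immutable pair, so the source and its result are always published together. */
    private static final class Entry<S, R> {
        final S source;
        final R result;

        Entry(S source, R result) {
            this.source = source;
            this.result = result;
        }
    }

    private final Function<S, R> compute;
    private volatile Entry<S, R> cached; // null until the first collection

    CachedSummarySketch(Function<S, R> compute) {
        this.compute = compute;
    }

    /** Intended to be called from the stats-collection thread, not the cluster applier. */
    R get(S current) {
        Entry<S, R> local = cached;
        // Identity check: the same immutable instance means nothing has changed.
        if (local == null || local.source != current) {
            local = new Entry<>(current, compute.apply(current));
            cached = local;
        }
        return local.result;
    }
}
```

Concurrent callers may occasionally compute the same summary twice, which is harmless here because the computation is a pure function of the source; the applier thread never pays for it.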

Track snapshot stats as metrics

e7794c1

elasticsearchmachine added the v9.2.0 label Jun 30, 2025

nicktindall added 3 commits June 30, 2025 11:49

Fix double counted snapshot completion

b8be99e

Reduce size of change

9eeee30

Add MeterRegistry param in callers

67eb753

elasticsearchmachine added the serverless-linked label Jun 30, 2025

nicktindall added the >non-issue and :Distributed Coordination/Snapshot/Restore labels Jun 30, 2025

nicktindall and others added 22 commits June 30, 2025 14:29

Make banned implementation final

faf4e7a

Improve javadoc

5a33bb6

Fix naming

b4c926f

Fix naming, record shard duration as histogram

d808b85

Millis -> nanos

fd55b35

Reuse totalTime

ffdb941

Don't use cached time

6e22dc6

First pass on tests

818c259

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

b929cd1

…ot_stats_as_metrics

Fix SnapshotMetricsIT

9ba0d4a

Naming

45909db

Assert on throttling metrics

f42f9bd

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

ef19cd5

…ot_stats_as_metrics

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

adc149a

…ot_stats_as_metrics

Add snapshot APM metrics

345cc59

Add snapshot metrics

b341645

Tidy

3894be6

Tidy

d164c8c

Tidy

b8cc9f9

Reduce surface area of change (?)

7f7427c

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

3aa7ca6

…ot_stats_as_metrics

URLRepository

e2665d1

nicktindall added 2 commits July 8, 2025 16:38

Use humanReadableField

03e9d81

Better names for uploaded size/blobs

c4a0d67

DaveCTurner reviewed Jul 8, 2025

nicktindall added 21 commits July 8, 2025 17:35

Assert common attributes for shards-by-status metrics

3f8762d

Reduce number of documents indexed

46f892a

Use millis rather than nanos when measuring/counting upload/read time

8bf2475

Write throttle time as nanos (can't use humanReadableField because it…

02d11fd

… writes in millis)

Include unit in IndexShardSnapshotStatus#(startTime|totalTime)

4a8fb9d

Use VLong to encode extended fields, zero for BWC

e344693

Remove default getShardSnapshotsInProgress

ff82158

Remove default Repository#getSnapshotStats(), Repository#getSnapshotT…

0d9a62d

…hrottleTimeInNanos() and Repository#getRestoreThrottleTimeInNanos()

Tidy up close listener creation

b665a49

Explain why we use seconds for duration histograms

7d9b36d

Track all snapshot shard statuses

e9f32a0

Use com.carrotsearch.hppc.ObjectIntMap.addTo

ba62ff6

Pre-calculate shard & snapshot state summaries in cluster state

e90ff65

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

2ef402e

…ot_stats_as_metrics # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

6c5d200

…ot_stats_as_metrics # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Don't try and get shards for clone entry

a43ac7e

Add snapshots by state metric

093ae22

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

81ce452

Remove remnants of limited state tracking

f509643

Remove redundant snapshots in progress metric

3ac413c

Populate and assert on all snapshotStats fields

1dec741

nicktindall requested a review from ywangd July 10, 2025 03:57

nicktindall added 3 commits July 10, 2025 14:09

Fix flakiness in RepositorySnapshotStatsIT, remove dead code

850c116

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

6dd44f1

Fix assertion

ca9d1ab

nicktindall requested a review from DaveCTurner July 10, 2025 06:33

DaveCTurner reviewed Jul 14, 2025
