Description
We noticed that TransportGetShutdownStatusAction can sometimes take more than 30 minutes to execute on large clusters:
handling request [InboundMessage{Header{589}{8070199}{4327671}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [2507091ms] which is above the warn threshold of [5000ms]
handling request [InboundMessage{Header{589}{8070199}{247173}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [1951209ms] which is above the warn threshold of [5000ms]
handling request [InboundMessage{Header{589}{8070199}{2121260298}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [2305361ms] which is above the warn threshold of [5000ms]
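For readability, the millisecond durations in the log lines above can be converted to minutes and seconds. This is a small sketch using only `java.time.Duration`; the class and method names (`LogDurations`, `format`) are hypothetical and not part of Elasticsearch.

```java
import java.time.Duration;

public class LogDurations {
    // Formats a millisecond duration as "<minutes>m<seconds>s", truncating sub-second precision.
    public static String format(long millis) {
        Duration d = Duration.ofMillis(millis);
        return d.toMinutes() + "m" + d.toSecondsPart() + "s";
    }

    public static void main(String[] args) {
        // Durations copied from the three warn-threshold log lines in this report.
        long[] tookMillis = {2507091L, 1951209L, 2305361L};
        for (long ms : tookMillis) {
            System.out.println(ms + "ms = " + format(ms));
        }
    }
}
```

All three requests ran for well over 30 minutes (roughly 42, 33, and 38 minutes).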
In this case the cluster had 40K shards (~30K partially mounted searchable snapshot indices with 1 primary and 0 replicas, ~5K regular indices with 1 primary and 1 replica). We saw this on an 8.7.1 cluster, but I suspect all versions are affected.
It's not clear where the processing time is spent. From reading the code, we suspect that most of it goes to computing the allocation explanation for every started shard on the node:
Lines 273 to 277 in c956eec
Optional<Tuple<ShardRouting, ShardAllocationDecision>> unmovableShard = currentState.getRoutingNodes()
    .node(nodeId)
    .shardsWithState(ShardRoutingState.STARTED)
    .peek(s -> cancellableTask.ensureNotCancelled())
    .map(shardRouting -> new Tuple<>(shardRouting, allocationService.explainShardAllocation(shardRouting, allocation)))
The class also collects every matching unassigned shard into a list, only to keep the first one afterwards:
Lines 216 to 225 in c956eec
var unassignedShards = currentState.getRoutingNodes()
    .unassigned()
    .stream()
    .peek(s -> cancellableTask.ensureNotCancelled())
    .filter(s -> Objects.equals(s.unassignedInfo().getLastAllocatedNodeId(), nodeId))
    .filter(s -> s.primary() || hasShardCopyOnAnotherNode(currentState, s, shuttingDownNodes) == false)
    .toList();
if (unassignedShards.isEmpty() == false) {
    var shardRouting = unassignedShards.get(0);
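If only the first matching shard is needed, a short-circuiting `findFirst()` would stop evaluating the filters at the first match instead of running them over every unassigned shard. The sketch below illustrates the difference on plain Java streams. It is not the Elasticsearch code: `Shard`, `firstViaList`, `firstViaFindFirst`, and the `visited` counter are hypothetical stand-ins, and it assumes nothing else in the method depends on the full list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class FirstMatchSketch {
    // Hypothetical stand-in for ShardRouting; only the fields used by the filters.
    public record Shard(int id, String lastNodeId, boolean primary) {}

    // Same shape as the snippet above: materialize every match, then take element 0.
    public static Optional<Shard> firstViaList(List<Shard> shards, String nodeId, AtomicInteger visited) {
        List<Shard> matches = shards.stream()
            .peek(s -> visited.incrementAndGet())       // counts elements pulled through the pipeline
            .filter(s -> nodeId.equals(s.lastNodeId()))
            .filter(Shard::primary)
            .toList();
        return matches.isEmpty() ? Optional.empty() : Optional.of(matches.get(0));
    }

    // Short-circuiting variant: the stream stops as soon as one element passes both filters.
    public static Optional<Shard> firstViaFindFirst(List<Shard> shards, String nodeId, AtomicInteger visited) {
        return shards.stream()
            .peek(s -> visited.incrementAndGet())
            .filter(s -> nodeId.equals(s.lastNodeId()))
            .filter(Shard::primary)
            .findFirst();
    }

    public static void main(String[] args) {
        // Build 1000 shards that all match, so the first match is the first element.
        List<Shard> shards = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            shards.add(new Shard(i, "node-1", true));
        }

        AtomicInteger listVisited = new AtomicInteger();
        AtomicInteger firstVisited = new AtomicInteger();
        Optional<Shard> a = firstViaList(shards, "node-1", listVisited);
        Optional<Shard> b = firstViaFindFirst(shards, "node-1", firstVisited);

        System.out.println("same result: " + a.equals(b));
        System.out.println("visited via toList(): " + listVisited.get());
        System.out.println("visited via findFirst(): " + firstVisited.get());
    }
}
```

Both variants return the same shard, but the `toList()` version evaluates the filters for every element, while the `findFirst()` version stops at the first match. The saving matters most when a filter is itself expensive, as a per-shard lookup across other nodes might be.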