TransportGetShutdownStatusAction can take minutes to complete #100506

@tlrx

We noticed that TransportGetShutdownStatusAction can sometimes take more than 30 minutes to execute on large clusters:

handling request [InboundMessage{Header{589}{8070199}{4327671}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [2507091ms] which is above the warn threshold of [5000ms]

handling request [InboundMessage{Header{589}{8070199}{247173}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [1951209ms] which is above the warn threshold of [5000ms]

handling request [InboundMessage{Header{589}{8070199}{2121260298}{true}{false}{false}{false}{cluster:admin/shutdown/get}}] took [2305361ms] which is above the warn threshold of [5000ms]

In this case the cluster had 40K shards (~30K partially mounted searchable snapshot indices with 1 primary and 0 replicas, ~5K regular indices with 1 primary and 1 replica). We saw this on an 8.7.1 cluster, but I suspect all versions are affected.

It's not clear where the processing time is spent. From reading the code, we suspect most of it goes to computing the allocation explanation for every started shard on the node:

Optional<Tuple<ShardRouting, ShardAllocationDecision>> unmovableShard = currentState.getRoutingNodes()
    .node(nodeId)
    .shardsWithState(ShardRoutingState.STARTED)
    .peek(s -> cancellableTask.ensureNotCancelled())
    .map(shardRouting -> new Tuple<>(shardRouting, allocationService.explainShardAllocation(shardRouting, allocation)))
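Since this is only a suspicion, one way to confirm it would be to time the per-shard call separately from the rest of the pipeline. Below is a minimal sketch of that idea using plain JDK classes. `StageTimingSketch`, `timed` and `explainStandIn` are hypothetical names invented for illustration; `explainStandIn` is just a busy loop standing in for an expensive per-shard computation and is not the Elasticsearch API.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StageTimingSketch {

    // Wraps a function so that the time spent inside it accumulates in totalNanos.
    static <T, R> Function<T, R> timed(Function<T, R> fn, AtomicLong totalNanos) {
        return input -> {
            long start = System.nanoTime();
            try {
                return fn.apply(input);
            } finally {
                totalNanos.addAndGet(System.nanoTime() - start);
            }
        };
    }

    // Hypothetical stand-in for an expensive per-shard computation (assumption, not a real API).
    static String explainStandIn(Integer shardId) {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) {
            acc += (long) i * shardId;
        }
        return "shard-" + shardId + ":" + (acc % 7);
    }

    public static void main(String[] args) {
        AtomicLong explainNanos = new AtomicLong();
        Function<Integer, String> timedExplain = timed(StageTimingSketch::explainStandIn, explainNanos);

        List<Integer> shardIds = List.of(1, 2, 3, 4, 5);
        long start = System.nanoTime();
        List<String> results = shardIds.stream().map(timedExplain).collect(Collectors.toList());
        long totalNanos = System.nanoTime() - start;

        // Compare the time spent in the wrapped stage with the whole pipeline.
        System.out.printf("results=%d, stage=%dus, total=%dus%n",
            results.size(), explainNanos.get() / 1_000, totalNanos / 1_000);
    }
}
```

If the wrapped stage accounts for nearly all of the total, the per-shard call is the bottleneck; if not, the time is going elsewhere in the action.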

The class also collects every matching unassigned shard into a list, only to use the first one afterwards:

var unassignedShards = currentState.getRoutingNodes()
    .unassigned()
    .stream()
    .peek(s -> cancellableTask.ensureNotCancelled())
    .filter(s -> Objects.equals(s.unassignedInfo().getLastAllocatedNodeId(), nodeId))
    .filter(s -> s.primary() || hasShardCopyOnAnotherNode(currentState, s, shuttingDownNodes) == false)
    .toList();
if (unassignedShards.isEmpty() == false) {
    var shardRouting = unassignedShards.get(0);
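To illustrate why collecting the whole list matters, here is a small sketch in plain Java, not the actual Elasticsearch code. It contrasts the collect-then-`get(0)` pattern with a short-circuiting `findFirst()`. `FirstMatchSketch` and its methods are hypothetical names, and the predicate stands in for whatever filter is expensive. Whether this helps in practice depends on how costly the filters are and how early the first match appears.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FirstMatchSketch {

    // Eager variant: mirrors the collect-then-get(0) pattern, so every element is tested.
    static <T> Optional<T> firstViaList(List<T> items, Predicate<T> check) {
        List<T> matches = items.stream().filter(check).collect(Collectors.toList());
        return matches.isEmpty() ? Optional.empty() : Optional.of(matches.get(0));
    }

    // Lazy variant: findFirst() stops pulling elements once a match is found.
    static <T> Optional<T> firstViaFindFirst(List<T> items, Predicate<T> check) {
        return items.stream().filter(check).findFirst();
    }

    public static void main(String[] args) {
        List<Integer> shardIds = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        AtomicInteger eagerCalls = new AtomicInteger();
        AtomicInteger lazyCalls = new AtomicInteger();

        // Both predicates count how often they run; the match condition is arbitrary.
        Optional<Integer> eager = firstViaList(shardIds, id -> {
            eagerCalls.incrementAndGet();
            return id % 3 == 0;
        });
        Optional<Integer> lazy = firstViaFindFirst(shardIds, id -> {
            lazyCalls.incrementAndGet();
            return id % 3 == 0;
        });

        System.out.println("eager: result=" + eager.get() + ", checks=" + eagerCalls.get()); // result=3, checks=8
        System.out.println("lazy:  result=" + lazy.get() + ", checks=" + lazyCalls.get());   // result=3, checks=3
    }
}
```

Both variants return the same element, but the lazy one evaluates the predicate only until the first match.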
