Description
When running the blocks storage, store-gateways reshard blocks whenever the ring topology changes. This means that during a rollout of the store-gateways (ie. deploying a config change or a version upgrade) blocks are resharded across instances.
This is highly inefficient in a cluster with a large number of tenants or a few very large tenants. Ideally, no blocks resharding should occur during a rollout (if the blocks replication factor is > 1).
Rollouts
We could improve the system to avoid the blocks resharding when the following conditions are met:
- -experimental.store-gateway.replication-factor is > 1 (so that while a store-gateway restarts all of its blocks are replicated to at least one other instance)
- -experimental.store-gateway.tokens-file-path is configured (so that previous tokens are picked up on restart)
- The store-gateway instance ID is stable across restarts (ie. Kubernetes StatefulSets)
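For example (a hypothetical configuration: the flag names are the ones above, but the replication factor value and tokens file path are illustrative), a store-gateway run as a Kubernetes StatefulSet could be started with:

```
-experimental.store-gateway.replication-factor=3
-experimental.store-gateway.tokens-file-path=/data/tokens
```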
To avoid blocks resharding during store-gateways rollout we need the restarting store-gateway instance to not be unregistered from the ring during the restart.
When a store-gateway shuts down, the instance could be left in the LEAVING state within the ring, and we could change BlocksReplicationStrategy.ShouldExtendReplicaSet() to not extend the replica set if an instance is in the LEAVING state.
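A minimal sketch of this change, assuming a method signature along the lines of the ring's replication strategy interface (the exact types and parameters here are assumptions, not the actual Cortex code):

```go
package storegateway

import "github.com/cortexproject/cortex/pkg/ring"

// BlocksReplicationStrategy stands in for the existing store-gateway
// replication strategy; only ShouldExtendReplicaSet() would change.
type BlocksReplicationStrategy struct{}

// ShouldExtendReplicaSet returns true when the replica set should be extended
// with an extra instance because this one can't be relied upon.
func (s *BlocksReplicationStrategy) ShouldExtendReplicaSet(instance ring.IngesterDesc, op ring.Operation) bool {
	switch instance.GetState() {
	case ring.ACTIVE:
		// Healthy instance: no need to extend the replica set.
		return false
	case ring.LEAVING:
		// Proposed change: a LEAVING instance is assumed to be restarting
		// (rollout), so keep it in the replica set and don't reshard its blocks.
		return false
	default:
		return true
	}
}
```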
This means that during the rollout the blocks held by the restarting replica will have N-1 replicas (instead of the N desired replicas configured). Once the instance restarts, it will have the same instance ID and the same tokens (assuming tokens-file-path is configured) and will thus switch its state within the ring from LEAVING to JOINING.
Scale down
There's no way to distinguish between a rollout and a scale down: the process just receives a termination signal.
This means that during a scale down the instance would be left in the LEAVING state within the ring. However, the store-gateway has an auto-forget feature which removes unhealthy instances after 10x the heartbeat timeout (default: 1m timeout = 10m before an unhealthy instance is forgotten).
A scale down of a number of instances < replication factor could leverage the auto-forget. However, there's no easy way to have a smooth scale down unless we have a way to signal the process whether it's shutting down because of a scale down or a rollout.
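As a rough illustration of the auto-forget timing described above (a sketch only, not the actual Cortex implementation; the names are made up):

```go
package storegateway

import "time"

// autoForgetUnhealthyPeriods mirrors the "10x heartbeat timeout" rule: with
// the default 1m heartbeat timeout, an unhealthy instance is forgotten after
// 10 minutes.
const autoForgetUnhealthyPeriods = 10

// shouldAutoForget reports whether an instance's last heartbeat is old enough
// for the instance to be removed from the ring.
func shouldAutoForget(now, lastHeartbeat time.Time, heartbeatTimeout time.Duration) bool {
	return now.Sub(lastHeartbeat) > autoForgetUnhealthyPeriods*heartbeatTimeout
}
```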
Crashes
In case a store-gateway crashes, there would be no difference compared to today.