Description
When running the blocks storage, store-gateways reshard blocks whenever the ring topology changes. This means that during a rollout of the store-gateways (ie. deploying a config change or a version upgrade) blocks are resharded across instances.
This is highly inefficient in a cluster with a large number of tenants or a few very large tenants. Ideally, no blocks resharding should occur during a rollout (if the blocks replication factor is > 1).
Rollouts
We could improve the system to avoid the blocks resharding when the following conditions are met:
- -experimental.store-gateway.replication-factor is > 1 (so that while a store-gateway restarts all of its blocks are replicated to at least one other instance)
- -experimental.store-gateway.tokens-file-path is configured (so that previous tokens are picked up on restart)
- The store-gateway instance ID is stable across restarts (ie. Kubernetes StatefulSets)
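For example (a hypothetical configuration: the flag names are the ones above, but the replication factor value and tokens file path are illustrative), a store-gateway run as a Kubernetes StatefulSet could be started with:

```
-experimental.store-gateway.replication-factor=3
-experimental.store-gateway.tokens-file-path=/data/tokens
```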
To avoid blocks resharding during store-gateways rollout we need the restarting store-gateway instance to not be unregistered from the ring during the restart.
When a store-gateway shuts down, the instance could be left in the LEAVING state within the ring, and we could change BlocksReplicationStrategy.ShouldExtendReplicaSet() to not extend the replica set if an instance is in the LEAVING state.
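A minimal sketch of this change, assuming a method signature along the lines of the ring's replication strategy interface (the exact types and parameters here are assumptions, not the actual Cortex code):

```go
package storegateway

import "github.com/cortexproject/cortex/pkg/ring"

// BlocksReplicationStrategy stands in for the existing store-gateway
// replication strategy; only ShouldExtendReplicaSet() would change.
type BlocksReplicationStrategy struct{}

// ShouldExtendReplicaSet returns true when the replica set should be extended
// with an extra instance because this one can't be relied upon.
func (s *BlocksReplicationStrategy) ShouldExtendReplicaSet(instance ring.IngesterDesc, op ring.Operation) bool {
	switch instance.GetState() {
	case ring.ACTIVE:
		// Healthy instance: no need to extend the replica set.
		return false
	case ring.LEAVING:
		// Proposed change: a LEAVING instance is assumed to be restarting
		// (rollout), so keep it in the replica set and don't reshard its blocks.
		return false
	default:
		return true
	}
}
```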
This means that during the rollout the blocks held by the restarting replica will have N-1 replicas (instead of the N desired replicas configured). Once the instance restarts, it will have the same instance ID and the same tokens (assuming tokens-file-path is configured) and will thus switch its state within the ring from LEAVING to JOINING.
Scale down
There's no way to distinguish between a rollout and a scale down: the process just receives a termination signal.
This means that during a scale down the instance would be left in the LEAVING state within the ring. However, the store-gateway has an auto-forget feature which removes unhealthy instances after 10x the heartbeat timeout (default: 1m timeout = 10m before an unhealthy instance is forgotten).
A scale down of a number of instances < replication factor could leverage the auto-forget. However, there's no easy way to have a smooth scale down unless we have a way to signal the process whether it's shutting down because of a scale down or a rollout.
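As a rough illustration of the auto-forget timing described above (a sketch only, not the actual Cortex implementation; the names are made up):

```go
package storegateway

import "time"

// autoForgetUnhealthyPeriods mirrors the "10x heartbeat timeout" rule: with
// the default 1m heartbeat timeout, an unhealthy instance is forgotten after
// 10 minutes.
const autoForgetUnhealthyPeriods = 10

// shouldAutoForget reports whether an instance's last heartbeat is old enough
// for the instance to be removed from the ring.
func shouldAutoForget(now, lastHeartbeat time.Time, heartbeatTimeout time.Duration) bool {
	return now.Sub(lastHeartbeat) > autoForgetUnhealthyPeriods*heartbeatTimeout
}
```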
Crashes
In case a store-gateway crashes, there would be no difference compared to today.