Store-gateway blocks resharding during rollout #2823

@pracucci

When running the blocks storage, store-gateways reshard blocks whenever the ring topology changes. This means that during a rollout of the store-gateways (i.e. deploying a config change or a version upgrade) blocks are resharded across instances.

This is highly inefficient in a cluster with a large number of tenants or a few very large tenants. Ideally, no blocks resharding should occur during a rollout (as long as the blocks replication factor is > 1).

Rollouts

We could improve the system to avoid blocks resharding when all of the following conditions are met (restated in the sketch after this list):

  • -experimental.store-gateway.replication-factor is > 1 (so that, while a store-gateway restarts, all of its blocks are replicated to at least one other instance)
  • -experimental.store-gateway.tokens-file-path is configured (so that the previous tokens are picked up again on restart)
  • The store-gateway instance ID is stable across restarts (i.e. Kubernetes StatefulSets)
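
A minimal sketch of these conditions as a single check, assuming illustrative config field names (this is not the actual Cortex configuration struct):

```go
package ringsketch

// Config holds only the settings relevant to this proposal; the field names
// are illustrative, not the actual Cortex configuration.
type Config struct {
	ReplicationFactor int    // -experimental.store-gateway.replication-factor
	TokensFilePath    string // -experimental.store-gateway.tokens-file-path
	StableInstanceID  bool   // e.g. running as a Kubernetes StatefulSet
}

// canAvoidResharding restates the three conditions listed above.
func canAvoidResharding(cfg Config) bool {
	return cfg.ReplicationFactor > 1 &&
		cfg.TokensFilePath != "" &&
		cfg.StableInstanceID
}
```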

To avoid blocks resharding during a store-gateways rollout, the restarting store-gateway instance must not be unregistered from the ring during the restart.

When a store-gateway shuts down, the instance could be left in the LEAVING state within the ring, and BlocksReplicationStrategy.ShouldExtendReplicaSet() could be changed to not extend the replica set if an instance is in the LEAVING state.
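
A minimal sketch of the proposed behaviour, using simplified types rather than the real Cortex ring API (names and signature are illustrative):

```go
package ringsketch

// InstanceState is a simplified version of the ring instance state.
type InstanceState int

const (
	ACTIVE InstanceState = iota
	JOINING
	LEAVING
)

// InstanceDesc is a simplified ring entry.
type InstanceDesc struct {
	ID    string
	State InstanceState
}

// shouldExtendReplicaSet sketches the proposed change to
// BlocksReplicationStrategy.ShouldExtendReplicaSet(): the replica set is
// extended while an instance is JOINING (it may not have loaded its blocks
// yet), but NOT while it's LEAVING, because a restarting store-gateway is
// expected to come back with the same ID and tokens.
func shouldExtendReplicaSet(instance InstanceDesc) bool {
	return instance.State == JOINING
}
```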

This means that during the rollout the blocks held by the restarting replica will have N-1 replicas (instead of the N desired replicas configured). Once the instance restarts, it will have the same instance ID and the same tokens (assuming tokens-file-path is configured) and thus will switch its state in the ring from LEAVING to JOINING.
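
For illustration, a simplified version of the tokens-file behaviour this relies on (not the actual Cortex implementation; the helper name is hypothetical):

```go
package ringsketch

import (
	"encoding/json"
	"math/rand"
	"os"
	"sort"
)

// loadOrGenerateTokens illustrates why the tokens file matters for this
// proposal: on restart the instance reloads the exact same tokens, so it
// takes back ownership of the same blocks instead of triggering a reshard.
func loadOrGenerateTokens(path string, numTokens int) ([]uint32, error) {
	// Reuse previously persisted tokens, if any.
	if data, err := os.ReadFile(path); err == nil {
		var tokens []uint32
		if err := json.Unmarshal(data, &tokens); err == nil && len(tokens) > 0 {
			return tokens, nil
		}
	}

	// First start (or unreadable file): generate fresh tokens and persist
	// them for the next restart.
	tokens := make([]uint32, 0, numTokens)
	for i := 0; i < numTokens; i++ {
		tokens = append(tokens, rand.Uint32())
	}
	sort.Slice(tokens, func(i, j int) bool { return tokens[i] < tokens[j] })

	data, err := json.Marshal(tokens)
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return nil, err
	}
	return tokens, nil
}
```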

Scale down

There's no way to distinguish between a rollout and a scale down: the process just receives a termination signal.

This means that during a scale down, the instance would be left in the LEAVING state within the ring. However, the store-gateway has an auto-forget feature which removes unhealthy instances after 10x the heartbeat timeout (with the default 1m timeout, an unhealthy instance is forgotten after 10m).
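
A simplified sketch of that auto-forget check (the constant and signature are illustrative, not the actual implementation):

```go
package ringsketch

import "time"

// autoForgetPeriods: an unhealthy instance is forgotten after 10x the
// heartbeat timeout (10 * 1m = 10m with the default timeout).
const autoForgetPeriods = 10

// shouldAutoForget is a simplified version of the auto-forget check applied
// to each ring entry.
func shouldAutoForget(lastHeartbeat time.Time, heartbeatTimeout time.Duration, now time.Time) bool {
	return now.Sub(lastHeartbeat) > autoForgetPeriods*heartbeatTimeout
}
```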

A scale down of a number of instances smaller than the replication factor could leverage the auto-forget. However, there's no easy way to have a smooth scale down unless we have a way to signal the process whether it's shutting down because of a scale down or a rollout.

Crashes

If a store-gateway crashes, there would be no difference compared to today.
