
Doubled buckets due to excluding nodes during rebalancing #623

@Serpentian


To avoid such a situation, the user should enable the autoexpel feature in Tarantool 3 (link). In Tarantool 2 with Cartridge it should work out of the box (link).


Even after implementing the solutions proposed in RFCs #619 and #620, it's still possible to encounter a doubled bucket situation if the user doesn't follow the rules in the documentation and excludes a node from the config without removing it from the _cluster space, then returns it to the config during rebalancing.

This is how it happens:

  1. The user starts rebalancing by changing weights in the cluster.
  2. The rebalancer on the master cannot sync with one node, so the user excludes that node from the configuration to continue rebalancing.
  3. The node is not deleted from the _cluster space for some reason (either autoexpel is disabled or something is not working).
  4. The rebalancer sends the bucket successfully, the bucket becomes ACTIVE on another replicaset, and replication with the excluded node is broken.
  5. The user returns the excluded node to the configuration and it becomes master, still holding the bucket in the ACTIVE state.
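
For illustration, here is a minimal sketch of how the symptom from step 5 looks on the storages. It assumes a plain vshard storage, where bucket state lives in the `_bucket` space with `id` and `status` fields; the helper itself is hypothetical and not part of vshard.

```lua
-- Hypothetical helper, not part of vshard: run on the master of each
-- replicaset to list the ids of the buckets this storage considers ACTIVE.
local function active_bucket_ids()
    local ids = {}
    for _, b in box.space._bucket:pairs() do
        if b.status == 'active' then
            table.insert(ids, b.id)
        end
    end
    return ids
end

-- After step 5 two replicasets report overlapping sets, e.g. bucket 1500
-- shows up as ACTIVE both on its new replicaset and on the returned node.
return active_bucket_ids()
```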

If the node had been removed from the _cluster space, this would not have happened, since it would have had to rejoin and would have received the actual state of the _bucket space. So, using autoexpel in Tarantool 3 fixes the issue.
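
For reference, a hedged sketch of the manual equivalent of what autoexpel automates; the instance id used below is an assumption for illustration only.

```lua
-- Manual equivalent of what autoexpel automates: on the master that still
-- lists the excluded instance, drop its record from the _cluster space so
-- the instance has to rejoin and fetch the current _bucket state.
-- The id 3 below is a hypothetical instance id taken from box.space._cluster.
box.space._cluster:delete{3}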


We need to find a way to protect the user from such a situation, but it seems that, at the current point, there's no way to do that:

  1. The only 100% safe solution, which doesn't depend on configuration systems, is syncing with all nodes in the _cluster space. However, we still have CDC (or any other non-vshard instances), which may lag a lot; moreover, we won't be able to check refs there, which is needed for Inconsistencies in RW refs and bucket statuses #620 or map_callro. So, we must rely on the vshard configuration.

  2. We could add a configuration option to vshard that would filter out nodes in the _cluster for us, but the user could still add a vshard instance to it, which would again lead to doubled buckets. Not an option. So here it's better to assume that all instances which are in _cluster but not in the vshard configuration are non-vshard instances.

  3. Prohibit adding a new instance that is already in the _cluster space while the node is sending buckets (see the sketch after this list). Unfortunately, this still won't work: the node is expelled from the cluster, replication is broken, the node successfully sends the bucket (the bucket is completely deleted), then the user adds that node to the vshard configuration again, and we get a doubled bucket.

  4. Vshard completely forbids adding new nodes to a configuration if they're already in the _cluster space: drop the node from _cluster, and only then can you add it again. As stated by @sergepetrenko, this is not an option, because a configuration cannot be rejected based on some external factor (link).
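
To make options 3 and 4 concrete, here is a rough sketch of the membership check they would both need; this is not actual vshard code, just a comparison of an added instance's UUID against the records already in box.space._cluster.

```lua
-- Rough sketch of the membership check behind options 3 and 4; this is not
-- actual vshard code. Returns true if an instance with the given UUID is
-- already registered in the _cluster space of this replicaset.
local function is_known_to_cluster(instance_uuid)
    for _, rec in box.space._cluster:pairs() do
        -- _cluster tuples are {id, uuid, ...}.
        if rec[2] == instance_uuid then
            return true
        end
    end
    return false
end

-- Option 4 would reject any configuration that adds such an instance;
-- option 3 would reject it only while buckets are being sent. Both were
-- found insufficient for the reasons given above.
```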

Proposed by @Gerold103 (link):

  1. Raft would be a solution, but we still need to support versions where it is broken. It is broken on all of them right now: because of the Synchronous Promote ticket, which I am working on; because of the existence of IPROTO_RAFT_ROLLBACK, which must be deleted; and because election_mode='off' allows promotes.

  2. I feel it would be enough to find a way to guarantee that a stale node attempting any illegal changes eventually gets a replication error from the other nodes, so the router starts ignoring it. It feels like this could be possible via bucket generations, but I don't see a way. Without actual elections and a 50%+1 quorum, there is always a way to make a stale node a master, make it ignore all the other nodes, and go crazy.

  3. Maybe the best we can do is to make the router and/or rebalancer notice that the same buckets are active on different replicasets, and signal this to the user so it doesn't go unnoticed. Noticing that is problematic, though. The rebalancer would have to collect specific bucket IDs, and I wouldn't want routers to become essential to this logic, since we literally have some people using non-Lua routers, and they would probably miss this logic. Discovery of double buckets could work as part of the rebalancer, perhaps: it would fetch the bucket IDs from the storages and see if any ID appears twice. If it does, we need to double-check it again on the storages where it happened, because the bucket could simply have been moved between the ID downloads. If it appears more than once on the second check with the same generations, then we alert the user (see the sketch after this list).
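
A minimal sketch of the detection idea from point 3, assuming a hypothetical fetch_active_ids(replicaset) helper that returns the list of ACTIVE bucket ids from one storage (e.g. via a remote call that scans _bucket); the generation-based double-check mentioned above is left out.

```lua
-- Sketch of double-bucket discovery on the rebalancer. fetch_active_ids() is
-- a hypothetical helper that asks one replicaset's master for the ids of its
-- ACTIVE buckets (e.g. via a remote call that scans _bucket).
local function find_doubled_buckets(replicasets, fetch_active_ids)
    local seen = {}     -- bucket id -> first replicaset that reported it
    local doubled = {}  -- bucket id -> all replicasets that reported it
    for _, rs in pairs(replicasets) do
        for _, id in ipairs(fetch_active_ids(rs)) do
            if seen[id] == nil then
                seen[id] = rs
            else
                doubled[id] = doubled[id] or {seen[id]}
                table.insert(doubled[id], rs)
            end
        end
    end
    -- The ids in `doubled` are only candidates: a bucket could have moved
    -- between the downloads, so each one must be re-checked on the storages
    -- that reported it (with the same generations) before alerting the user.
    return doubled
end
```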
