
SlotNotCoveredError when cluster is resharding #2988

Closed
@dinispeixoto

Description


Version: What redis-py and what redis version is the issue happening on?

  • Client: 4.5.5
  • Cluster Engine: 6.2.6

Platform: What platform / version? (For example Python 3.5.1 on Windows 7 / Ubuntu 15.10 / Azure)

  • ECS and ElastiCache

Description: Description of your issue, stack traces from errors and code that reproduces the issue

When the cluster is scaling up or down, while slots are being migrated to the new shard, the client raises the following error:

redis.exceptions.SlotNotCoveredError: Command # 9 (HGETALL ...) of pipeline caused error: ('Slot "4718" not covered by the cluster. "require_full_coverage=True"',)

Full stack trace:

  File "/root/clients/redis_client.py", line 219, in run_queries
    return await pipeline.execute()
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1455, in execute
    return await self._execute(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1536, in _execute
    raise result
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1520, in _execute
    cmd.result = await client.execute_command(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 725, in execute_command
    raise e
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 696, in execute_command
    ret = await self._execute_command(target_nodes[0], *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 745, in _execute_command
    target_node = self.nodes_manager.get_node_from_slot(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1174, in get_node_from_slot
    raise SlotNotCoveredError(
redis.exceptions.SlotNotCoveredError: Command # 6 (HGET ... ...) of pipeline caused error: ('Slot "11246" not covered by the cluster. "require_full_coverage=True"',)

Reproduce locally
It's also possible to reproduce this locally quite easily:

  1. Set up a Redis cluster locally (https://github.com/Grokzen/docker-redis-cluster)
  2. Create a simple Redis app that uses a pipeline of read-only commands and run a load test to force multiple reads (a minimal sketch is shown after this list). I also made sure there was a different client writing non-stop to the local cluster.
  3. Force resharding of a few slots (e.g. 5k) with redis-cli --cluster reshard 0.0.0.0:7000
  4. The Redis app should raise SlotNotCoveredError quite often while the slot migration is happening; once the migration finishes, the errors stop.
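
For step 2, here is a minimal sketch of such an app, assuming a local cluster node reachable at 0.0.0.0:7000 and placeholder key names (not the exact app I used):

    import asyncio

    from redis.asyncio.cluster import RedisCluster


    async def main() -> None:
        # read_from_replicas=True routes read-only commands through the
        # round-robin load balancer, which is where the error shows up.
        rc = RedisCluster(host="0.0.0.0", port=7000, read_from_replicas=True)
        try:
            while True:
                pipe = rc.pipeline()
                for i in range(20):
                    pipe.hgetall(f"key:{i}")  # read-only commands only
                await pipe.execute()
        finally:
            await rc.close()


    asyncio.run(main())

Run it while triggering the resharding in step 3, and the pipeline execute() calls should start failing with SlotNotCoveredError.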

Possible root cause
I tried debugging the library and here's what I could find so far:

  1. The slots cache is outdated: the slot we need now lives in a different shard. The client replaces self.slots_cache[slot] with the redirected_node (returned by the MOVED error), which is only the master node of the slot's new shard.

https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1177

else:
    # The new slot owner is a new server, or a server from a different
    # shard. We need to remove all current nodes from the slot's list
    # (including replications) and add just the new node.
    self.slots_cache[e.slot_id] = [redirected_node]
  2. However, at the same time we might call NodesManager.get_node_from_slot. Since we are only reading from replicas, it starts by asking the round-robin load balancer for a server index. Say each shard has 1 master node and 2 replicas: the balancer can return an index from 0 to 2. But, as we saw above, for this specific slot the slots cache now holds a single entry (the master node). So whenever the load balancer returns 1 or 2, indexing the cache raises an IndexError, which is surfaced as a SlotNotCoveredError (a small illustration follows the snippet below).

https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1191C1-L1204C14

        try:
            if read_from_replicas:
                # get the server index in a Round-Robin manner
                primary_name = self.slots_cache[slot][0].name
                node_idx = self.read_load_balancer.get_server_index(
                    primary_name, len(self.slots_cache[slot])
                )
                return self.slots_cache[slot][node_idx]
            return self.slots_cache[slot][0]
        except (IndexError, TypeError):
            raise SlotNotCoveredError(
                f'Slot "{slot}" not covered by the cluster. '
                f'"require_full_coverage={self.require_full_coverage}"'
            )
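
To make the race concrete, here is a small self-contained illustration, assuming redis-py 4.x where the round-robin LoadBalancer lives in redis.cluster and keeps a per-primary counter (node names are made up):

    from redis.cluster import LoadBalancer

    lb = LoadBalancer()
    primary = "primary-a"

    # While the shard still has 1 primary + 2 replicas, two reads advance the
    # internal counter for this primary to 2.
    lb.get_server_index(primary, 3)
    lb.get_server_index(primary, 3)

    # MOVED handling then collapses slots_cache[slot] to just the redirected primary...
    slot_nodes = ["redirected_primary_node"]

    # ...but the balancer still hands out the stale index 2 for this primary:
    idx = lb.get_server_index(primary, len(slot_nodes))
    slot_nodes[idx]  # IndexError, surfaced to the caller as SlotNotCoveredError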
  3. To fix the issue, I patched the method to check whether the index returned by the load balancer exists in the slots cache, and to fall back to the primary node when it doesn't (a runtime workaround based on the same idea follows this snippet):
                node_idx = self.read_load_balancer.get_server_index(
                    primary_name, len(self.slots_cache[slot])
                )

                node = node_idx if node_idx < len(self.slots_cache[slot]) else 0
                return self.slots_cache[slot][node]
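
Until a proper fix lands, the same idea can be applied at runtime as a workaround. This is only a sketch, assuming the NodesManager.get_node_from_slot signature from redis-py 4.5.x (slot plus an optional read_from_replicas flag): it wraps the original method and falls back to the slot's primary when the stale replica index is not covered.

    from redis.asyncio.cluster import NodesManager
    from redis.exceptions import SlotNotCoveredError

    _original_get_node_from_slot = NodesManager.get_node_from_slot


    def _get_node_from_slot_with_fallback(self, slot, *args, **kwargs):
        try:
            return _original_get_node_from_slot(self, slot, *args, **kwargs)
        except SlotNotCoveredError:
            # If the slot is in the cache at all, fall back to its primary
            # (index 0); otherwise the slot really is uncovered, so re-raise.
            nodes = self.slots_cache.get(slot)
            if nodes:
                return nodes[0]
            raise


    NodesManager.get_node_from_slot = _get_node_from_slot_with_fallback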

Happy to create a PR if this makes sense. Sorry if I'm missing something.

Thanks in advance 🙌
