
SlotNotCoveredError when cluster is resharding #2988

Closed
@dinispeixoto

Description


Version: What redis-py and what redis version is the issue happening on?

  • Client: 4.5.5
  • Cluster Engine: 6.2.6

Platform: What platform / version? (For example Python 3.5.1 on Windows 7 / Ubuntu 15.10 / Azure)

  • ECS and ElastiCache

Description: Description of your issue, stack traces from errors and code that reproduces the issue

When the cluster is scaling up or down, while slots are being migrated to the new shard, the client raises the following error:

redis.exceptions.SlotNotCoveredError: Command # 9 (HGETALL ...) of pipeline caused error: ('Slot "4718" not covered by the cluster. "require_full_coverage=True"',)

Full stack trace:

  File "/root/clients/redis_client.py", line 219, in run_queries
    return await pipeline.execute()
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1455, in execute
    return await self._execute(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1536, in _execute
    raise result
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1520, in _execute
    cmd.result = await client.execute_command(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 725, in execute_command
    raise e
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 696, in execute_command
    ret = await self._execute_command(target_nodes[0], *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 745, in _execute_command
    target_node = self.nodes_manager.get_node_from_slot(
  File "/usr/local/lib/python3.8/dist-packages/redis/asyncio/cluster.py", line 1174, in get_node_from_slot
    raise SlotNotCoveredError(
redis.exceptions.SlotNotCoveredError: Command # 6 (HGET ... ...) of pipeline caused error: ('Slot "11246" not covered by the cluster. "require_full_coverage=True"',)

Reproduce locally
It's also possible to reproduce this locally quite easily:

  1. Set up a Redis cluster locally (https://github.com/Grokzen/docker-redis-cluster)
  2. Create a simple Redis app that uses a pipeline of read-only commands and run a load test to force multiple reads (a minimal sketch is shown after this list). I also made sure there was a different client writing non-stop to the local cluster.
  3. Force resharding of a few slots (e.g. 5k) with redis-cli --cluster reshard 0.0.0.0:7000
  4. The Redis app should raise SlotNotCoveredError quite often while the slot migration is happening; once the migration finishes, the errors stop.
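
For step 2, here is a minimal sketch of such an app, assuming a local cluster node reachable at 0.0.0.0:7000 and placeholder key names (not the exact app I used):

    import asyncio

    from redis.asyncio.cluster import RedisCluster


    async def main() -> None:
        # read_from_replicas=True routes read-only commands through the
        # round-robin load balancer, which is where the error shows up.
        rc = RedisCluster(host="0.0.0.0", port=7000, read_from_replicas=True)
        try:
            while True:
                pipe = rc.pipeline()
                for i in range(20):
                    pipe.hgetall(f"key:{i}")  # read-only commands only
                await pipe.execute()
        finally:
            await rc.close()


    asyncio.run(main())

Run it while triggering the resharding in step 3, and the pipeline execute() calls should start failing with SlotNotCoveredError.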

Possible root cause
I tried debugging the library and here's what I could find so far:

  1. The slots cache is outdated: the slot we need now lives in a different shard. The client replaces self.slots_cache[slot] with the redirected_node (returned by the MOVED error), which is only the master node of the slot's new shard.

https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1177

else:
    # The new slot owner is a new server, or a server from a different
    # shard. We need to remove all current nodes from the slot's list
    # (including replications) and add just the new node.
    self.slots_cache[e.slot_id] = [redirected_node]
  2. However, at the same time we might call NodesManager.get_node_from_slot. Since we are only reading from replicas, it starts by asking the round-robin load balancer for a server index. Say each shard has 1 master node and 2 replicas: the balancer can return an index from 0 to 2. But, as we saw above, for this specific slot the slots cache now holds a single entry (the master node). So whenever the load balancer returns 1 or 2, indexing the cache raises an IndexError, which is surfaced as a SlotNotCoveredError (a small illustration follows the snippet below).

https://github.com/redis/redis-py/blob/master/redis/asyncio/cluster.py#L1191C1-L1204C14

        try:
            if read_from_replicas:
                # get the server index in a Round-Robin manner
                primary_name = self.slots_cache[slot][0].name
                node_idx = self.read_load_balancer.get_server_index(
                    primary_name, len(self.slots_cache[slot])
                )
                return self.slots_cache[slot][node_idx]
            return self.slots_cache[slot][0]
        except (IndexError, TypeError):
            raise SlotNotCoveredError(
                f'Slot "{slot}" not covered by the cluster. '
                f'"require_full_coverage={self.require_full_coverage}"'
            )
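
To make the race concrete, here is a small self-contained illustration, assuming redis-py 4.x where the round-robin LoadBalancer lives in redis.cluster and keeps a per-primary counter (node names are made up):

    from redis.cluster import LoadBalancer

    lb = LoadBalancer()
    primary = "primary-a"

    # While the shard still has 1 primary + 2 replicas, two reads advance the
    # internal counter for this primary to 2.
    lb.get_server_index(primary, 3)
    lb.get_server_index(primary, 3)

    # MOVED handling then collapses slots_cache[slot] to just the redirected primary...
    slot_nodes = ["redirected_primary_node"]

    # ...but the balancer still hands out the stale index 2 for this primary:
    idx = lb.get_server_index(primary, len(slot_nodes))
    slot_nodes[idx]  # IndexError, surfaced to the caller as SlotNotCoveredError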
  3. To fix the issue, I patched the method to check whether the index returned by the load balancer exists in the slots cache, and to fall back to the primary node when it doesn't (a runtime workaround based on the same idea follows this snippet):
                node_idx = self.read_load_balancer.get_server_index(
                    primary_name, len(self.slots_cache[slot])
                )

                node = node_idx if node_idx < len(self.slots_cache[slot]) else 0
                return self.slots_cache[slot][node]
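
Until a proper fix lands, the same idea can be applied at runtime as a workaround. This is only a sketch, assuming the NodesManager.get_node_from_slot signature from redis-py 4.5.x (slot plus an optional read_from_replicas flag): it wraps the original method and falls back to the slot's primary when the stale replica index is not covered.

    from redis.asyncio.cluster import NodesManager
    from redis.exceptions import SlotNotCoveredError

    _original_get_node_from_slot = NodesManager.get_node_from_slot


    def _get_node_from_slot_with_fallback(self, slot, *args, **kwargs):
        try:
            return _original_get_node_from_slot(self, slot, *args, **kwargs)
        except SlotNotCoveredError:
            # If the slot is in the cache at all, fall back to its primary
            # (index 0); otherwise the slot really is uncovered, so re-raise.
            nodes = self.slots_cache.get(slot)
            if nodes:
                return nodes[0]
            raise


    NodesManager.get_node_from_slot = _get_node_from_slot_with_fallback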

Happy to create a PR if this makes sense. Sorry if I'm missing something.

Thanks in advance 🙌
