|
1 | 1 | .. _replication-recover:
|
2 | 2 |
|
3 |
| -================================================================================ |
4 | 3 | Recovering from a degraded state
|
5 |
| -================================================================================ |
| 4 | +================================ |
6 | 5 |
|
7 | 6 | A "degraded state" is a situation in which the master becomes unavailable because of a
|
8 | 7 | hardware or network failure, or because of a programming bug.
|
9 | 8 |
|
10 | 9 | .. image:: mr-degraded.svg
|
11 | 10 | :align: center
|
12 | 11 |
|
13 |
| -In a master-replica set, if a master disappears, error messages appear on the |
14 |
| -replicas stating that the connection is lost: |
15 |
| - |
16 |
| -.. code-block:: console |
17 |
| -
|
18 |
| - $ # messages from a replica's log |
19 |
| - 2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. I> can't read row |
20 |
| - 2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. coio.cc:349 !> SystemError |
21 |
| - unexpected EOF when reading from socket, called on fd 17, aka 192.168.0.101:57815, |
22 |
| - peer of 192.168.0.101:3301: Broken pipe |
23 |
| - 2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. I> will retry every 1 second |
24 |
| - 2017-06-14 16:23:10.993 [19153] relay/[::ffff:192.168.0.101]:/101/main I> the replica has closed its socket, exiting |
25 |
| - 2017-06-14 16:23:10.993 [19153] relay/[::ffff:192.168.0.101]:/101/main C> exiting the relay loop |
26 |
| -
|
27 |
| -... and the master's status is reported as "disconnected": |
28 |
| - |
29 |
| -.. code-block:: tarantoolsession |
30 |
| -
|
31 |
| - # report from replica #1 |
32 |
| - tarantool> box.info.replication |
33 |
| - --- |
34 |
| - - 1: |
35 |
| - id: 1 |
36 |
| - uuid: 70e8e9dc-e38d-4046-99e5-d25419267229 |
37 |
| - lsn: 542 |
38 |
| - upstream: |
39 |
| - |
40 |
| - lag: 0.00026607513427734 |
41 |
| - status: disconnected |
42 |
| - idle: 182.36929893494 |
43 |
| - message: connect, called on fd 13, aka 192.168.0.101:58244 |
44 |
| - 2: |
45 |
| - id: 2 |
46 |
| - uuid: fb252ac7-5c34-4459-84d0-54d248b8c87e |
47 |
| - lsn: 0 |
48 |
| - 3: |
49 |
| - id: 3 |
50 |
| - uuid: fd7681d8-255f-4237-b8bb-c4fb9d99024d |
51 |
| - lsn: 0 |
52 |
| - downstream: |
53 |
| - vclock: {1: 542} |
54 |
| - ... |
55 |
| -
|
56 |
| -.. code-block:: tarantoolsession |
57 |
| -
|
58 |
| - # report from replica #2 |
59 |
| - tarantool> box.info.replication |
60 |
| - --- |
61 |
| - - 1: |
62 |
| - id: 1 |
63 |
| - uuid: 70e8e9dc-e38d-4046-99e5-d25419267229 |
64 |
| - lsn: 542 |
65 |
| - upstream: |
66 |
| - |
67 |
| - lag: 0.00027203559875488 |
68 |
| - status: disconnected |
69 |
| - idle: 186.76988101006 |
70 |
| - message: connect, called on fd 13, aka 192.168.0.101:58253 |
71 |
| - 2: |
72 |
| - id: 2 |
73 |
| - uuid: fb252ac7-5c34-4459-84d0-54d248b8c87e |
74 |
| - lsn: 0 |
75 |
| - upstream: |
76 |
| - status: follow |
77 |
| - idle: 186.76960110664 |
78 |
| - |
79 |
| - lag: 0.00020599365234375 |
80 |
| - 3: |
81 |
| - id: 3 |
82 |
| - uuid: fd7681d8-255f-4237-b8bb-c4fb9d99024d |
83 |
| - lsn: 0 |
84 |
| - ... |
85 |
| -
|
86 |
| -To declare that one of the replicas must now take over as a new master: |
87 |
| - |
88 |
| -1. Make sure that the old master is gone for good: |
89 |
| - |
90 |
| - * change network routing rules to avoid any more packets being delivered to |
91 |
| - the master, or |
92 |
| - * shut down the master instance, if you have access to the machine, or |
93 |
| - * power off the container or the machine. |
94 |
| - |
95 |
| -2. Say ``box.cfg{read_only=false, listen=URI}`` on the replica, and |
96 |
| - ``box.cfg{replication=URI}`` on the other replicas in the set. |
97 |
| - |
98 |
| -.. NOTE:: |
99 |
| - |
100 |
| - If there are updates on the old master that were not propagated before the |
101 |
| - old master went down, |
102 |
| - :ref:`re-apply them manually <admin-disaster_recovery-master_replica>` to the |
103 |
| - new master using ``tt cat`` and ``tt play`` commands. |
104 |
| - |
105 |
| -There is no automatic way for a replica to detect that the master is gone |
106 |
| -forever, since sources of failure and replication environments vary |
107 |
| -significantly. So the detection of degraded state requires an external observer. |
| 12 | +- In a master-replica set with manual failover, if the master disappears, the replicas log |
| 13 | + error messages stating that the connection is lost: |
| 14 | + |
| 15 | + .. code-block:: console |
| 16 | +
|
| 17 | + 2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 I> can't read row |
| 18 | + 2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 19, aka 127.0.0.1:55932, peer of 127.0.0.1:3301: Broken pipe |
| 19 | + 2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 I> will retry every 1.00 second |
| 20 | + 2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302, peer of 127.0.0.1:55940: Broken pipe |
| 21 | + 2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main I> exiting the relay loop |
| 22 | +
|
| 23 | +- In a master-replica set with automated failover, the log also contains Raft messages that show the election of a new master: |
| 24 | + |
| 25 | + .. code-block:: console |
| 26 | +
|
| 27 | + 2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 I> can't read row |
| 28 | + 2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 24, aka 127.0.0.1:55687, peer of 127.0.0.1:3302: Broken pipe |
| 29 | + 2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 I> will retry every 1.00 second |
| 30 | + 2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka 127.0.0.1:3301, peer of 127.0.0.1:55695: Broken pipe |
| 31 | + 2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main I> exiting the relay loop |
| 32 | + 2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: message {term: 3, vote: 2, state: candidate, vclock: {1: 9}} from 2 |
| 33 | + 2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: received a newer term from 2 |
| 34 | + 2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: bump term to 3, follow |
| 35 | + 2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: vote for 2, follow |
| 36 | + 2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3} |
| 37 | + 2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3, vote: 2} |
| 38 | + 2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: message {term: 3, vote: 2, leader: 2, state: leader} from 2 |
| 39 | + 2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: vote request is skipped - this is a notification about a vote for a third node, not a request |
| 40 | + 2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: leader is 2, follow |
| 41 | +
|
| 42 | +
|
| 43 | +
|
| 44 | +When you execute ``box.info.replication`` on a replica, the master's status is reported as ``disconnected``: |
| 45 | + |
| 46 | +.. include:: /how-to/replication/repl_bootstrap_auto.rst |
| 47 | + :start-after: box_info_replication_auto_leader_disconnected_start |
| 48 | + :end-before: box_info_replication_auto_leader_disconnected_end |
| 49 | + |
| 50 | + |
| 51 | +To perform failover, see the section that matches your configuration: |
| 52 | + |
| 53 | +- Master-replica with manual failover: :ref:`Performing manual failover <replication-controlled_failover>` |
| 54 | +- Master-replica with automated failover: :ref:`Testing automated failover <replication-automated-failover-testing>` |
| 55 | + |