Commit 0e7b52a: 3.0 configuration: replication administration
1 parent 2b08f1d

File tree

7 files changed (+72, -150 lines)


doc/book/admin/replication/repl_monitoring.rst

Lines changed: 19 additions & 47 deletions
@@ -4,49 +4,26 @@ Monitoring a replica set
 ========================
 
 To learn what instances belong to the replica set and obtain statistics for all
-these instances, issue a :doc:`/reference/reference_lua/box_info/replication` request:
-
-.. code-block:: tarantoolsession
-
-    tarantool> box.info.replication
-    ---
-    replication:
-      1:
-        id: 1
-        uuid: b8a7db60-745f-41b3-bf68-5fcce7a1e019
-        lsn: 88
-      2:
-        id: 2
-        uuid: cd3c7da2-a638-4c5d-ae63-e7767c3a6896
-        lsn: 31
-        upstream:
-          status: follow
-          idle: 43.187747001648
-          peer: [email protected]:3301
-          lag: 0
-        downstream:
-          vclock: {1: 31}
-      3:
-        id: 3
-        uuid: e38ef895-5804-43b9-81ac-9f2cd872b9c4
-        lsn: 54
-        upstream:
-          status: follow
-          idle: 43.187621831894
-          peer: [email protected]:3301
-          lag: 2
-        downstream:
-          vclock: {1: 54}
-    ...
-
-This report is for a master-master replica set of three instances, each having
-its own instance id, UUID and log sequence number.
-
-.. image:: /concepts/replication/images/mm-3m-mesh.svg
+these instances, execute a :ref:`box.info.replication <box_info_replication>` request.
+The output below shows the replication status for a replica set containing one :ref:`master and two replicas <replication-master_replica_bootstrap>`:
+
+.. include:: /how-to/replication/repl_bootstrap.rst
+    :start-after: box_info_replication_manual_leader_start
+    :end-before: box_info_replication_manual_leader_end
+
+The following diagram illustrates the ``upstream`` and ``downstream`` connections for ``box.info.replication`` executed at the master instance (``instance001``):
+
+.. image:: _images/box_info_replication_instance001.png
+    :align: center
+    :alt: replication status on master
+
+If ``box.info.replication`` is executed on ``instance002``, the ``upstream`` and ``downstream`` connections look as follows:
+
+.. image:: _images/box_info_replication_instance002.png
     :align: center
+    :alt: replication status on replica
 
-The request was issued at master #1, and the reply includes statistics for the
-other two masters, given in regard to master #1.
+This means that statistics for replicas are given in regard to the instance on which ``box.info.replication`` is executed.
 
 The primary indicators of replication health are:
 
@@ -74,9 +51,4 @@ The primary indicators of replication health are:
   machines, do not be surprised if it’s negative: a time drift may lead to the
   remote master clock being consistently behind the local instance's clock.
 
-For multi-master configurations, ``lag`` is the maximal lag.
-
-For better understanding, see the following diagram illustrating the ``upstream`` and ``downstream`` connections within the replica set of three instances:
-
-.. image:: /concepts/replication/images/replication.svg
-    :align: left
+For a :ref:`master-master <replication-bootstrap-master-master>` configuration, ``lag`` is the maximal lag.
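
The health indicators described in this file can also be polled from Lua on a live instance. The following sketch is illustrative only (it is not part of this commit, and the ``0.5``-second threshold is an arbitrary example value, not a recommendation):

.. code-block:: lua

    -- Hedged sketch: report upstreams that are not following or lag too far behind.
    local max_lag = 0.5  -- example threshold, not a recommended value

    for id, replica in pairs(box.info.replication) do
        local upstream = replica.upstream
        -- The entry for the local instance has no upstream section.
        if upstream ~= nil then
            if upstream.status ~= 'follow' then
                print(('replica %d: upstream status is %s'):format(id, tostring(upstream.status)))
            elseif upstream.lag ~= nil and upstream.lag > max_lag then
                print(('replica %d: lag %.6f exceeds %.1f'):format(id, upstream.lag, max_lag))
            end
        end
    end

Such a loop can be wrapped in a fiber or called from an external monitoring hook; the output format here is arbitrary.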
Lines changed: 45 additions & 97 deletions
@@ -1,107 +1,55 @@
 .. _replication-recover:
 
-================================================================================
 Recovering from a degraded state
-================================================================================
+================================
 
 "Degraded state" is a situation when the master becomes unavailable -- due to
 hardware or network failure, or due to a programming bug.
 
 .. image:: mr-degraded.svg
     :align: center
 
-In a master-replica set, if a master disappears, error messages appear on the
-replicas stating that the connection is lost:
-
-.. code-block:: console
-
-    $ # messages from a replica's log
-    2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. I> can't read row
-    2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. coio.cc:349 !> SystemError
-    unexpected EOF when reading from socket, called on fd 17, aka 192.168.0.101:57815,
-    peer of 192.168.0.101:3301: Broken pipe
-    2017-06-14 16:23:10.993 [19153] main/105/applier/[email protected]. I> will retry every 1 second
-    2017-06-14 16:23:10.993 [19153] relay/[::ffff:192.168.0.101]:/101/main I> the replica has closed its socket, exiting
-    2017-06-14 16:23:10.993 [19153] relay/[::ffff:192.168.0.101]:/101/main C> exiting the relay loop
-
-... and the master's status is reported as "disconnected":
-
-.. code-block:: tarantoolsession
-
-    # report from replica #1
-    tarantool> box.info.replication
-    ---
-    - 1:
-        id: 1
-        uuid: 70e8e9dc-e38d-4046-99e5-d25419267229
-        lsn: 542
-        upstream:
-          peer: [email protected]:3301
-          lag: 0.00026607513427734
-          status: disconnected
-          idle: 182.36929893494
-          message: connect, called on fd 13, aka 192.168.0.101:58244
-      2:
-        id: 2
-        uuid: fb252ac7-5c34-4459-84d0-54d248b8c87e
-        lsn: 0
-      3:
-        id: 3
-        uuid: fd7681d8-255f-4237-b8bb-c4fb9d99024d
-        lsn: 0
-        downstream:
-          vclock: {1: 542}
-    ...
-
-.. code-block:: tarantoolsession
-
-    # report from replica #2
-    tarantool> box.info.replication
-    ---
-    - 1:
-        id: 1
-        uuid: 70e8e9dc-e38d-4046-99e5-d25419267229
-        lsn: 542
-        upstream:
-          peer: [email protected]:3301
-          lag: 0.00027203559875488
-          status: disconnected
-          idle: 186.76988101006
-          message: connect, called on fd 13, aka 192.168.0.101:58253
-      2:
-        id: 2
-        uuid: fb252ac7-5c34-4459-84d0-54d248b8c87e
-        lsn: 0
-        upstream:
-          status: follow
-          idle: 186.76960110664
-          peer: [email protected]:3301
-          lag: 0.00020599365234375
-      3:
-        id: 3
-        uuid: fd7681d8-255f-4237-b8bb-c4fb9d99024d
-        lsn: 0
-    ...
-
-To declare that one of the replicas must now take over as a new master:
-
-1. Make sure that the old master is gone for good:
-
-   * change network routing rules to avoid any more packets being delivered to
-     the master, or
-   * shut down the master instance, if you have access to the machine, or
-   * power off the container or the machine.
-
-2. Say ``box.cfg{read_only=false, listen=URI}`` on the replica, and
-   ``box.cfg{replication=URI}`` on the other replicas in the set.
-
-.. NOTE::
-
-   If there are updates on the old master that were not propagated before the
-   old master went down,
-   :ref:`re-apply them manually <admin-disaster_recovery-master_replica>` to the
-   new master using ``tt cat`` and ``tt play`` commands.
-
-There is no automatic way for a replica to detect that the master is gone
-forever, since sources of failure and replication environments vary
-significantly. So the detection of degraded state requires an external observer.
+- In a master-replica set with manual failover, if a master disappears, error messages appear on the
+  replicas stating that the connection is lost:
+
+  .. code-block:: console
+
+      2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 I> can't read row
+      2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 19, aka 127.0.0.1:55932, peer of 127.0.0.1:3301: Broken pipe
+      2023-12-04 13:19:04.724 [16755] main/110/applier/[email protected]:3301 I> will retry every 1.00 second
+      2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 23, aka 127.0.0.1:3302, peer of 127.0.0.1:55940: Broken pipe
+      2023-12-04 13:19:04.724 [16755] relay/127.0.0.1:55940/101/main I> exiting the relay loop
+
+- In a master-replica set with automated failover, a log should contain Raft messages showing the process of a new master's election:
+
+  .. code-block:: console
+
+      2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 I> can't read row
+      2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 24, aka 127.0.0.1:55687, peer of 127.0.0.1:3302: Broken pipe
+      2023-12-04 13:16:56.340 [16615] main/111/applier/[email protected]:3302 I> will retry every 1.00 second
+      2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka 127.0.0.1:3301, peer of 127.0.0.1:55695: Broken pipe
+      2023-12-04 13:16:56.340 [16615] relay/127.0.0.1:55695/101/main I> exiting the relay loop
+      2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: message {term: 3, vote: 2, state: candidate, vclock: {1: 9}} from 2
+      2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: received a newer term from 2
+      2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: bump term to 3, follow
+      2023-12-04 13:16:59.690 [16615] main/112/applier/[email protected]:3303 I> RAFT: vote for 2, follow
+      2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3}
+      2023-12-04 13:16:59.691 [16615] main/119/raft_worker I> RAFT: persisted state {term: 3, vote: 2}
+      2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: message {term: 3, vote: 2, leader: 2, state: leader} from 2
+      2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: vote request is skipped - this is a notification about a vote for a third node, not a request
+      2023-12-04 13:16:59.691 [16615] main/112/applier/[email protected]:3303 I> RAFT: leader is 2, follow
+
+The master's status is reported as ``disconnected`` when executing ``box.info.replication`` on a replica:
+
+.. include:: /how-to/replication/repl_bootstrap_auto.rst
+    :start-after: box_info_replication_auto_leader_disconnected_start
+    :end-before: box_info_replication_auto_leader_disconnected_end
+
+Performing failover:
+
+- Master-replica: :ref:`Performing manual failover <replication-controlled_failover>`
+- Master-replica: :ref:`Testing automated failover <replication-automated-failover-testing>`
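
The manual promotion quoted in the removed text (``box.cfg{read_only=false, listen=URI}`` on the promoted replica, ``box.cfg{replication=URI}`` on the others) can be sketched as follows; all URIs and credentials below are placeholders, not values taken from this commit:

.. code-block:: lua

    -- Hedged sketch of the manual promotion; URIs and credentials are placeholders.
    -- On the replica chosen as the new master:
    box.cfg{read_only = false, listen = '192.168.0.102:3301'}

    -- On each remaining replica, repoint replication at the new master:
    box.cfg{replication = {'replicator:password@192.168.0.102:3301'}}

As the removed note says, updates that never reached the replicas must still be re-applied to the new master by hand before it takes traffic.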

doc/how-to/replication/repl_bootstrap.rst

Lines changed: 4 additions & 0 deletions
@@ -338,6 +338,8 @@ After adding ``instance003`` to the configuration and starting it, configuration
 3. Execute ``box.info.replication`` to check a replica set status.
    Make sure that ``upstream.status`` and ``downstream.status`` are ``follow`` for ``instance003``.
 
+   .. box_info_replication_manual_leader_start
+
    .. code-block:: console
 
        manual_leader:instance001> box.info.replication
@@ -379,6 +381,8 @@ After adding ``instance003`` to the configuration and starting it, configuration
            lag: 0
        ...
 
+.. box_info_replication_manual_leader_end
+
 
 
 .. _replication-controlled_failover:

doc/how-to/replication/repl_bootstrap_auto.rst

Lines changed: 4 additions & 0 deletions
@@ -311,6 +311,8 @@ To test how automated failover works if the current master is stopped, follow th
    - ``upstream.status`` is ``disconnected``.
    - ``downstream.status`` is ``stopped``.
 
+   .. box_info_replication_auto_leader_disconnected_start
+
    .. code-block:: console
 
        auto_leader:instance001> box.info.replication
@@ -354,6 +356,8 @@ To test how automated failover works if the current master is stopped, follow th
            lag: 0.00051403045654297
        ...
 
+.. box_info_replication_auto_leader_disconnected_end
+
 
 4. Start ``instance002`` back using ``tt start``:
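
The automated failover exercised in this how-to relies on Tarantool's Raft-based leader election. As a reminder of the low-level knobs behind it, here is a hedged sketch of the corresponding ``box.cfg`` options (the guide itself configures this through the 3.0 YAML configuration; the values here are examples, not the guide's settings):

.. code-block:: lua

    -- Hedged sketch: instance-level options behind automated leader election.
    box.cfg{
        election_mode = 'candidate',     -- the instance can vote and be elected leader
        election_timeout = 5,            -- seconds before starting a new election round
        replication_synchro_quorum = 2,  -- confirmations required for synchronous writes
    }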

doc/reference/reference_lua/box_info/replication.rst

Lines changed: 0 additions & 6 deletions
@@ -136,9 +136,3 @@ box.info.replication
     from socket'``, and ``system_message = 'Broken pipe'``.
     See also :ref:`degraded state <replication-recover>`.
 
-
-For better understanding, see the following diagram illustrating the ``upstream`` and ``downstream`` connections within the replica set of three instances:
-
-.. image:: /concepts/replication/images/replication.svg
-    :align: left
-
