Commit 0635702

3.0 config: replication administration
1 parent 3eab90f commit 0635702

12 files changed: +225 -284 lines changed


doc/book/admin/disaster_recovery.rst

Lines changed: 110 additions & 74 deletions
@@ -1,126 +1,162 @@
11
.. _admin-disaster_recovery:
22

3-
================================================================================
43
Disaster recovery
5-
================================================================================
4+
=================
65

7-
The minimal fault-tolerant Tarantool configuration would be a
8-
:ref:`replication cluster<replication-topologies>`
6+
The minimal fault-tolerant Tarantool configuration would be a :ref:`replica set <replication-architecture>`
97
that includes a master and a replica, or two masters.
8+
The basic recommendation is to configure all Tarantool instances in a replica set to create :ref:`snapshot files <index-box_persistence>` on a regular basis.
109

11-
The basic recommendation is to configure all Tarantool instances in a cluster to
12-
create :ref:`snapshot files <index-box_persistence>` at a regular basis.
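For example, a quick check of how often snapshots are created (a sketch; the ``app:instance001`` prompt and the value shown are illustrative, 3600 seconds being the default ``box.cfg.checkpoint_interval``):

.. code-block:: tarantoolsession

    app:instance001> box.cfg.checkpoint_interval -- snapshot interval, in seconds
    ---
    - 3600
    ...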
10+
Here are action plans for typical crash scenarios.
1311

14-
Here follow action plans for typical crash scenarios.
1512

1613
.. _admin-disaster_recovery-master_replica:
1714

18-
--------------------------------------------------------------------------------
1915
Master-replica
20-
--------------------------------------------------------------------------------
16+
--------------
2117

22-
Configuration: One master and one replica.
18+
.. _admin-disaster_recovery-master_replica_manual_failover:
2319

24-
Problem: The master has crashed.
20+
Master crash: manual failover
21+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2522

26-
Your actions:
23+
**Configuration:** master-replica (:ref:`manual failover <replication-master_replica_bootstrap>`).
2724

28-
1. Ensure the master is stopped for good. For example, log in to the master
29-
machine and use ``systemctl stop tarantool@<instance_name>``.
25+
**Problem:** The master has crashed.
3026

31-
2. Switch the replica to master mode by setting
32-
:ref:`box.cfg.read_only <cfg_basic-read_only>` parameter to *false* and let
33-
the load be handled by the replica (effective master).
27+
**Actions:**
3428

35-
3. Set up a replacement for the crashed master on a spare host, with
36-
:ref:`replication <cfg_replication-replication>` parameter set to replica
37-
(effective master), so it begins to catch up with the new master’s state.
38-
The new instance should have :ref:`box.cfg.read_only <cfg_basic-read_only>`
39-
parameter set to *true*.
29+
1. Ensure the master is stopped.
30+
For example, log in to the master machine and use ``tt stop``.
4031

41-
You lose the few transactions in the master
42-
:ref:`write ahead log file <index-box_persistence>`, which it may have not
43-
transferred to the replica before crash. If you were able to salvage the master
44-
.xlog file, you may be able to recover these. In order to do it:
32+
2. Configure a new replica set leader using the :ref:`<replicaset_name>.leader <configuration_reference_replicasets_name_leader>` option.
4533

46-
1. Find out the position of the crashed master, as reflected on the new master.
34+
3. Reload the configuration on all instances using :ref:`config:reload() <config-module>`.
4735

48-
a. Find out instance UUID from the crashed master :ref:`xlog <internals-wal>`:
36+
4. Make sure that the new replica set leader is the master using :ref:`box.info.ro <box_introspection-box_info>`, as shown in the example below.
4937

50-
.. code-block:: console
38+
5. Remove the crashed master from the replica set as described in :ref:`Removing instances <replication-remove_instances>`.
5139

52-
$ head -5 *.xlog | grep Instance
53-
Instance: ed607cad-8b6d-48d8-ba0b-dae371b79155
40+
6. Set up a replacement for the crashed master on a spare host as described in :ref:`Adding instances <replication-add_instances>`.
5441

55-
b. On the new master, use the UUID to find the position:
42+
See also: :ref:`Performing manual failover <replication-controlled_failover>`.
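As a minimal sketch of steps 3 and 4, assuming ``instance002`` of application ``app`` is the new leader, the reload and the read-only check could look as follows:

.. code-block:: tarantoolsession

    app:instance002> require('config'):reload()
    ---
    ...
    app:instance002> box.info.ro -- false means the instance is now the writable master
    ---
    - false
    ...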
5643

57-
.. code-block:: tarantoolsession
5844

59-
tarantool> box.info.vclock[box.space._cluster.index.uuid:select{'ed607cad-8b6d-48d8-ba0b-dae371b79155'}[1][1]]
60-
---
61-
- 23425
62-
<...>
45+
.. _admin-disaster_recovery-master_replica_auto_failover:
6346

64-
2. Play the records from the crashed .xlog to the new master, starting from the
65-
new master position:
47+
Master crash: automated failover
48+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6649

67-
a. Issue this request locally at the new master's machine to find out
68-
instance ID of the new master:
50+
**Configuration:** master-replica (:ref:`automated failover <replication-bootstrap-auto>`).
6951

70-
.. code-block:: tarantoolsession
52+
**Problem:** The master has crashed.
7153

72-
tarantool> box.space._cluster:select{}
73-
---
74-
- - [1, '88580b5c-4474-43ab-bd2b-2409a9af80d2']
75-
...
54+
**Actions:**
7655

77-
b. Play the records to the new master:
56+
1. Use ``box.info.election`` to make sure a new master is elected automatically (see the example below).
7857

79-
.. code-block:: console
58+
2. Remove the crashed master from the replica set.
59+
60+
3. Set up a replacement for the crashed master on a spare host.
61+
Learn more from :ref:`Adding and removing instances <replication-automated-failover-add-remove-instances>`.
62+
63+
See also: :ref:`Testing automated failover <replication-automated-failover-testing>`.
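For step 1, a sketch of the election check (the instance name and the numeric values are illustrative); ``state: leader`` on the queried instance means it has been elected the new master:

.. code-block:: tarantoolsession

    app:instance002> box.info.election
    ---
    - state: leader
      term: 3
      vote: 2
      leader: 2
    ...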
64+
65+
66+
.. _admin-disaster_recovery-master_replica_data_loss:
67+
68+
Data loss
69+
~~~~~~~~~
70+
71+
**Configuration:** master-replica.
72+
73+
**Problem:** Some transactions are missing on a replica after the master has crashed.
74+
75+
**Actions:**
76+
77+
You lose a few transactions in the master
78+
:ref:`write-ahead log file <index-box_persistence>`, which may not have been
79+
transferred to the replica before the crash. If you were able to salvage the master
80+
``.xlog`` file, you may be able to recover these.
81+
82+
1. Find out instance UUID from the crashed master :ref:`xlog <internals-wal>`:
83+
84+
.. code-block:: console
85+
86+
$ head -5 var/lib/instance001/*.xlog | grep Instance
87+
Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
88+
89+
2. On the new master, use the UUID to find the position:
90+
91+
.. code-block:: tarantoolsession
92+
93+
app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
94+
---
95+
- 999
96+
...
97+
98+
3. :ref:`Play the records <tt-play>` from the crashed ``.xlog`` to the new master, starting from the
99+
new master's position:
100+
101+
.. code-block:: console
102+
103+
$ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
104+
--from 1000 \
105+
--replica 1 \
106+
--username admin --password secret
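Once the replay completes, one way to confirm that the salvaged records were applied is to re-check the new master's vclock component for the crashed instance's ID (``1`` in this example); the resulting value is illustrative and should simply exceed the LSN found earlier:

.. code-block:: tarantoolsession

    app:instance002> box.info.vclock[1] -- should now be greater than 999
    ---
    - 1005
    ...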
80107
81-
$ tt play <new_master_uri> <xlog_file> --from 23425 --replica 1
82108
83109
.. _admin-disaster_recovery-master_master:
84110

85-
--------------------------------------------------------------------------------
86111
Master-master
87-
--------------------------------------------------------------------------------
112+
-------------
113+
114+
**Configuration:** :ref:`master-master <replication-bootstrap-master-master>`.
88115

89-
Configuration: Two masters.
116+
**Problem:** One master has crashed.
90117

91-
Problem: Master#1 has crashed.
118+
**Actions:**
92119

93-
Your actions:
120+
1. Let the load be handled by the other master alone.
94121

95-
1. Let the load be handled by master#2 (effective master) alone.
122+
2. Remove the crashed master from the replica set.
123+
124+
3. Set up a replacement for the crashed master on a spare host.
125+
Learn more from :ref:`Adding and removing instances <replication-master-master-add-remove-instances>`.
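Before removing the crashed master, it can be useful to confirm on the surviving master that it accepts writes and that replication from the crashed peer is down. A sketch, assuming the surviving master is ``instance001`` and the crashed peer has replica ID ``2``:

.. code-block:: tarantoolsession

    app:instance001> box.info.ro -- the surviving master is writable
    ---
    - false
    ...
    app:instance001> box.info.replication[2].upstream.status -- the crashed peer
    ---
    - disconnected
    ...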
96126

97-
2. Follow the same steps as in the
98-
:ref:`master-replica <admin-disaster_recovery-master_replica>` recovery scenario
99-
to create a new master and salvage lost data.
100127

101128
.. _admin-disaster_recovery-data_loss:
102129

103-
--------------------------------------------------------------------------------
104130
Data loss
105-
--------------------------------------------------------------------------------
131+
---------
132+
133+
**Configuration:** master-replica or master-master.
134+
135+
**Problem:** Data was deleted at one master and this data loss was propagated to the other node (master or replica).
136+
137+
**Actions:**
138+
139+
1. Put all nodes in read-only mode.
140+
Depending on the :ref:`replication.failover <configuration_reference_replication_failover>` mode, this can be done as follows:
141+
142+
- ``manual``: set the replica set leader to ``null``.
143+
- ``election``: switch from the ``election`` failover mode to ``manual`` and set the replica set leader to ``null``.
144+
- ``off``: set ``database.mode`` to ``ro``.
106145

107-
Configuration: Master-master or master-replica.
146+
Reload the configuration on all instances using the ``reload()`` function provided by the :ref:`config <config-module>` module.
108147

109-
Problem: Data was deleted at one master and this data loss was propagated to the
110-
other node (master or replica).
148+
2. Turn off deletion of expired checkpoints with :doc:`/reference/reference_lua/box_backup/start` (see the example below).
149+
This prevents the Tarantool garbage collector from removing files
150+
made with older checkpoints until :doc:`/reference/reference_lua/box_backup/stop` is called.
111151

112-
The following steps are applicable only to data in memtx storage engine.
113-
Your actions:
152+
3. Get the latest valid :ref:`.snap file <internals-snapshot>` and
153+
use the ``tt cat`` command to calculate at which LSN the data loss occurred.
114154

115-
1. Put all nodes in :ref:`read-only mode <cfg_basic-read_only>` and disable
116-
deletion of expired checkpoints with :doc:`/reference/reference_lua/box_backup/start`.
117-
This will prevent the Tarantool garbage collector from removing files
118-
made with older checkpoints until :doc:`/reference/reference_lua/box_backup/stop` is called.
155+
4. Start a new instance and use the :ref:`tt play <tt-play>` command to
156+
play the contents of the ``.snap`` and ``.xlog`` files to it, up to the calculated LSN.
119157

120-
2. Get the latest valid :ref:`.snap file <internals-snapshot>` and
121-
use ``tt cat`` command to calculate at which lsn the data loss occurred.
158+
5. Bootstrap a new replica from the recovered master.
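For step 2, the checkpoint garbage collector is paused and later resumed from the instance console. A sketch with an illustrative file list (the actual output of ``box.backup.start()`` lists every file required for a backup):

.. code-block:: tarantoolsession

    app:instance001> box.backup.start() -- keep old checkpoint files on disk
    ---
    - - 00000000000000000000.snap
    ...
    app:instance001> box.backup.stop() -- resume garbage collection when done
    ---
    ...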
122159

123-
3. Start a new instance (instance#1) and use ``tt play`` command to
124-
play to it the contents of .snap/.xlog files up to the calculated lsn.
160+
.. NOTE::
125161

126-
4. Bootstrap a new replica from the recovered master (instance#1).
162+
The steps above are applicable only to data in the memtx storage engine.

doc/book/admin/replication/index.rst

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Replication administration
22
==========================
33

44
.. toctree::
5-
:maxdepth: 2
5+
:maxdepth: 1
66

77
repl_monitoring
88
repl_recover

doc/book/admin/replication/repl_monitoring.rst

Lines changed: 22 additions & 48 deletions
@@ -4,49 +4,28 @@ Monitoring a replica set
44
========================
55

66
To learn what instances belong to the replica set and obtain statistics for all
7-
these instances, issue a :doc:`/reference/reference_lua/box_info/replication` request:
8-
9-
.. code-block:: tarantoolsession
10-
11-
tarantool> box.info.replication
12-
---
13-
replication:
14-
1:
15-
id: 1
16-
uuid: b8a7db60-745f-41b3-bf68-5fcce7a1e019
17-
lsn: 88
18-
2:
19-
id: 2
20-
uuid: cd3c7da2-a638-4c5d-ae63-e7767c3a6896
21-
lsn: 31
22-
upstream:
23-
status: follow
24-
idle: 43.187747001648
25-
peer: [email protected]:3301
26-
lag: 0
27-
downstream:
28-
vclock: {1: 31}
29-
3:
30-
id: 3
31-
uuid: e38ef895-5804-43b9-81ac-9f2cd872b9c4
32-
lsn: 54
33-
upstream:
34-
status: follow
35-
idle: 43.187621831894
36-
peer: [email protected]:3301
37-
lag: 2
38-
downstream:
39-
vclock: {1: 54}
40-
...
41-
42-
This report is for a master-master replica set of three instances, each having
43-
its own instance id, UUID and log sequence number.
44-
45-
.. image:: /concepts/replication/images/mm-3m-mesh.svg
7+
these instances, execute a :ref:`box.info.replication <box_info_replication>` request.
8+
The output below shows the replication status for a replica set containing one :ref:`master and two replicas <replication-master_replica_bootstrap>`:
9+
10+
.. include:: /how-to/replication/repl_bootstrap.rst
11+
:start-after: box_info_replication_manual_leader_start
12+
:end-before: box_info_replication_manual_leader_end
13+
14+
The following diagram illustrates the ``upstream`` and ``downstream`` connections if ``box.info.replication`` is executed on the master instance (``instance001``):
15+
16+
.. image:: _images/box_info_replication_instance001.png
17+
:width: 600
18+
:align: center
19+
:alt: replication status on master
20+
21+
If ``box.info.replication`` is executed on ``instance002``, the ``upstream`` and ``downstream`` connections look as follows:
22+
23+
.. image:: _images/box_info_replication_instance002.png
24+
:width: 600
4625
:align: center
26+
:alt: replication status on replica
4727

48-
The request was issued at master #1, and the reply includes statistics for the
49-
other two masters, given in regard to master #1.
28+
This means that statistics for replicas are given relative to the instance on which ``box.info.replication`` is executed.
5029

5130
The primary indicators of replication health are:
5231

@@ -68,15 +47,10 @@ The primary indicators of replication health are:
6847
* :ref:`lag <box_info_replication_upstream_lag>`: the time difference between
6948
the local time at the instance, recorded when the event was received, and the
7049
local time at another master recorded when the event was written to the
71-
:ref:`write ahead log <internals-wal>` on that master.
50+
:ref:`write-ahead log <internals-wal>` on that master.
7251

7352
Since the ``lag`` calculation uses the operating system clocks from two different
7453
machines, do not be surprised if it’s negative: a time drift may lead to the
7554
remote master clock being consistently behind the local instance's clock.
7655

77-
For multi-master configurations, ``lag`` is the maximal lag.
78-
79-
For better understanding, see the following diagram illustrating the ``upstream`` and ``downstream`` connections within the replica set of three instances:
80-
81-
.. image:: /concepts/replication/images/replication.svg
82-
:align: left
56+
For a :ref:`master-master <replication-bootstrap-master-master>` configuration, ``lag`` is the maximal lag.
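The same indicators can be read programmatically. For example, on a replica, a sketch like the following (the instance name and the lag value are illustrative) checks the upstream connection to the master with ID ``1``:

.. code-block:: tarantoolsession

    app:instance002> box.info.replication[1].upstream.status
    ---
    - follow
    ...
    app:instance002> box.info.replication[1].upstream.lag
    ---
    - 0.00031423568725586
    ...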
