Commit 4860379

3.0 configuration: replication administration
1 parent 8aedebc commit 4860379

12 files changed (+220 -277 lines changed)


doc/book/admin/disaster_recovery.rst

Lines changed: 109 additions & 71 deletions
@@ -1,126 +1,164 @@
 .. _admin-disaster_recovery:
 
-================================================================================
 Disaster recovery
-================================================================================
+=================
 
 The minimal fault-tolerant Tarantool configuration would be a
-:ref:`replication cluster<replication-topologies>`
+:ref:`replica set <replication-architecture>`
 that includes a master and a replica, or two masters.
 
-The basic recommendation is to configure all Tarantool instances in a cluster to
+The basic recommendation is to configure all Tarantool instances in a replica set to
 create :ref:`snapshot files <index-box_persistence>` on a regular basis.
 
 Here follow action plans for typical crash scenarios.
 
 .. _admin-disaster_recovery-master_replica:
 
---------------------------------------------------------------------------------
 Master-replica
---------------------------------------------------------------------------------
+--------------
 
-Configuration: One master and one replica.
+.. _admin-disaster_recovery-master_replica_manual_failover:
 
-Problem: The master has crashed.
+Master crash: manual failover
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Your actions:
+**Configuration:** master-replica (:ref:`manual failover <replication-master_replica_bootstrap>`).
 
-1. Ensure the master is stopped for good. For example, log in to the master
-   machine and use ``systemctl stop tarantool@<instance_name>``.
+**Problem:** The master has crashed.
 
-2. Switch the replica to master mode by setting
-   :ref:`box.cfg.read_only <cfg_basic-read_only>` parameter to *false* and let
-   the load be handled by the replica (effective master).
+**Actions:**
 
-3. Set up a replacement for the crashed master on a spare host, with
-   :ref:`replication <cfg_replication-replication>` parameter set to replica
-   (effective master), so it begins to catch up with the new master's state.
-   The new instance should have :ref:`box.cfg.read_only <cfg_basic-read_only>`
-   parameter set to *true*.
+1. Ensure the master is stopped.
+   For example, log in to the master machine and use ``tt stop``.
 
-You lose the few transactions in the master
-:ref:`write ahead log file <index-box_persistence>`, which it may have not
-transferred to the replica before crash. If you were able to salvage the master
-.xlog file, you may be able to recover these. In order to do it:
+2. Configure a new replica set leader using the :ref:`<replicaset_name>.leader <configuration_reference_replicasets_name_leader>` option.
+
+3. Reload the configuration on all instances using :ref:`config:reload() <config-module>`.
+
+4. Make sure that the new replica set leader is a master using :ref:`box.info.ro <box_introspection-box_info>`.
+
+5. Remove the crashed master from the replica set as described in :ref:`Removing instances <replication-remove_instances>`.
+
+6. Set up a replacement for the crashed master on a spare host as described in :ref:`Adding instances <replication-add_instances>`.
+
+See also: :ref:`Performing manual failover <replication-controlled_failover>`.
+
+
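
For illustration, after steps 3 and 4 the promoted instance should report that it has left read-only mode. The session below is a sketch that assumes the new leader is named ``instance002``:

.. code-block:: tarantoolsession

    app:instance002> require('config'):reload()
    ---
    ...

    app:instance002> box.info.ro
    ---
    - false
    ...
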
+.. _admin-disaster_recovery-master_replica_auto_failover:
+
+Master crash: automated failover
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Configuration:** master-replica (:ref:`automated failover <replication-bootstrap-auto>`).
+
+**Problem:** The master has crashed.
+
+**Actions:**
+
+1. Use ``box.info.election`` to make sure a new master is elected automatically.
+
+2. Remove the crashed master from the replica set.
 
-1. Find out the position of the crashed master, as reflected on the new master.
+3. Set up a replacement for the crashed master on a spare host.
+   Learn more from :ref:`Adding and removing instances <replication-automated-failover-add-remove-instances>`.
 
-   a. Find out instance UUID from the crashed master :ref:`xlog <internals-wal>`:
+See also: :ref:`Testing automated failover <replication-automated-failover-testing>`.
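
As a sketch of step 1, ``box.info.election`` on a surviving instance should show that one of the remaining nodes has won the election. The instance name and field values below are illustrative:

.. code-block:: tarantoolsession

    app:instance002> box.info.election
    ---
    - leader_name: instance002
      state: leader
      vote: 2
      term: 3
      leader: 2
    ...
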
 
-      .. code-block:: console
 
-         $ head -5 *.xlog | grep Instance
-         Instance: ed607cad-8b6d-48d8-ba0b-dae371b79155
+.. _admin-disaster_recovery-master_replica_data_loss:
 
-   b. On the new master, use the UUID to find the position:
+Data loss
+~~~~~~~~~
+
+**Configuration:** master-replica.
+
+**Problem:** Some transactions are missing on a replica after the master has crashed.
+
+**Actions:**
+
+You lose the few transactions in the master
+:ref:`write ahead log file <index-box_persistence>`, which it may not have
+transferred to the replica before the crash. If you were able to salvage the master
+``.xlog`` file, you may be able to recover these.
 
-      .. code-block:: tarantoolsession
+1. Find out the instance UUID from the crashed master :ref:`xlog <internals-wal>`:
 
-         tarantool> box.info.vclock[box.space._cluster.index.uuid:select{'ed607cad-8b6d-48d8-ba0b-dae371b79155'}[1][1]]
-         ---
-         - 23425
-         <...>
+   .. code-block:: console
 
-2. Play the records from the crashed .xlog to the new master, starting from the
-   new master position:
+      $ head -5 var/lib/instance001/*.xlog | grep Instance
+      Instance: 9bb111c2-3ff5-36a7-00f4-2b9a573ea660
 
-   a. Issue this request locally at the new master's machine to find out
-      instance ID of the new master:
+2. On the new master, use the UUID to find the position:
 
-      .. code-block:: tarantoolsession
+   .. code-block:: tarantoolsession
 
-         tarantool> box.space._cluster:select{}
-         ---
-         - - [1, '88580b5c-4474-43ab-bd2b-2409a9af80d2']
-         ...
+      app:instance002> box.info.vclock[box.space._cluster.index.uuid:select{'9bb111c2-3ff5-36a7-00f4-2b9a573ea660'}[1][1]]
+      ---
+      - 999
+      ...
 
-   b. Play the records to the new master:
+3. :ref:`Play the records <tt-play>` from the crashed ``.xlog`` to the new master, starting from the
+   new master position:
 
-      .. code-block:: console
+   .. code-block:: console
+
+      $ tt play 127.0.0.1:3302 var/lib/instance001/00000000000000000000.xlog \
+        --from 1000 \
+        --replica 1 \
+        --username admin --password secret
 
-         $ tt play <new_master_uri> <xlog_file> --from 23425 --replica 1
 
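
To confirm that the replay succeeded, you can check that the new master's vclock component for the crashed instance (replica id ``1`` in the example above) has advanced past the position you played from. The value shown is illustrative:

.. code-block:: tarantoolsession

    app:instance002> box.info.vclock[1]
    ---
    - 1050
    ...
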
 .. _admin-disaster_recovery-master_master:
 
--------------------------------------------------------------------------------
 Master-master
--------------------------------------------------------------------------------
+-------------
+
+**Configuration:** :ref:`master-master <replication-bootstrap-master-master>`.
 
-Configuration: Two masters.
+**Problem:** One master has crashed.
 
-Problem: Master#1 has crashed.
+**Actions:**
 
-Your actions:
+1. Let the load be handled by the other master alone.
 
-1. Let the load be handled by master#2 (effective master) alone.
+2. Remove the crashed master from the replica set.
+
+3. Set up a replacement for the crashed master on a spare host.
+   Learn more from :ref:`Adding and removing instances <replication-master-master-add-remove-instances>`.
 
-2. Follow the same steps as in the
-   :ref:`master-replica <admin-disaster_recovery-master_replica>` recovery scenario
-   to create a new master and salvage lost data.
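
As a quick check for step 1, the surviving master should be writable and should report its connection to the crashed peer as broken. The instance name, replica id, and status value below are illustrative:

.. code-block:: tarantoolsession

    app:instance002> box.info.ro
    ---
    - false
    ...

    app:instance002> box.info.replication[1].upstream.status
    ---
    - disconnected
    ...
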
 
 .. _admin-disaster_recovery-data_loss:
 
--------------------------------------------------------------------------------
 Data loss
--------------------------------------------------------------------------------
+---------
+
+**Configuration:** master-replica or master-master.
+
+**Problem:** Data was deleted at one master and this data loss was propagated to the other node (master or replica).
+
+**Actions:**
+
+1. Put all nodes in read-only mode.
+   Depending on the :ref:`replication.failover <configuration_reference_replication_failover>` mode used, this can be done as follows:
+
+   - ``manual``: change the replica set leader to ``null``.
+   - ``election``: switch from the ``election`` failover mode to ``manual`` and change the replica set leader to ``null``.
+   - ``off``: set ``database.mode`` to ``ro``.
 
-Configuration: Master-master or master-replica.
+   Reload the configuration on all instances using the ``reload()`` function provided by the :ref:`config <config-module>` module.
 
-Problem: Data was deleted at one master and this data loss was propagated to the
-other node (master or replica).
+2. Turn off deletion of expired checkpoints with :doc:`/reference/reference_lua/box_backup/start`.
+   This prevents the Tarantool garbage collector from removing files
+   made with older checkpoints until :doc:`/reference/reference_lua/box_backup/stop` is called.
 
-The following steps are applicable only to data in memtx storage engine.
-Your actions:
+3. Get the latest valid :ref:`.snap file <internals-snapshot>` and
+   use the ``tt cat`` command to calculate at which LSN the data loss occurred.
 
-1. Put all nodes in :ref:`read-only mode <cfg_basic-read_only>` and disable
-   deletion of expired checkpoints with :doc:`/reference/reference_lua/box_backup/start`.
-   This will prevent the Tarantool garbage collector from removing files
-   made with older checkpoints until :doc:`/reference/reference_lua/box_backup/stop` is called.
+4. Start a new instance and use the :ref:`tt play <tt-play>` command to
+   play to it the contents of ``.snap`` and ``.xlog`` files up to the calculated LSN.
 
-2. Get the latest valid :ref:`.snap file <internals-snapshot>` and
-   use ``tt cat`` command to calculate at which lsn the data loss occurred.
+5. Bootstrap a new replica from the recovered master.
 
-3. Start a new instance (instance#1) and use ``tt play`` command to
-   play to it the contents of .snap/.xlog files up to the calculated lsn.
+.. NOTE::
 
-4. Bootstrap a new replica from the recovered master (instance#1).
+   The steps above are applicable only to data in the memtx storage engine.
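
For steps 1 and 2, a session on one of the nodes might look like the sketch below once the changed configuration is reloaded. The instance name and the file list returned by ``box.backup.start()`` are illustrative:

.. code-block:: tarantoolsession

    app:instance001> require('config'):reload()
    ---
    ...

    app:instance001> box.info.ro
    ---
    - true
    ...

    app:instance001> box.backup.start()
    ---
    - - var/lib/instance001/00000000000000000015.snap
    ...

Call ``box.backup.stop()`` after the recovery procedure is finished so that checkpoint garbage collection resumes.
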

doc/book/admin/replication/index.rst

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Replication administration
 ==========================
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
    repl_monitoring
    repl_recover

doc/book/admin/replication/repl_monitoring.rst

Lines changed: 21 additions & 47 deletions
@@ -4,49 +4,28 @@ Monitoring a replica set
 ========================
 
 To learn what instances belong to the replica set and obtain statistics for all
-these instances, issue a :doc:`/reference/reference_lua/box_info/replication` request:
-
-.. code-block:: tarantoolsession
-
-   tarantool> box.info.replication
-   ---
-   replication:
-     1:
-       id: 1
-       uuid: b8a7db60-745f-41b3-bf68-5fcce7a1e019
-       lsn: 88
-     2:
-       id: 2
-       uuid: cd3c7da2-a638-4c5d-ae63-e7767c3a6896
-       lsn: 31
-       upstream:
-         status: follow
-         idle: 43.187747001648
-         peer: [email protected]:3301
-         lag: 0
-       downstream:
-         vclock: {1: 31}
-     3:
-       id: 3
-       uuid: e38ef895-5804-43b9-81ac-9f2cd872b9c4
-       lsn: 54
-       upstream:
-         status: follow
-         idle: 43.187621831894
-         peer: [email protected]:3301
-         lag: 2
-       downstream:
-         vclock: {1: 54}
-   ...
-
-This report is for a master-master replica set of three instances, each having
-its own instance id, UUID and log sequence number.
-
-.. image:: /concepts/replication/images/mm-3m-mesh.svg
+these instances, execute a :ref:`box.info.replication <box_info_replication>` request.
+The output below shows the replication status for a replica set containing one :ref:`master and two replicas <replication-master_replica_bootstrap>`:
+
+.. include:: /how-to/replication/repl_bootstrap.rst
+   :start-after: box_info_replication_manual_leader_start
+   :end-before: box_info_replication_manual_leader_end
+
+The following diagram illustrates the ``upstream`` and ``downstream`` connections for ``box.info.replication`` executed at the master instance (``instance001``):
+
+.. image:: _images/box_info_replication_instance001.png
+   :width: 600
+   :align: center
+   :alt: replication status on master
+
+If ``box.info.replication`` is executed on ``instance002``, the ``upstream`` and ``downstream`` connections look as follows:
+
+.. image:: _images/box_info_replication_instance002.png
+   :width: 600
    :align: center
+   :alt: replication status on replica
 
-The request was issued at master #1, and the reply includes statistics for the
-other two masters, given in regard to master #1.
+This means that statistics for replicas are given relative to the instance on which ``box.info.replication`` is executed.
 
 The primary indicators of replication health are:
 
@@ -74,9 +53,4 @@ The primary indicators of replication health are:
 machines, do not be surprised if it's negative: a time drift may lead to the
 remote master clock being consistently behind the local instance's clock.
 
-For multi-master configurations, ``lag`` is the maximal lag.
-
-For better understanding, see the following diagram illustrating the ``upstream`` and ``downstream`` connections within the replica set of three instances:
-
-.. image:: /concepts/replication/images/replication.svg
-   :align: left
+For a :ref:`master-master <replication-bootstrap-master-master>` configuration, ``lag`` is the maximal lag.
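
In practice, the ``status``, ``idle``, and ``lag`` indicators can also be checked programmatically. The helper below is an illustrative sketch that collects the peers whose upstream connection is not in the ``follow`` state or lags by more than a given number of seconds:

.. code-block:: lua

    -- Illustrative health check: list peers with unhealthy upstream connections.
    local function replication_issues(max_lag)
        local issues = {}
        for id, peer in pairs(box.info.replication) do
            local up = peer.upstream
            -- The entry for the local instance has no upstream, so skip it.
            if id ~= box.info.id and up ~= nil then
                if up.status ~= 'follow' or (up.lag or 0) > max_lag then
                    table.insert(issues, { id = id, status = up.status, lag = up.lag })
                end
            end
        end
        return issues
    end

    -- Example: report peers lagging behind by more than one second.
    replication_issues(1)
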
