docs: Add "How to upgrade etcd"

surajssd · surajssd · commit f5d24e03b2ec · 2020-08-14T16:31:30.000+05:30
Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
diff --git a/docs/how-to-guides/upgrade-etcd.md b/docs/how-to-guides/upgrade-etcd.md
@@ -0,0 +1,117 @@
+# Upgrading etcd
+
+## Contents
+
+- [Introduction](#introduction)
+- [Steps](#steps)
+  - [Step 1: Find out the IP and SSH](#step-1-find-out-the-ip-and-ssh)
+  - [Step 2: Create necessary directories with correct permissions](#step-2-create-necessary-directories-with-correct-permissions)
+  - [Step 3: Upgrade etcd](#step-3-upgrade-etcd)
+  - [Step 4: Verify upgrade](#step-4-verify-upgrade)
+  - [Step 5: Verify using `etcdctl`](#step-5-verify-using-etcdctl)
+
+## Introduction
+
+[etcd](https://etcd.io/) stores the state of a Kubernetes cluster, which makes it the most crucial component of the cluster.
+
+This document provides a step-by-step guide to upgrading etcd in Lokomotive.
+
+## Steps
+
+Repeat the following steps on every controller node, one node at a time.
+
+### Step 1: Find out the IP and SSH
+
+Find the IP address of the controller node in the cloud provider dashboard, then SSH into it.
+
+```bash
+ssh core@<IP Address>
+```
+
+### Step 2: Create necessary directories with correct permissions
+
+The latest etcd (`v3.4.10`) requires the data directory to have `0700` permissions. Change the permissions accordingly, then verify that `ls` reports them as `drwx------`.
+
+```bash
+sudo chmod 0700 /var/lib/etcd/
+sudo ls -ld /var/lib/etcd/
+```
+
+The right settings must also be in place so that the `systemd-tmpfiles` service does not alter the permissions of the data directory when the node reboots. To make the change above persistent, run the following command:
+
+```bash
+echo "d    /var/lib/etcd 0700 etcd etcd - -" | sudo tee /etc/tmpfiles.d/etcd-wrapper.conf
+```
+
+### Step 3: Upgrade etcd
+
+Run the following commands:
+
+> **NOTE**: Before proceeding to other commands, set the `etcd_version` variable to the latest etcd version.
+
+```bash
+export etcd_version=<latest etcd version e.g. v3.4.10>
+
+sudo sed -i "s,ETCD_IMAGE_TAG=.*,ETCD_IMAGE_TAG=${etcd_version}," \
+        /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
+sudo systemctl daemon-reload
+sudo systemctl restart etcd-member
+```
+
+### Step 4: Verify upgrade
+
+Verify that the etcd service is in `active (running)` state:
+
+```bash
+sudo systemctl status --no-pager etcd-member
+```
+
+Run the following command to see logs of the process since the last restart:
+
+```bash
+sudo journalctl _SYSTEMD_INVOCATION_ID=$(sudo systemctl \
+              show -p InvocationID --value etcd-member.service)
+```
+
+The following log line indicates that the etcd daemon has come up without errors:
+
+```log
+etcdserver: starting server... [version: 3.4.10, cluster version: to_be_decided]
+```
+
+The following log line indicates that etcd has rejoined the cluster without issues:
+
+```log
+embed: serving client requests on 10.88.81.1:2379
+```
+
+### Step 5: Verify using `etcdctl`
+
+Use the `etcdctl` client to verify the state of the etcd cluster.
+
+> **NOTE**: Before proceeding to other commands, set the `no_of_controller_nodes` variable to the number of controller nodes in the cluster.
+
+```bash
+export no_of_controller_nodes=<no of controller nodes>
+
+# Find the endpoint of etcd0:
+export endpoint=$(grep ETCD_ADVERTISE_CLIENT_URLS /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf | cut -d"=" -f3 | tr -d '"')
+export endpoints="${endpoint}"
+
+# Create list of other endpoints:
+for ((n = 1; n < no_of_controller_nodes; n++)); do
+  np=$(sed "s|etcd0|etcd${n}|g" <<< $endpoint)
+  endpoints="${endpoints},${np}"
+done
+
+export flags="--cacert=/etc/ssl/etcd/etcd-client-ca.crt \
+              --cert=/etc/ssl/etcd/etcd-client.crt \
+              --key=/etc/ssl/etcd/etcd-client.key \
+              --endpoints=${endpoints}"
+
+# Verify:
+sudo ETCDCTL_API=3 etcdctl member list $flags
+sudo ETCDCTL_API=3 etcdctl endpoint health $flags
+```
+
+The last command should report each node as healthy.
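
Review note on Step 5: the endpoint loop replaces the literal string `etcd0`, so it appears to yield distinct endpoints only when run on the node whose advertised URL contains `etcd0`. The sketch below is one possible way to make the list independent of the node it runs on. It is not part of the commit. The function name `build_endpoints` is hypothetical, and it assumes member hostnames differ only in the index that follows the first `etcd` in the URL, as the original loop implies.

```shell
# Sketch: build a comma-separated etcd endpoint list from one known endpoint.
# Assumption: hostnames look like ...etcd0..., ...etcd1..., and so on, and the
# first "etcd<digits>" in the URL is the member index.
build_endpoints() {
  seed="$1"    # any member's client URL, e.g. the value found in Step 5
  count="$2"   # number of controller nodes
  list=""
  n=0
  while [ "$n" -lt "$count" ]; do
    # Rewrite the first "etcd<digits>" to the index of member n, so the
    # result does not depend on which controller node the seed came from.
    member=$(printf '%s\n' "$seed" | sed -E "s|etcd[0-9]+|etcd${n}|")
    # Append with a comma separator, except for the first entry.
    list="${list:+${list},}${member}"
    n=$((n + 1))
  done
  printf '%s\n' "$list"
}
```

With the variables from Step 5, this could be used as `endpoints=$(build_endpoints "${endpoint}" "${no_of_controller_nodes}")`. If the cluster name itself contains `etcd` followed by digits, the substitution would hit the wrong part of the URL, so check the resulting list before passing it to `etcdctl`.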