docs: Add "How to upgrade etcd"

surajssd · surajssd · commit fd056aef3cfe · 2020-08-19T14:58:39.000+05:30
Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
diff --git a/docs/how-to-guides/upgrade-etcd.md b/docs/how-to-guides/upgrade-etcd.md
@@ -0,0 +1,112 @@
+# Upgrading etcd
+
+## Contents
+
+- [Introduction](#introduction)
+- [Steps](#steps)
+  - [Step 1: Find out the IP and SSH](#step-1-find-out-the-ip-and-ssh)
+  - [Step 2: Create necessary directories with correct permissions](#step-2-create-necessary-directories-with-correct-permissions)
+  - [Step 3: Upgrade etcd](#step-3-upgrade-etcd)
+  - [Step 4: Verify upgrade](#step-4-verify-upgrade)
+  - [Step 5: Verify using `etcdctl`](#step-5-verify-using-etcdctl)
+
+## Introduction
+
+[etcd](https://etcd.io/) stores the state of a Kubernetes cluster, which makes it the cluster's most critical component.
+
+This document provides a step-by-step guide to upgrading etcd in Lokomotive.
+
+## Steps
+
+Repeat the following steps on every controller node, one node at a time.
+
+### Step 1: Find out the IP and SSH
+
+Find the IP address of the controller node in the cloud provider dashboard, then SSH into the node.
+
+```bash
+ssh core@<IP Address>
+```
+
+### Step 2: Create necessary directories with correct permissions
+
+etcd `>= v3.4.10` requires the data directory to have permissions `0700`. Change the permissions, then verify that they read `rwx------`.
+
+> **NOTE**: This step is needed only for Lokomotive deployments created with `lokoctl` version `< 0.4.0`.
+
+```bash
+sudo chmod 0700 /var/lib/etcd/
+sudo ls -ld /var/lib/etcd/
+```
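+
+A numeric mode is easier to check in a script than the `ls -ld` listing. The following is a minimal sketch, assuming GNU `stat` is available on the node. It runs against a scratch directory (an assumption, so that it can be tried safely); on a controller node you would point it at `/var/lib/etcd/` and run `stat` with `sudo`.
+
+```bash
+# Scratch directory stands in for /var/lib/etcd/ (assumption for a safe dry run).
+data_dir=$(mktemp -d)
+chmod 0755 "$data_dir"   # simulate the old, too-permissive mode
+chmod 0700 "$data_dir"   # the change made in this step
+
+# %a prints the permission bits in octal.
+mode=$(stat -c '%a' "$data_dir")
+if [ "$mode" = "700" ]; then
+  echo "permissions OK: $mode"
+else
+  echo "unexpected permissions: $mode"
+fi
+```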
+
+When the node reboots, the `systemd-tmpfiles` service must not alter the permissions of the data directory. To make the change above persistent, run the following command:
+
+```bash
+echo "d    /var/lib/etcd 0700 etcd etcd - -" | sudo tee /etc/tmpfiles.d/etcd-wrapper.conf
+```
+
+### Step 3: Upgrade etcd
+
+Run the following commands:
+
+> **NOTE**: Before proceeding to other commands, set the `etcd_version` variable to the latest etcd version.
+
+```bash
+export etcd_version=<latest etcd version e.g. v3.4.10>
+
+sudo sed -i "s,ETCD_IMAGE_TAG=[^\"]*,ETCD_IMAGE_TAG=${etcd_version}," \
+        /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
+sudo systemctl daemon-reload
+sudo systemctl restart etcd-member
+```
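+
+To preview the effect of the substitution before editing the real drop-in, you can run `sed` without `-i` on a scratch copy. This is only a sketch: the sample file content below is an assumption about how the drop-in is laid out, and the pattern is a variant that stops at a double quote so that a quoted value keeps its closing quote.
+
+```bash
+# Scratch file standing in for 40-etcd-cluster.conf (content is an assumption).
+conf=$(mktemp)
+printf '[Service]\nEnvironment="ETCD_IMAGE_TAG=v3.4.9"\n' > "$conf"
+
+etcd_version=v3.4.10
+
+# Without -i, sed prints the result and leaves the file untouched.
+preview=$(sed "s,ETCD_IMAGE_TAG=[^\"]*,ETCD_IMAGE_TAG=${etcd_version}," "$conf")
+printf '%s\n' "$preview"
+```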
+
+### Step 4: Verify upgrade
+
+Verify that the etcd service is in `active (running)` state:
+
+```bash
+sudo systemctl status --no-pager etcd-member
+```
+
+Run the following command to see the logs of the etcd process since its last restart:
+
+```bash
+sudo journalctl _SYSTEMD_INVOCATION_ID=$(sudo systemctl \
+              show -p InvocationID --value etcd-member.service)
+```
+
+> **NOTE**: Do not proceed with the upgrade of the rest of the cluster if you encounter any errors.
+
+The following log line indicates that the etcd daemon has started without errors:
+
+```log
+etcdserver: starting server... [version: 3.4.10, cluster version: to_be_decided]
+```
+
+The following log line indicates that etcd has rejoined the cluster without issues:
+
+```log
+embed: serving client requests on 10.88.81.1:2379
+```
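+
+Scanning the journal by eye is error-prone, so the check for the two log lines above can be scripted. The function below is a hypothetical helper, not part of Lokomotive or etcd; it assumes the log wording matches the lines quoted above, which may differ in other etcd versions.
+
+```bash
+# check_etcd_logs is a hypothetical helper: it reads log text on stdin and
+# reports whether both log lines quoted above are present.
+check_etcd_logs() {
+  etcd_logs=$(cat)
+  if ! printf '%s\n' "$etcd_logs" | grep -q 'etcdserver: starting server'; then
+    echo "missing: etcdserver start line"
+    return 1
+  fi
+  if ! printf '%s\n' "$etcd_logs" | grep -q 'embed: serving client requests'; then
+    echo "missing: client serving line"
+    return 1
+  fi
+  echo "both expected log lines found"
+}
+
+# Usage on a controller node, with the same journalctl query as above:
+#   sudo journalctl _SYSTEMD_INVOCATION_ID=$(sudo systemctl \
+#         show -p InvocationID --value etcd-member.service) | check_etcd_logs
+```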
+
+### Step 5: Verify using `etcdctl`
+
+Use the `etcdctl` client to verify the state of the etcd cluster.
+
+```bash
+# Find the endpoint of this node's etcd:
+export endpoint=$(grep ETCD_ADVERTISE_CLIENT_URLS \
+        /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf | cut -d"=" -f3 | tr -d '"')
+export flags="--cacert=/etc/ssl/etcd/etcd-client-ca.crt \
+              --cert=/etc/ssl/etcd/etcd-client.crt \
+              --key=/etc/ssl/etcd/etcd-client.key"
+endpoints=$(sudo ETCDCTL_API=3 etcdctl member list $flags --endpoints=${endpoint} \
+            --write-out=json | jq -r '.members[].clientURLs[]')
+endpoints=$(paste -sd, <<< "${endpoints}")  # join the URLs (one per line) with commas
+
+# Verify:
+sudo ETCDCTL_API=3 etcdctl member list $flags --endpoints=${endpoint}
+sudo ETCDCTL_API=3 etcdctl endpoint health $flags --endpoints=${endpoints}
+```
+
+The last command should report that all nodes are healthy. If it does not, rerun the commands from [Step 4](#step-4-verify-upgrade) to find out what went wrong. Once the nodes are healthy, it is safe to move on to the next controller node.
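+
+If the `endpoint` variable in Step 5 comes out empty, it helps to see what the `grep | cut | tr` pipeline extracts. The sketch below runs the same pipeline on a scratch file; the sample line, including the `https` scheme, is an assumption about the drop-in's format.
+
+```bash
+# Scratch file standing in for 40-etcd-cluster.conf (content is an assumption).
+conf=$(mktemp)
+printf 'Environment="ETCD_ADVERTISE_CLIENT_URLS=https://10.88.81.1:2379"\n' > "$conf"
+
+# Splitting on "=" gives: Environment | "ETCD_ADVERTISE_CLIENT_URLS | https://...:2379"
+# so field 3 is the URL, and tr removes the trailing double quote.
+endpoint=$(grep ETCD_ADVERTISE_CLIENT_URLS "$conf" | cut -d"=" -f3 | tr -d '"')
+echo "$endpoint"
+```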