Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Kube-proxy gets stuck after the master goes down and is recreated on a new machine.
We run kube-proxy as a daemon set, and under normal circumstances kube-proxy works fine.
We run k8s nodes as immutable instances: if there is a reboot, stop, or error, the node is recreated as a whole new machine with a new IP, MAC, and everything else. Etcd data is stored on persistent storage, but the OS is not.
The K8s API endpoint stays the same.
This leads to an issue: when the master is "recreated", kube-proxy ends up in a stuck state in which it doesn't work. We run health checks on kube-proxy, but they do not trigger a restart, because kube-proxy thinks it is healthy and there is not a single log entry indicating that anything is wrong.
To fix it we need to kill all kube-proxy pods and then it works again.
My wild assumption is that kube-proxy is holding an open connection to the k8s API, and if the master is recreated with a new IP, kube-proxy keeps using the old, non-working connection.
What you expected to happen:
Kube-proxy periodically checks whether its current connection to the K8s API is still valid and, if not, forces a reconnection.
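To make the expected behavior concrete, here is a minimal sketch of an idle watchdog. It is not how kube-proxy is implemented; `ConnectionWatchdog`, `mark_activity`, `check`, and the `reconnect` callback are all hypothetical names. It also assumes that the server sends some traffic (events or heartbeats) at least once per `idle_limit` seconds, so that prolonged silence can be treated as a dead connection.

```python
import time
from typing import Callable


class ConnectionWatchdog:
    """Force a reconnect when a connection has been silent for too long.

    Hypothetical illustration only; not taken from kube-proxy.
    """

    def __init__(self,
                 reconnect: Callable[[], None],
                 idle_limit: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._reconnect = reconnect    # callback that tears down and redials
        self._idle_limit = idle_limit  # seconds of silence tolerated
        self._clock = clock            # injectable so the logic can be tested
        self._last_seen = clock()
        self.reconnects = 0

    def mark_activity(self) -> None:
        """Call whenever an event or heartbeat arrives on the connection."""
        self._last_seen = self._clock()

    def check(self) -> bool:
        """Reconnect if the idle limit is reached; return True if it fired."""
        if self._clock() - self._last_seen < self._idle_limit:
            return False
        self._reconnect()
        self.reconnects += 1
        # Give the fresh connection a full idle window before the next check.
        self._last_seen = self._clock()
        return True
```

A caller would invoke `check()` from a timer loop and `mark_activity()` from the code path that reads from the connection. Tracking silence on the existing connection, rather than probing the endpoint with a fresh connection, matters here: in the scenario described above the endpoint itself stays reachable, so a fresh probe would succeed even while the old connection is dead.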
How to reproduce it (as minimally and precisely as possible):
- Create a running k8s cluster with a single master.
- Recreate the master with the same etcd data and endpoint but a different instance IP.
- Test existing resources of type service (they should not work properly), or create a new one and test that new service.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5+coreos.0", GitCommit:"070d238cd2ec359928548e486a9171b498573181", GitTreeState:"clean", BuildDate:"2017-08-31T21:28:39Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
We saw similar behavior on 1.8.1 as well.
- Cloud provider or hardware configuration: baremetal
- OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1465.8.0
VERSION_ID=1465.8.0
BUILD_ID=2017-09-20-2237
PRETTY_NAME="Container Linux by CoreOS 1465.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
- Kernel (e.g. uname -a):
Linux 00008df14a32b2b9 4.12.14-coreos #1 SMP Wed Sep 20 22:20:05 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz GenuineIntel GNU/Linux
- Install tools: selfhosted
- Others:
@kubernetes/sig-network-bugs