kube-proxy gets stuck if master is recreated on new instance #56720

@calvix

Description

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
kube-proxy gets stuck after the master goes down and is recreated on a new machine.

We run kube-proxy as a DaemonSet, and under normal circumstances it works fine.

We run k8s nodes as immutable instances: on any reboot, stop, or failure, the node is recreated as a whole new machine with a new IP, MAC, and everything else. etcd data is kept on persistent storage, but the OS is not.
The k8s API endpoint stays the same.

This leads to an issue: when the master is recreated, kube-proxy ends up in a weird stuck state where it simply doesn't work. We run health checks on kube-proxy, but they never trigger a restart, because kube-proxy considers itself healthy and there is not a single log entry indicating that anything is wrong.
To fix it we have to kill all kube-proxy pods, after which everything works again.

My wild guess is that kube-proxy holds an open connection to the k8s API, and when the master is recreated with a new IP, kube-proxy keeps using the old, dead connection.
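
To make the guess concrete, here is a minimal Go sketch of transport settings that would let a client notice a dead peer. This is not kube-proxy's code; the ReadIdleTimeout/PingTimeout fields come from golang.org/x/net/http2, and whether kube-proxy's client configures anything like this is exactly what I am questioning:

```go
package main

import (
	"net"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newTransport builds an HTTP transport that can actually notice a dead
// apiserver: TCP keepalives on the dial, plus HTTP/2-level pings so an
// idle stream doesn't hang forever on a socket whose peer no longer exists.
func newTransport() (*http.Transport, error) {
	tr := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second, // kernel-level probes against the old master's IP
		}).DialContext,
	}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a ping if no frames arrive for 30s
	h2.PingTimeout = 15 * time.Second     // tear the connection down if the ping stalls
	return tr, nil
}

func main() {
	if _, err := newTransport(); err != nil {
		panic(err)
	}
}
```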

What you expected to happen:
kube-proxy should periodically check that its current connection to the k8s API is still valid and, if it is not, force a reconnection.
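
A minimal sketch of that watchdog idea, assuming a plain *http.Client and a placeholder healthz URL (the real kube-proxy talks to the apiserver through client-go, so this shows only the shape of the behavior, not a patch):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// watchdog probes the apiserver with its own short timeout. On failure it
// drops the transport's idle connections, so subsequent requests have to
// re-dial the endpoint (which may now resolve to the new master). Exiting
// the process and letting the DaemonSet restart the pod would work too.
func watchdog(client *http.Client, tr *http.Transport, healthzURL string) {
	for range time.Tick(30 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, healthzURL, nil)
		resp, err := client.Do(req)
		cancel()
		if err != nil {
			log.Printf("apiserver unreachable: %v; forcing reconnect", err)
			tr.CloseIdleConnections()
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	tr := &http.Transport{}
	client := &http.Client{Transport: tr}
	go watchdog(client, tr, "https://kubernetes.example:6443/healthz") // placeholder URL
	select {} // stand-in for kube-proxy's real event loops
}
```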

How to reproduce it (as minimally and precisely as possible):

  • Create a running k8s cluster with a single master.
  • Recreate the master with the same etcd data and API endpoint but a different instance IP.
  • Test an existing k8s Service (it should no longer work properly), or create a new Service and test that one; a minimal probe sketch follows this list.
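
For step 3, a quick probe run from any node (the ClusterIP and port here are placeholders for whatever Service you test): a timeout after the master swap, and success again after killing the kube-proxy pods, matches the behavior described above.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Dial a Service's ClusterIP with a hard timeout and report whether
// kube-proxy's rules are actually routing traffic to it.
func main() {
	conn, err := net.DialTimeout("tcp", "10.3.0.10:80", 3*time.Second)
	if err != nil {
		fmt.Println("service unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("service reachable")
}
```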

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5+coreos.0", GitCommit:"070d238cd2ec359928548e486a9171b498573181", GitTreeState:"clean", BuildDate:"2017-08-31T21:28:39Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

We saw similar behavior on 1.8.1 as well.

  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. from /etc/os-release):
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1465.8.0
VERSION_ID=1465.8.0
BUILD_ID=2017-09-20-2237
PRETTY_NAME="Container Linux by CoreOS 1465.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux 00008df14a32b2b9 4.12.14-coreos #1 SMP Wed Sep 20 22:20:05 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz GenuineIntel GNU/Linux
  • Install tools: selfhosted
  • Others:

@kubernetes/sig-network-bugs

Labels: kind/bug, lifecycle/rotten, sig/network
