Starting or updating Kindling causes the K8S APIServer to crash with OOM #459
Description
Describe the bug
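In a 30-node cluster, deploying or updating the Kindling DaemonSet triggers the following problem. I updated the container image in the DaemonSet configuration, then deleted all of the DaemonSet's Pods so that the new image would take effect quickly. After these two steps, running kubectl get pod to check Kindling's status was always refused, and it took 5 to 10 minutes for the cluster to recover.
✗ kubectl -n kindling edit daemonset kindling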
✗ kubectl -n kindling get pod -o custom-columns=NAME:.metadata.name | grep kindling | xargs kubectl -n kindling delete pod >/dev/null
✗ kubectl -n kindling get pods -o wide
E0209 09:21:29.286705 3996882 memcache.go:238] couldn't get current server API group list: Get "https://10.91.173.62:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: EOF
E0209 09:22:01.287685 3996882 memcache.go:238] couldn't get current server API group list: Get "https://10.91.173.62:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: EOF
E0209 09:22:51.568619 3996882 memcache.go:238] couldn't get current server API group list:
Error from server (Forbidden): pods is forbidden: User "xxxxxx" cannot list resource "pods" in API group "" in the namespace "kindling"
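The cluster logs show that after the Pods are deleted and recreated, the Kubernetes API Server always crashes from OOM and restarts.
The API Server runs as a single instance with only 16 GB of memory.
The cluster contains 1200 Pods, 2800 Services, 1400 Deployments, and 8400 ReplicaSets.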
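For comparison, the delete step in the reproduction removes every Kindling Pod in a single call, so all 30 agents restart at about the same time. The sketch below deletes the Pods in small batches with a pause between them. It is untested against this cluster and assumes the simultaneous restart is what triggers the memory spike, which the report does not confirm. The function name `delete_in_batches`, the batch size, and the pause length are hypothetical illustrations, not tuned recommendations.

```shell
#!/usr/bin/env bash
# Sketch only: staggered deletion of Pods read from stdin, one name per line.
# delete_in_batches is a hypothetical helper, not part of kubectl or Kindling.
#   $1 = namespace
#   $2 = Pods per batch (illustrative default: 5)
#   $3 = seconds to wait between batches (illustrative default: 30)
delete_in_batches() {
  local namespace="$1" batch_size="${2:-5}" pause="${3:-30}"
  local batch=()
  local pod
  while IFS= read -r pod; do
    # Skip blank lines in the input.
    [ -n "$pod" ] || continue
    batch+=("$pod")
    if [ "${#batch[@]}" -ge "$batch_size" ]; then
      # Redirect stdin so the delete call cannot consume the remaining names.
      kubectl -n "$namespace" delete pod "${batch[@]}" </dev/null
      batch=()
      sleep "$pause"
    fi
  done
  # Delete whatever is left in the final, possibly partial, batch.
  if [ "${#batch[@]}" -gt 0 ]; then
    kubectl -n "$namespace" delete pod "${batch[@]}" </dev/null
  fi
}

# Example invocation, reusing the listing pipeline from the report:
#   kubectl -n kindling get pod -o custom-columns=NAME:.metadata.name \
#     | grep kindling | delete_in_batches kindling 5 30
```

With 30 Pods, a batch size of 5, and a 30-second pause, the full restart would take roughly three minutes, which trades rollout speed for a lower peak of concurrent agent startups.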