Starting or updating Kindling causes the K8S API Server to crash from OOM #459

@LambertZhaglog

Describe the bug

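While deploying and updating the Kindling DaemonSet on a 30-node cluster, I ran into the following problem. I updated the container image in the DaemonSet manifest, then deleted all of the DaemonSet's Pods so that the new image would take effect quickly. After these two steps, running kubectl get pod to check Kindling's status was consistently refused. It took 5 to 10 minutes for the API to recover.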

✗ kubectl -n kindling edit daemonset kindling
✗ kubectl -n kindling get pod -o custom-columns=NAME:.metadata.name | grep kindling | xargs kubectl -n kindling delete pod >/dev/null
✗ kubectl -n kindling get pods -o wide
E0209 09:21:29.286705 3996882 memcache.go:238] couldn't get current server API group list: Get "https://10.91.173.62:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: EOF
E0209 09:22:01.287685 3996882 memcache.go:238] couldn't get current server API group list: Get "https://10.91.173.62:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: EOF
E0209 09:22:51.568619 3996882 memcache.go:238] couldn't get current server API group list:
Error from server (Forbidden): pods is forbidden: User "xxxxxx" cannot list resource "pods" in API group "" in the namespace "kindling"
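
One way to check whether the simultaneous restart of every agent is the trigger is to delete the Pods a few at a time instead of all at once. The following is a minimal sketch, not something from the Kindling project: `delete_in_batches` and the `KUBECTL` override variable are hypothetical names, and the batch size and pause in the usage line are arbitrary values to tune.

```shell
#!/bin/sh
# Command used to talk to the cluster. Override it with a stub for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# delete_in_batches NAMESPACE BATCH_SIZE PAUSE_SECONDS   (hypothetical helper)
# Reads pod names from stdin, one per line, and deletes them BATCH_SIZE at a
# time, sleeping PAUSE_SECONDS after each full batch.
delete_in_batches() {
  ns="$1"; size="$2"; pause="$3"
  set --                      # reuse the positional parameters as the batch
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    set -- "$@" "$name"
    if [ "$#" -ge "$size" ]; then
      "$KUBECTL" -n "$ns" delete pod "$@"
      set --
      sleep "$pause"
    fi
  done
  # Delete whatever is left over after the last full batch.
  if [ "$#" -gt 0 ]; then
    "$KUBECTL" -n "$ns" delete pod "$@"
  fi
}

# Usage (5 pods per batch and a 60 second pause are assumed values):
#   kubectl -n kindling get pod -o custom-columns=NAME:.metadata.name \
#     | grep kindling | delete_in_batches kindling 5 60
```

If the API Server stays up when the Pods are recreated gradually, that would point at the combined startup load of all agents rather than at any single agent.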

The cluster logs show that, after the Pods are deleted and recreated, the Kubernetes API Server repeatedly crashes from OOM and restarts.

The API Server runs as a single instance with only 16 GB of memory.

The cluster contains 1200 Pods, 2800 Services, 1400 Deployments, and 8400 ReplicaSets.
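
To put these counts in perspective, here is a rough back-of-the-envelope estimate. It rests on two assumptions that the report does not confirm: that one Kindling agent runs on each of the 30 nodes, and that every agent lists all four resource types once at startup, with all agents starting at about the same time.

```shell
#!/bin/sh
# Resource counts reported in the issue.
pods=1200
services=2800
deployments=1400
replicasets=8400

# ASSUMPTION: one agent per node on the 30-node cluster.
agents=30

# Objects returned if a single agent lists every resource type once.
objects_per_agent=$(( pods + services + deployments + replicasets ))

# ASSUMPTION: all agents restart together and each performs a full list.
objects_total=$(( objects_per_agent * agents ))

echo "objects per agent: ${objects_per_agent}"
echo "objects across ${agents} agents: ${objects_total}"
```

Under these assumptions the API Server would have to serve 13800 objects per agent, or 414000 objects in total, within a short window. The actual memory cost depends on object sizes and on which resources the agent really requests, neither of which is stated here.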

Labels: area/collector, bug, good first issue, help wanted
