EM: Add restart on-failure for metadata service #1362
Seeing this PR, I think we should slowly start changing Packet -> Equinix Metal.
We have #1060 :)
invidian left a comment
The other services with `Type=oneshot` are: `wait-for-dns`, `bootkube`, `delete-node`, `create-etcd-config`, `persist-data-raid`.
Most of them are services that we want to fail early, so the user knows about it, rather than retrying endlessly until something else times out on them.
If there is a DNS outage, I think we should retry `wait-for-dns.service`, as it blocks kubelet from starting.
IMO we should also retry `delete-node.service`, so the node does not go away without deregistering. Otherwise its pods stay assigned to it for a long time and won't be rescheduled.
The two points above also apply to other platforms, not only Packet.
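For reference, a rough sketch of what retrying one of these units could look like, written as a systemd drop-in. The drop-in path in the comment is hypothetical and only for illustration, since this PR changes the unit contents in the templates. The `RestartSec` value mirrors the one used in the commits here.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/wait-for-dns.service.d/10-restart.conf
# (the path is an assumption for illustration, not a file from this PR).
[Service]
# Restart the unit only when it fails (non-zero exit code, fatal signal, timeout).
Restart=on-failure
# Wait 5 seconds between attempts.
RestartSec=5s
```

As far as I know, recent systemd versions accept `Restart=on-failure` for `Type=oneshot` units but reject `Restart=always` and `Restart=on-success`, so `on-failure` is the option that fits here.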
assets/terraform-modules/packet/flatcar-linux/kubernetes/cl/controller.yaml.tmpl
This commit adds `Restart=on-failure` and `RestartSec=10s` to the metadata service. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
's | Flatcar | Flatcar Container Linux | g' Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
This commit adds `Restart=on-failure` and `RestartSec=5s` to the wait-for-dns service on all platforms. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
This commit adds `Restart=on-failure` and `RestartSec=5s` to the delete-node service on all platforms. Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
I don't know what the unforeseen consequences will be for adding retries for |
f0813be to 08d342f
I can wait until #1368 is merged.
One more thing to consider: The |
Creating a new issue for what Kai has suggested.
This PR adds `Restart=on-failure` and `RestartSec=5s` to the metadata service.
Fixes #1298