Checkpoint and restart recovery

The current cri-containerd has several restart recovery problems:
1) cri-containerd restart. Because cri-containerd keeps all internal state in memory, including the sandbox list, container list and image list, all of that state is lost once it restarts.
2) containerd restart. When containerd restarts and cri-containerd reconnects, the state in containerd and cri-containerd may no longer match, e.g. a container dies while containerd is down.

To fix this, we should recover/reconcile state when cri-containerd starts, and again after containerd restarts and cri-containerd reconnects.
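
A minimal sketch of what the reconcile step could look like. The `State` type, the `reconcile` function and the policy it applies (trust what the runtime reports, and mark a container that was running but is no longer reported as unknown) are assumptions made for illustration, not the cri-containerd implementation.

```go
package main

import "fmt"

// State is a simplified container state, hypothetical and used only here.
type State string

const (
	Running State = "running"
	Exited  State = "exited"
	Unknown State = "unknown"
)

// reconcile merges the state cri-containerd had recorded (cached) with the
// state the runtime reports after a reconnect (actual). Both maps are keyed
// by container ID. Assumed policy: the runtime is the source of truth for
// every container it still reports.
func reconcile(cached, actual map[string]State) map[string]State {
	result := make(map[string]State, len(actual))
	for id, s := range actual {
		result[id] = s
	}
	for id, s := range cached {
		if _, ok := actual[id]; ok {
			continue // already taken from the runtime
		}
		if s == Running {
			// We believed it was running, but the runtime no longer
			// reports it: it may have died while the runtime was down.
			result[id] = Unknown
		} else {
			result[id] = s
		}
	}
	return result
}

func main() {
	cached := map[string]State{"a": Running, "b": Running, "c": Running}
	actual := map[string]State{"a": Running, "b": Exited}
	merged := reconcile(cached, actual)
	for _, id := range []string{"a", "b", "c"} {
		fmt.Printf("%s: %s -> %s\n", id, cached[id], merged[id])
	}
}
```

Container `b` here stands for the case from the problem list: it died while containerd was down, so the recorded state has to be corrected after the reconnect.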

There are 3 kinds of internal state:
1) Image list. Containerd has all the information we need; we just need to list images from containerd to recover the image list.
2) Sandbox/container metadata. Most of the metadata is not provided by containerd, so we need to checkpoint it for restart recovery. However, because the metadata is constant, we could save it in a containerd container label and let the containerd metadata store persist it for us.
3) Container status. Container status is not persisted by containerd, so we need to persist it ourselves. Because it changes constantly, we may not want to abuse containerd container labels to store it, so we need to maintain its checkpoint ourselves.

/cc @kubernetes-incubator/maintainers-cri-containerd 

Checkpoint and restart recovery #120

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Checkpoint and restart recovery #120

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions