This repository was archived by the owner on Mar 9, 2022. It is now read-only.

Member

@Random-Liu Random-Liu commented Oct 30, 2017

In a test, I saw that once containerd container removal returns an error, the container can never be removed again.

E1029 23:59:48.015653    1035 container_remove.go:56] failed to reset removing state for container "9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56": failed to checkpoint status to "/var/lib/cri-containerd/containers/9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56/status": open /var/lib/cri-containerd/containers/9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56/.tmp-status551243121: no such file or directory
E1029 23:59:48.015675    1035 instrumented_service.go:176] RemoveContainer for "9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56" failed, error: failed to delete containerd container "9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56": context deadline exceeded: unknown
E1030 00:26:47.270788    1035 instrumented_service.go:176] RemoveContainer for "9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56" failed, error: failed to set removing state for container "9271ad01c1ef091260fca1b432128d98d3b6e40735f8e6b3d8f6d8d10d58ac56": container is already in removing state

The reason is that the removing state cannot be reset: the on-disk container status has already been removed, so the reset function fails.

This PR:

  1. Moves container status deletion ahead of container root directory deletion. The status checkpoint file lives in the container root directory, so we may want to remove the status file atomically before we remove the whole root directory.
  2. Moves container status/root directory deletion after containerd container deletion. It doesn't actually matter whether we delete them before or after the containerd container, because restart recovery can handle both cases. However, it makes more sense to ensure that a containerd container always has a status associated with it throughout its lifecycle.
  3. Addresses a TODO to distinguish UpdateSync from Update. In most cases we just need the code to run in an Update transaction to avoid races, and don't actually update the on-disk status. We could use Update for those cases in the future. This also fixes the issue described above: with this PR, we only use Update to set/reset the removing state, which involves no disk operation and is not affected by container status/root directory removal.

Signed-off-by: Lantao Liu [email protected]

Member

@mikebrow mikebrow left a comment

Question: is returning early the right thing to do when deleting?

// Delete containerd container.
if err := container.Container.Delete(ctx, containerd.WithSnapshotCleanup); err != nil {
	if !errdefs.IsNotFound(err) {
		return nil, fmt.Errorf("failed to delete containerd container %q: %v", id, err)
Member

What if we just log warnings and continue instead of returning early, then return the error after we have attempted to delete all the parts, as a best-effort model?

Member Author

I think if we fail to delete the container, we want to keep its status and root directory so that we can still retrieve information about it.

Member

OK. Per discussion, let's add a TODO or open an issue about forcibly removing/killing the container on a remove request, or get guidance from the CRI sig-node team on what remove is expected to do.


// Delete container checkpoint.
if err := container.Delete(); err != nil {
	return nil, fmt.Errorf("failed to delete container checkpoint for %q: %v", id, err)
Member

same

Member Author

container.Delete is atomic; we should use it only to delete the container status.

If that fails, we may not want to remove the container root directory, because that removal is not atomic.

glog.V(5).Infof("Remove called for containerd container %q that does not exist", id, err)
containerRootDir := getContainerRootDir(c.config.RootDir, id)
if err := system.EnsureRemoveAll(containerRootDir); err != nil {
	return nil, fmt.Errorf("failed to remove container root directory %q: %v",
Member

same

@Random-Liu Random-Liu added this to the v1.0.0-alpha.1 milestone Oct 31, 2017
Member Author

Random-Liu commented Oct 31, 2017

In short, we do need to enforce the order of those deletions. :)

Member Author

Random-Liu commented Oct 31, 2017

@mikebrow Added a TODO for forcible stop.

Member

@mikebrow mikebrow left a comment

/LGTM

@Random-Liu Random-Liu merged commit c44f798 into containerd:master Oct 31, 2017
@Random-Liu Random-Liu deleted the fix-removing branch October 31, 2017 21:29