Skip to content

Pre-open container root directory#5190

Open
kolyshkin wants to merge 4 commits intoopencontainers:mainfrom
kolyshkin:preopen_root
Open

Pre-open container root directory#5190
kolyshkin wants to merge 4 commits intoopencontainers:mainfrom
kolyshkin:preopen_root

Conversation

@kolyshkin
Copy link
Copy Markdown
Contributor

A lot of filesystem-related stuff happens inside the container root
directory, and we have used its name before. It makes sense to pre-open
it and use a *os.File handle instead.

Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.

This is a somewhat naive attempt at it, but it is surprisingly simple.

@kolyshkin kolyshkin marked this pull request as ready for review March 20, 2026 22:54
@kolyshkin kolyshkin requested review from cyphar and rata March 20, 2026 22:54
@cyphar
Copy link
Copy Markdown
Member

cyphar commented Mar 22, 2026

Thanks, this was actually on my list of libpathrs TODOs, and I'm super glad the patch turned out this minimal! 😄

My only real concern is that if we want to re-enable support for mounting on top of / in the container (#5070) then we will need to have special handling to re-open the root handle after every mount (or only do it if the mount is over the root but that requires doing different validation work) otherwise mounts will act weird (mounts always get put on top of the mount stack but the path resolution would be done from the lower mount so nested mounts won't work or could result in weird results).

IIRC you also can't just do a basic re-open or open . to get the top mount, but I will need to double-check that with some testing...

WDYT? We can probably just leave it as a comment for now though...

@kolyshkin
Copy link
Copy Markdown
Contributor Author

I don't really know what to do about #5070, to me it seems like a very weird case scenario, but since runc used to support it, maybe we'll need to fix it eventually.

I've added a comment.

@cyphar cyphar added the backport/1.5-todo A PR in main branch which needs to be backported to release-1.5 label Mar 25, 2026
@cyphar cyphar modified the milestones: 1.6.0, 1.5.0-rc.2 Mar 25, 2026
@kolyshkin kolyshkin requested a review from lifubang March 26, 2026 04:27
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe here we can use unix.Fchdir? And two other places:

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 128fb2fa..ac3edfd1 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -204,7 +204,7 @@ func prepareRootfs(pipe *syncSocket, iConfig *initConfig) (err error) {
        // container. It's just cleaner to do this here (at the expense of the
        // operation not being perfectly split).
 
-       if err := unix.Chdir(config.Rootfs); err != nil {
+       if err := unix.Fchdir(int(rootFd.Fd())); err != nil {
                return &os.PathError{Op: "chdir", Path: config.Rootfs, Err: err}
        }
 
@@ -217,9 +217,9 @@ func prepareRootfs(pipe *syncSocket, iConfig *initConfig) (err error) {
        }
 
        if config.NoPivotRoot {
-               err = msMoveRoot(config.Rootfs)
+               err = msMoveRoot(rootFd)
        } else if config.Namespaces.Contains(configs.NEWNS) {
-               err = pivotRoot(config.Rootfs)
+               err = pivotRoot(rootFd)
        } else {
                err = chroot()
        }
@@ -1125,7 +1125,7 @@ func setupPtmx(config *configs.Config) error {
 
 // pivotRoot will call pivot_root such that rootfs becomes the new root
 // filesystem, and everything else is cleaned up.
-func pivotRoot(rootfs string) error {
+func pivotRoot(root *os.File) error {
        // While the documentation may claim otherwise, pivot_root(".", ".") is
        // actually valid. What this results in is / being the new root but
        // /proc/self/cwd being the old root. Since we can play around with the cwd
@@ -1138,15 +1138,9 @@ func pivotRoot(rootfs string) error {
        }
        defer unix.Close(oldroot)
 
-       newroot, err := linux.Open(rootfs, unix.O_DIRECTORY|unix.O_RDONLY, 0)
-       if err != nil {
-               return err
-       }
-       defer unix.Close(newroot)
-
        // Change to the new root so that the pivot_root actually acts on it.
-       if err := unix.Fchdir(newroot); err != nil {
-               return &os.PathError{Op: "fchdir", Path: "fd " + strconv.Itoa(newroot), Err: err}
+       if err := unix.Fchdir(int(root.Fd())); err != nil {
+               return &os.PathError{Op: "fchdir", Path: "fd " + strconv.Itoa(int(root.Fd())), Err: err}
        }
 
        if err := unix.PivotRoot(".", "."); err != nil {
@@ -1182,7 +1176,7 @@ func pivotRoot(rootfs string) error {
        return nil
 }
 
-func msMoveRoot(rootfs string) error {
+func msMoveRoot(root *os.File) error {
        // Before we move the root and chroot we have to mask all "full" sysfs and
        // procfs mounts which exist on the host. This is because while the kernel
        // has protections against mounting procfs if it has masks, when using
@@ -1205,7 +1199,7 @@ func msMoveRoot(rootfs string) error {
                // are non-full mounts or are inside the rootfs of the container.
                if info.Root != "/" ||
                        (info.FSType != "proc" && info.FSType != "sysfs") ||
-                       strings.HasPrefix(info.Mountpoint, rootfs) {
+                       strings.HasPrefix(info.Mountpoint, root.Name()) {
                        skip = true
                }
                return skip, stop
@@ -1239,7 +1233,9 @@ func msMoveRoot(rootfs string) error {
        }
 
        // Move the rootfs on top of "/" in our mount namespace.
-       if err := mount(rootfs, "/", "", unix.MS_MOVE, ""); err != nil {
+       src, closer := utils.ProcThreadSelfFd(procRootIno)
+       defer closer()
+       if err := mount(src, "/", "", unix.MS_MOVE, ""); err != nil {
                return err
        }
        return chroot()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src, closer := utils.ProcThreadSelfFd(procRootIno)

@lifubang I don't understand this part. You supply procRootIno (which is an inode number of a root directory on /proc mount) as a file descriptor (1), which does not make any sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you meant rootFd.Fd() here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifubang I've added your code as a second commit, co-authored by me. I've fixed the above issue and did some other changes. One big change is I decided to not use rootFd in msMoveRoot as in there it is only used for mount, and I don't think mounting via fd is any better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and I don't think mounting via fd is any better.

I still suggest to use a proc fd style path here, in theory, if the rootfs path contains a symlink that gets swapped between the time rootFd was opened and when MS_MOVE runs, we end up moving the wrong mount. Using /proc/thread-self/fd/N here closes that window consistently with the rest of the PR's approach. But maybe this would not happen in practice.

if err != nil {
return fmt.Errorf("open rootfs handle: %w", err)
}
defer rootFd.Close()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the close fails and the open fd is leaked into the container? (I know we have this close all fds safe-guard, so it won't really leak). Could the container potentially use ".." to go to the host?

Shall we try to check the return code of close and return an error if we fail or something? Assuming the open fd is a way to go to the host fs. If on pivot_root the kernel does something smart because this is the root (I doubt?) and even if leaked is safe, then I wouldn't bother

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could use it to escape to the host -- and pivot_root(2) doesn't chroot file descriptor references unfortunately (or fortunately, if you care about performance).

Hmmm... Maybe it's better to punt this until we have switched to the new mount API? We could make this an OPEN_TREE_CLONE file descriptor instead which cannot be escaped that way (though I am not sure if this will work without OPEN_TREE_NAMESPACE -- pivot_root won't work the trivial way IIRC). 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then again, runc also currently opens all sorts of files within the host mount namespace (to support the permission-less bind-mount stuff but also as a standard Go program), so this really isn't that big of a change and we need to rely on stuff being closed anyway. So it's probably fine...? 😞

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could use it to escape to the host -- and pivot_root(2) doesn't chroot file descriptor references unfortunately

I’ve test it, it seems that pivot_root(2) DOES chroot file descriptor, so I said:
#5190 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe it's not that it chroots it (i.e., the awful magic in chroot_fs_refs).

I think it's just that the fd refers to the bind-mount that becomes the filesystem root and so when that happens it just is the root.

In either case, neat.

return fmt.Errorf("filesystem %q must be mounted on ordinary directory", m.Device)
}
dstFile, err := pathrs.MkdirAllInRoot(rootfs, dest, 0o755)
dstFile, err := pathrs.MkdirAllInRoot(c.root, dest, 0o755)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't follow why this patch is needed for the next one, not what is trying to do. It changes the scope of the rootfs var, and it changes so we use c.root instead of the rootfs var?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a slight code refactoring to ease the review process (if you're reviewing commit by commit). It is here to make the last commit smaller in size. Your description of what it does is correct.

I can certainly merge this if you prefer it this way.

@cyphar cyphar self-requested a review March 27, 2026 15:26
Copy link
Copy Markdown
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it a little more and still LGTM.

Indeed, it does not make sense to prepend c.root once we started using
MkdirAllInRoot in commit 63c2908.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
No change in functionality, just a preparation for the next patch.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin force-pushed the preopen_root branch 2 times, most recently from 11d6bae to 05bdb49 Compare March 27, 2026 21:44
@kolyshkin
Copy link
Copy Markdown
Contributor Author

I thought about it a little more and still LGTM.

I've since added the 4th commit based on @lifubang suggestion above (and added O_PATH in the 3rd one). No other changes so you can just review the last commit.

kolyshkin and others added 2 commits March 27, 2026 14:53
A lot of filesystem-related stuff happens inside the container root
directory, and we have used its name before. It makes sense to pre-open
it and use a *os.File handle instead.

Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This uses preopened rootfs in Chdir and pivotRoot.

While at it, add O_PATH when opening oldroot in pivotRoot.

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
// This function also creates missing mountpoints as long as they
// are not on top of a tmpfs, as CRIU will restore tmpfs content anyway.
func (c *Container) prepareCriuRestoreMounts(mounts []*configs.Mount) error {
rootFd, err := os.OpenFile(c.config.Rootfs, unix.O_DIRECTORY|unix.O_CLOEXEC|unix.O_PATH, 0)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(added O_PATH to here)

// Pre-open rootfs. NOTE that if we need to re-enable support for mounting
// on top of container root (see issue 5070), we will need to reopen rootFd
// after such mounts.
rootFd, err := os.OpenFile(config.Rootfs, unix.O_DIRECTORY|unix.O_CLOEXEC|unix.O_PATH, 0)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(added O_PATH to here)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, O_DIRECTORY is actually fine as-is because directories aren't problematic (libpathrs does O_PATH for consistency more than anything else) but it's fine either way.

@kolyshkin kolyshkin requested review from lifubang and rata March 27, 2026 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/1.5-todo A PR in main branch which needs to be backported to release-1.5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants