Abstract
This proposal introduces a comprehensive crash recovery mechanism for stargz-snapshotter to address reliability issues and hot-upgrade limitations in the current FUSE-based implementation. By preserving /dev/fuse file descriptors, refactoring the FUSE filesystem, and optimizing recovery procedures, we enable seamless recovery after service crashes, minimizing impact on container workloads and significantly improving service reliability in production environments.
1 Background & Motivation
Stargz-snapshotter currently uses FUSE to implement a userspace filesystem, providing on-demand loading of container images to reduce startup latency. However, this architecture faces two critical challenges in production:
- FUSE Reliability Issues: Userspace filesystems pose stability risks; process failures can disrupt filesystem access.
- No Hot-Upgrade Support: The current design prevents seamless service restarts; containers often fail to resume normal operation after the snapshotter restarts.
Demo:
https://asciinema.org/a/wgkSkJ1hQl2CSpn1

Two key issues observed:
- Filesystem Unavailable After Restart: A restart breaks the FUSE connection, causing "stale file handle" errors in running containers.
- Prolonged Recovery Time: Recovery time under the existing restart mechanism scales linearly with the number of image layers (e.g., 1 minute 38 seconds for 70 layers across 12 images), severely impacting availability.
2 Solution & Achievements
We designed and implemented a complete crash recovery mechanism with the following core benefits:
2.1 Fast Recovery
- Recovery no longer depends on network downloads of existing data.
- Recovery time is reduced from minutes to milliseconds (e.g., from 1m38s to under 400ms), minimizing service disruption.
2.2 Transparency to Running Containers
- Crash recovery is largely transparent to container workloads.
- File operations may see a brief delay (~400ms) during recovery, without affecting overall container operation.
3 Technical Approach
3.1 FUSE Connection Preservation (fdstore)
- Use systemd's fdstore to preserve /dev/fuse file descriptors across restarts, preventing kernel FUSE connection termination.
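The fdstore handoff can be illustrated with a minimal sketch. It is not the code from fs/fdstore/fdstore.go; it only shows the systemd protocol that the approach relies on, assuming the unit sets FileDescriptorStoreMax= to a non-zero value. The names storeFD, restoredFDs, and namedFD are hypothetical. Storing sends "FDSTORE=1" over $NOTIFY_SOCKET with the descriptor attached as SCM_RIGHTS; after a restart, systemd passes the stored descriptors back starting at fd 3 and describes them through the LISTEN_PID, LISTEN_FDS, and LISTEN_FDNAMES environment variables.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// listenFDsStart is the first descriptor number systemd hands back
// (SD_LISTEN_FDS_START in the sd-daemon API).
const listenFDsStart = 3

// namedFD pairs a restored descriptor with the FDNAME it was stored under.
// Hypothetical type, for illustration only.
type namedFD struct {
	Name string
	FD   int
}

// storeFD asks systemd to keep fd in the service's file descriptor store.
func storeFD(name string, fd int) error {
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		return errors.New("NOTIFY_SOCKET is not set")
	}
	// Go treats a leading '@' in the name as an abstract socket address.
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	// The descriptor travels as SCM_RIGHTS ancillary data next to the message.
	msg := []byte("FDSTORE=1\nFDNAME=" + name)
	_, _, err = conn.WriteMsgUnix(msg, syscall.UnixRights(fd), nil)
	return err
}

// restoredFDs interprets the LISTEN_* values systemd sets on restart.
// It returns nothing if the values were not addressed to this process.
func restoredFDs(pid int, listenPID, listenFDs, listenNames string) []namedFD {
	if p, err := strconv.Atoi(listenPID); err != nil || p != pid {
		return nil
	}
	n, err := strconv.Atoi(listenFDs)
	if err != nil || n <= 0 {
		return nil
	}
	names := strings.Split(listenNames, ":")
	out := make([]namedFD, 0, n)
	for i := 0; i < n; i++ {
		name := "unknown" // placeholder when no name was passed for this fd
		if i < len(names) && names[i] != "" {
			name = names[i]
		}
		out = append(out, namedFD{Name: name, FD: listenFDsStart + i})
	}
	return out
}

func main() {
	fds := restoredFDs(os.Getpid(), os.Getenv("LISTEN_PID"),
		os.Getenv("LISTEN_FDS"), os.Getenv("LISTEN_FDNAMES"))
	fmt.Printf("restored %d descriptor(s): %v\n", len(fds), fds)
}
```

In this sketch each mountpoint would call storeFD once with a unique name (for example, one derived from the snapshot key), so that the restarted process can match every restored descriptor to its mountpoint.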
3.2 Low-Level FUSE API & Metadata Persistence
- Re-implements the read-only filesystem using the low-level FUSE API to minimize stateful information (e.g., nodeid allocation).
- Adopts fixed NodeID allocation to ensure consistent inode identifiers after restart.
- Persists TOC and other metadata to avoid redundant network fetches.
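One way to picture fixed NodeID allocation is to derive each ID from the entry's position in the persisted TOC, so that a restarted process rebuilds exactly the same mapping without saving any allocator state. The sketch below works under that assumption; the proposal does not spell out the actual scheme, and the nodeIDs type and its methods are hypothetical. The only protocol fact it relies on is that FUSE reserves node ID 1 for the mount root.

```go
package main

import "fmt"

// rootNodeID is the node ID the FUSE protocol reserves for the mount root
// (FUSE_ROOT_ID in the kernel headers).
const rootNodeID uint64 = 1

// nodeIDs maps TOC entry paths to node IDs and back.
// Hypothetical type: IDs depend only on TOC order, never on lookup order.
type nodeIDs struct {
	paths []string          // paths[i] owns node ID i+2
	ids   map[string]uint64 // reverse index, path -> node ID
}

// newNodeIDs builds the mapping from TOC entries in their stored order.
func newNodeIDs(tocPaths []string) *nodeIDs {
	n := &nodeIDs{paths: tocPaths, ids: map[string]uint64{"": rootNodeID}}
	for i, p := range tocPaths {
		n.ids[p] = uint64(i) + rootNodeID + 1
	}
	return n
}

// ID returns the node ID for a path ("" is the root).
func (n *nodeIDs) ID(path string) (uint64, bool) {
	id, ok := n.ids[path]
	return id, ok
}

// Path resolves a node ID sent by the kernel back to its TOC path.
func (n *nodeIDs) Path(id uint64) (string, bool) {
	if id == rootNodeID {
		return "", true
	}
	if id < rootNodeID || id-rootNodeID-1 >= uint64(len(n.paths)) {
		return "", false
	}
	return n.paths[id-rootNodeID-1], true
}

func main() {
	toc := []string{"etc", "etc/hosts", "usr"}
	before := newNodeIDs(toc) // mapping before the crash
	after := newNodeIDs(toc)  // mapping rebuilt after restart
	for _, p := range toc {
		a, _ := before.ID(p)
		b, _ := after.ID(p)
		fmt.Printf("%-10s before=%d after=%d\n", p, a, b)
	}
}
```

Because the kernel keeps referring to nodes by the IDs it learned before the crash, any scheme of this kind has to be a pure function of persisted data; an allocator that hands out IDs in lookup order would not survive a restart.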
3.3 Recovery Process Optimization
- Reuses local caches (TOC, blob sizes, fscache, httpcache) to eliminate redundant network traffic on restart.
- Parallel Recovery: Supports concurrent layer recovery to utilize multi-core capabilities.
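Concurrent layer recovery can be sketched with a bounded worker pattern. This is an illustration only: restoreAll, the restore callback, and errCorruptCache are hypothetical names, and the real recovery path in fs/fs.go may be structured differently. The sketch caps the number of layers restored at once and records one result per layer, so a failure in one layer does not hide the outcome of the others.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// errCorruptCache is a hypothetical error used by the example callback.
var errCorruptCache = errors.New("local cache is corrupt")

// restoreAll runs restore for every layer, at most `workers` at a time,
// and returns one error slot per layer (nil on success).
func restoreAll(layers []string, workers int, restore func(layer string) error) []error {
	if workers < 1 {
		workers = 1
	}
	errs := make([]error, len(layers))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, layer := range layers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` restores are in flight
		go func(i int, layer string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only its own slot, so no mutex is needed.
			errs[i] = restore(layer)
		}(i, layer)
	}
	wg.Wait()
	return errs
}

func main() {
	layers := []string{"layer-a", "layer-b", "layer-c"}
	errs := restoreAll(layers, runtime.NumCPU(), func(layer string) error {
		// A real callback would reopen the cached TOC and blob metadata here.
		if layer == "layer-b" {
			return errCorruptCache
		}
		return nil
	})
	for i, err := range errs {
		fmt.Printf("%s: %v\n", layers[i], err)
	}
}
```

A layer whose slot is non-nil could then fall back to the slower path that refetches metadata from the registry, while the remaining layers stay on the fast path.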
3.4 go-fuse Support
(Upstream Linux kernel support is in place)
- Adds fuse_resend functionality.
- Gets the /dev/fuse file descriptor from the FUSE server.
- Restores mountpoints from a specified /dev/fuse file descriptor.
- Supports cleanup of all passthrough backend IDs. (https://lore.kernel.org/linux-fsdevel/[email protected]/)
4 Validation
We conducted full functional and performance testing, confirming:
- Crash Recovery Transparency: Containers continue to access the filesystem normally after snapshotter restart.
- Performance Improvement: Recovery time reduced from minutes to milliseconds.
Demo:
https://asciinema.org/a/Dq2yiUL1Y0KfmnME

5 Modification Scope
5.1 stargz-snapshotter Modifications
- Files modified: 23 files; 2,859 insertions and 257 deletions
cache/cache.go | 6 +-
cmd/containerd-stargz-grpc/db/db.go | 8 +
cmd/containerd-stargz-grpc/db/reader.go | 47 +++++
cmd/containerd-stargz-grpc/main.go | 1 +
cmd/go.mod | 1 +
estargz/estargz.go | 89 +++++++-
estargz/restore.go | 206 ++++++++++++++++++
fs/config/config.go | 4 +
fs/fdstore/fdstore.go | 299 +++++++++++++++++++++++++++
fs/fs.go | 226 ++++++++++++++++++--
fs/layer/layer.go | 70 ++++++-
fs/layer/node.go | 175 +++-------------
fs/layer/node_lowlevel.go | 1627 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/layer/testutil.go | 1 +
fs/reader/reader.go | 33 ++-
fs/remote/blob.go | 106 +++++++++-
fs/remote/resolver.go | 46 ++++-
fusemanager/client.go | 4 +
go.mod | 1 +
metadata/memory/reader.go | 54 ++++-
metadata/metadata.go | 10 +
snapshot/snapshot.go | 99 +++++----
23 files changed, 2859 insertions(+), 257 deletions(-)
5.2 go-fuse Modifications
- Files modified: 3 files; 143 insertions and 3 deletions
fuse/opcode.go | 4 +++-
fuse/passthrough_linux.go | 19 +++++++++++++++++--
fuse/server.go | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 143 insertions(+), 3 deletions(-)
6 Contribution Plan
We welcome community feedback on this proposal and plan to contribute code incrementally based on discussion outcomes:
- Submit the design document for community review.
- Contribute code implementation in modular phases.
7 Conclusion
This crash recovery mechanism significantly enhances stargz-snapshotter's reliability in production, solving long-standing FUSE stability and hot-upgrade challenges. We believe this improvement will facilitate broader adoption of stargz-snapshotter in production environments and provide a more robust foundation for container startup acceleration.
We welcome community feedback and discussion to refine this proposal and collaborate on integrating it into the upstream project.