RFC: Crash Recovery Mechanism for Stargz Snapshotter #2238

@luochenglcs

Abstract

This proposal introduces a crash recovery mechanism for stargz-snapshotter to address reliability issues and hot-upgrade limitations in the current FUSE-based implementation. By preserving /dev/fuse file descriptors, refactoring the FUSE filesystem, and optimizing recovery procedures, we enable seamless recovery after service crashes, minimizing impact on container workloads and significantly improving service reliability in production environments.

1 Background & Motivation

Stargz-snapshotter currently uses FUSE to implement a userspace filesystem, providing on-demand loading of container images to reduce startup latency. However, this architecture faces two critical challenges in production:

  1. FUSE Reliability Issues: Userspace filesystems pose stability risks; process failures can disrupt filesystem access.
  2. No Hot-Upgrade Support: The current design prevents seamless service restarts; containers often fail to resume normal operation after the snapshotter restarts.

Demo:
https://asciinema.org/a/wgkSkJ1hQl2CSpn1

Two key issues observed:

  • Filesystem Unavailable After Restart: Restarting the service breaks the FUSE connection, causing "stale file handle" errors in running containers.
  • Prolonged Recovery Time: Recovery time of the existing restart mechanism scales linearly with the number of image layers (e.g., 1 minute 38 seconds for 70 layers across 12 images, roughly 1.4 seconds per layer), severely impacting availability.

2 Solution & Achievements

We designed and implemented a complete crash recovery mechanism with the following core benefits:

2.1 Fast Recovery

  • Recovery no longer re-downloads data that is already available locally.
  • Recovery time is reduced from minutes to milliseconds (e.g., from 1m38s to under 400ms), minimizing service disruption.

2.2 Transparency to Running Containers

  • Crash recovery is largely transparent to container workloads.
  • File operations issued during recovery may see brief added latency (~400ms); overall container operation is otherwise unaffected.

3 Technical Approach

3.1 FUSE Connection Preservation (fdstore)

  • Uses systemd's file descriptor store (fdstore) to preserve /dev/fuse file descriptors across restarts, so the kernel does not terminate the FUSE connection when the snapshotter process exits.
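
As a rough illustration of the fdstore idea (not the code from this proposal), the sketch below uses only the documented sd_notify(3) and sd_listen_fds(3) protocols: a descriptor is handed to systemd with an `FDSTORE=1` message plus `SCM_RIGHTS`, and after a restart it comes back as descriptor 3 and up, named through `LISTEN_FDNAMES`. The function names `fdstoreMessage`, `storeFD`, and `storedFDs` are hypothetical, the code is Linux-only, and the unit file is assumed to set `FileDescriptorStoreMax=` to a non-zero value.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// listenFDsStart is the first descriptor number systemd passes to a service
// (SD_LISTEN_FDS_START in sd_listen_fds(3)).
const listenFDsStart = 3

// fdstoreMessage builds the sd_notify payload asking systemd to keep a
// descriptor under the given name.
func fdstoreMessage(name string) string {
	return "FDSTORE=1\nFDNAME=" + name
}

// storeFD sends fd to the service manager over $NOTIFY_SOCKET, attaching the
// descriptor as SCM_RIGHTS ancillary data. Hypothetical helper.
func storeFD(name string, fd int) error {
	path := os.Getenv("NOTIFY_SOCKET")
	if path == "" {
		return errors.New("NOTIFY_SOCKET is not set; not started by systemd?")
	}
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: path, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, _, err = conn.WriteMsgUnix([]byte(fdstoreMessage(name)), syscall.UnixRights(fd), nil)
	return err
}

// storedFDs maps each name in LISTEN_FDNAMES to its descriptor number.
// It returns nil if the descriptors were not passed to this process.
// getenv is injected so the parsing can be exercised without systemd.
func storedFDs(pid int, getenv func(string) string) (map[string]int, error) {
	if getenv("LISTEN_PID") != strconv.Itoa(pid) {
		return nil, nil
	}
	n, err := strconv.Atoi(getenv("LISTEN_FDS"))
	if err != nil {
		return nil, fmt.Errorf("invalid LISTEN_FDS: %w", err)
	}
	names := strings.Split(getenv("LISTEN_FDNAMES"), ":")
	fds := make(map[string]int, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("fd%d", i) // fallback when no name was supplied
		if i < len(names) && names[i] != "" {
			name = names[i]
		}
		fds[name] = listenFDsStart + i
	}
	return fds, nil
}
```

In this sketch a snapshotter would call `storeFD` once per mount right after opening /dev/fuse, and call `storedFDs(os.Getpid(), os.Getenv)` at startup to find descriptors left over from the previous instance. Names must be unique per mount, since duplicate names would overwrite each other in the returned map.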

3.2 Low-Level FUSE API & Metadata Persistence

  • Re-implements the read-only filesystem using the low-level FUSE API to minimize stateful information (e.g., nodeid allocation).
  • Adopts fixed NodeID allocation to ensure consistent inode identifiers after restart.
  • Persists TOC and other metadata to avoid redundant network fetches.
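
Fixed NodeIDs matter because the kernel keeps using the node IDs it learned before the crash; the restarted server has to resolve each of them to the same file. The sketch below shows one way such a deterministic mapping could look, assuming IDs are derived from the sorted paths of a layer's persisted TOC. This scheme and the name `assignNodeIDs` are illustrative assumptions, not the allocation used in the proposal.

```go
package main

import "sort"

// rootNodeID is the node ID the FUSE protocol reserves for the mount root.
const rootNodeID uint64 = 1

// assignNodeIDs derives node IDs purely from the set of paths in a layer,
// so any process that reads the same TOC computes the same IDs.
// The root directory is represented by the empty path.
func assignNodeIDs(paths []string) map[string]uint64 {
	// Sort a copy so the result does not depend on input order.
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted)

	ids := make(map[string]uint64, len(sorted)+1)
	ids[""] = rootNodeID
	next := rootNodeID + 1
	for _, p := range sorted {
		if _, seen := ids[p]; seen {
			continue // skip the root and duplicate entries
		}
		ids[p] = next
		next++
	}
	return ids
}
```

Because the mapping depends only on the TOC contents, a restarted process that reloads the persisted TOC reproduces the same IDs without keeping an in-memory allocation table.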

3.3 Recovery Process Optimization

  • Reuses local caches (TOC, blob sizes, fscache, httpcache) to eliminate redundant network traffic on restart.
  • Parallel Recovery: Supports concurrent layer recovery to utilize multi-core capabilities.
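
Concurrent layer recovery can be sketched with a bounded pool of goroutines. The function `restoreLayers` and its `restore` callback are hypothetical stand-ins for whatever per-layer recovery the snapshotter performs; `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"sync"
)

// restoreLayers runs restore for every layer, with at most `workers`
// restores in flight at once. It waits for all of them and returns the
// collected errors (nil if every layer was restored).
func restoreLayers(layers []string, workers int, restore func(layer string) error) error {
	if workers < 1 {
		workers = 1
	}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	sem := make(chan struct{}, workers) // counting semaphore
	for _, l := range layers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` restores are running
		go func(layer string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := restore(layer); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(l)
	}
	wg.Wait()
	return errors.Join(errs...)
}
```

Collecting errors instead of stopping at the first one lets the remaining layers come back even if one fails, which seems preferable when running containers are waiting on them; whether the actual implementation does this is not stated in the proposal.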

3.4 go-fuse support

The required Linux kernel support is already upstream; the remaining changes are in go-fuse (see 5.2).

4 Validation

We conducted full functional and performance testing, confirming:

  1. Crash Recovery Transparency: Containers continue to access the filesystem normally after snapshotter restart.
  2. Performance Improvement: Recovery time reduced from minutes to milliseconds (from 1m38s to under 400ms in the 70-layer case above).

Demo:

https://asciinema.org/a/Dq2yiUL1Y0KfmnME

5 Modification Scope

5.1 stargz-snapshotter Modifications

  • Files modified: 23 files, 2,859 insertions and 257 deletions
 cache/cache.go                          |    6 +-
 cmd/containerd-stargz-grpc/db/db.go     |    8 +
 cmd/containerd-stargz-grpc/db/reader.go |   47 +++++
 cmd/containerd-stargz-grpc/main.go      |    1 +
 cmd/go.mod                              |    1 +
 estargz/estargz.go                      |   89 +++++++-
 estargz/restore.go                      |  206 ++++++++++++++++++
 fs/config/config.go                     |    4 +
 fs/fdstore/fdstore.go                   |  299 +++++++++++++++++++++++++++
 fs/fs.go                                |  226 ++++++++++++++++++--
 fs/layer/layer.go                       |   70 ++++++-
 fs/layer/node.go                        |  175 +++-------------
 fs/layer/node_lowlevel.go               | 1627 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/layer/testutil.go                    |    1 +
 fs/reader/reader.go                     |   33 ++-
 fs/remote/blob.go                       |  106 +++++++++-
 fs/remote/resolver.go                   |   46 ++++-
 fusemanager/client.go                   |    4 +
 go.mod                                  |    1 +
 metadata/memory/reader.go               |   54 ++++-
 metadata/metadata.go                    |   10 +
 snapshot/snapshot.go                    |   99 +++++----
 23 files changed, 2859 insertions(+), 257 deletions(-)

5.2 go-fuse Modifications

  • Files modified: 3 files, 143 insertions and 3 deletions
 fuse/opcode.go            |   4 +++-
 fuse/passthrough_linux.go |  19 +++++++++++++++++--
 fuse/server.go            | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+), 3 deletions(-)

6 Contribution Plan

We welcome community feedback on this proposal and plan to contribute code incrementally based on discussion outcomes:

  1. Submit the design document for community review.
  2. Contribute code implementation in modular phases.

7 Conclusion

This crash recovery mechanism significantly enhances stargz-snapshotter's reliability in production, solving long-standing FUSE stability and hot-upgrade challenges. We believe this improvement will facilitate broader adoption of stargz-snapshotter in production environments and provide a more robust foundation for container startup acceleration.

We welcome community feedback and discussion to refine this proposal and collaborate on integrating it into the upstream project.
