RFC: Crash Recovery Mechanism for Stargz Snapshotter

## Abstract

This proposal introduces a crash recovery mechanism for stargz-snapshotter that addresses reliability issues and hot-upgrade limitations in the current FUSE-based implementation. By preserving `/dev/fuse` file descriptors, refactoring the FUSE filesystem, and optimizing recovery procedures, we enable seamless recovery after service crashes, minimizing the impact on container workloads and significantly improving service reliability in production environments.

## 1 Background & Motivation

Stargz-snapshotter currently uses FUSE to implement a userspace filesystem, providing on-demand loading of container images to reduce startup latency. However, this architecture faces two critical challenges in production:

1. FUSE Reliability Issues: Userspace filesystems pose stability risks; process failures can disrupt filesystem access.
2. No Hot-Upgrade Support: The current design does not allow the service to restart seamlessly; containers often fail to resume normal operation after a restart.

Demo: 
https://asciinema.org/a/wgkSkJ1hQl2CSpn1
<img src="https://github.com/user-attachments/assets/a43fa358-a9ce-4988-81b1-0c52e7c4c46d"  width="60%" />

Two key issues observed:

- Filesystem Unavailable After Restart: When the FUSE connection breaks, running containers see "stale file handle" errors.
- Prolonged Recovery Time: Recovery time under the existing restart mechanism scales linearly with the number of image layers (e.g., 1 minute 38 seconds for 70 layers across 12 images), severely impacting availability.

## 2 Solution & Achievements

We designed and implemented a complete crash recovery mechanism with the following core benefits:

### 2.1 Fast Recovery

- Recovery no longer depends on network downloads of existing data.
- Recovery time is reduced from minutes to milliseconds (e.g., from 1m38s to under 400ms), minimizing service disruption.

### 2.2 Transparency to Running Containers

- Crash recovery is largely transparent to container workloads.
- File operations may see a brief latency spike (~400ms) during recovery, without affecting overall container operation.

## 3 Technical Approach

### 3.1 FUSE Connection Preservation (`fdstore`)

- Uses systemd's file descriptor store (`fdstore`) to preserve `/dev/fuse` file descriptors across restarts, so the kernel does not terminate the FUSE connection.
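
To make the fd store handoff concrete, here is a minimal sketch using only the Go standard library. It is not the code in `fs/fdstore/fdstore.go`; the function names `storeFD` and `storedFDs` are hypothetical. It assumes the unit sets `FileDescriptorStoreMax=` to a non-zero value, that the descriptor is pushed with an `FDSTORE=1` notification carrying the fd via `SCM_RIGHTS`, and that after a restart systemd hands stored descriptors back through `LISTEN_PID`, `LISTEN_FDS`, and `LISTEN_FDNAMES`, starting at descriptor 3.

```go
package main

import (
	"errors"
	"net"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// listenFDsStart is the first descriptor number systemd uses when it
// passes descriptors to a service.
const listenFDsStart = 3

// storeFD (hypothetical helper) asks systemd to keep fd in the unit's
// file descriptor store under the given name. The name must not contain ':'.
func storeFD(name string, fd int) error {
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		return errors.New("NOTIFY_SOCKET is not set; not started by systemd?")
	}
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	// The descriptor travels as SCM_RIGHTS ancillary data next to the message.
	msg := []byte("FDSTORE=1\nFDNAME=" + name)
	_, _, err = conn.WriteMsgUnix(msg, syscall.UnixRights(fd), nil)
	return err
}

// storedFDs (hypothetical helper) maps each stored name to the descriptor
// number systemd passed back after a restart. If several descriptors share
// a name, the last one wins in this simplified version.
func storedFDs(getenv func(string) string, pid int) map[string]int {
	out := map[string]int{}
	// LISTEN_PID must match our own PID, otherwise the fds are not for us.
	if p, err := strconv.Atoi(getenv("LISTEN_PID")); err != nil || p != pid {
		return out
	}
	n, err := strconv.Atoi(getenv("LISTEN_FDS"))
	if err != nil || n <= 0 {
		return out
	}
	names := strings.Split(getenv("LISTEN_FDNAMES"), ":")
	for i := 0; i < n && i < len(names); i++ {
		out[names[i]] = listenFDsStart + i
	}
	return out
}
```

On startup a service would call `storedFDs(os.Getenv, os.Getpid())` and, for each entry it recognizes, resume serving the existing FUSE connection instead of mounting again. Taking `getenv` as a parameter keeps the parsing logic easy to test without touching the process environment.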

### 3.2 Low-Level FUSE API & Metadata Persistence

- Re-implements the read-only filesystem using the low-level FUSE API to minimize stateful information (e.g., nodeid allocation).
- Adopts fixed NodeID allocation to ensure consistent inode identifiers after restart.
- Persists TOC and other metadata to avoid redundant network fetches.
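
The reason fixed NodeID allocation matters is that the kernel keeps referring to inodes by the node IDs it learned before the crash, so the restarted process must arrive at the same IDs without any saved allocator state. The sketch below shows one way to get that property, assuming IDs are derived purely from the sorted set of paths in a layer; the function `assignNodeIDs` is hypothetical, and the actual implementation may derive IDs differently (for example, from TOC entry offsets).

```go
package main

import "sort"

// rootNodeID is the node ID the FUSE protocol reserves for the mount root.
const rootNodeID uint64 = 1

// assignNodeIDs (hypothetical helper) returns a path-to-NodeID mapping that
// depends only on the set of paths, not on their order or on any counter
// kept in memory. The empty path stands for the layer root.
func assignNodeIDs(paths []string) map[string]uint64 {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted)

	ids := make(map[string]uint64, len(sorted)+1)
	ids[""] = rootNodeID
	next := rootNodeID + 1
	for _, p := range sorted {
		if _, seen := ids[p]; seen {
			continue // skip duplicates and the root itself
		}
		ids[p] = next
		next++
	}
	return ids
}
```

Because the mapping is a pure function of the persisted metadata, rebuilding it after a restart yields the same IDs the kernel already holds, regardless of the order in which entries are read back.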

### 3.3 Recovery Process Optimization

- Reuses local caches (TOC, blob sizes, fscache, httpcache) to eliminate redundant network traffic on restart.
- Parallel Recovery: Supports concurrent layer recovery to utilize multi-core capabilities.
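
Concurrent layer recovery can be sketched as a bounded worker pattern. This is an illustration rather than the code in `fs/fs.go`: `restoreAll`, the `restore` callback, and the `errNoCache` sentinel are hypothetical names, and the sketch assumes each layer can be restored independently from local caches.

```go
package main

import (
	"errors"
	"runtime"
	"sync"
)

// errNoCache is a hypothetical sentinel a restore callback could return
// when a layer's local cache is missing.
var errNoCache = errors.New("local cache missing")

// restoreAll (hypothetical helper) restores layers concurrently, running at
// most `workers` restores at a time, and returns the layers that failed
// together with their errors.
func restoreAll(layers []string, workers int, restore func(layer string) error) map[string]error {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	sem := make(chan struct{}, workers) // counting semaphore
	failed := map[string]error{}
	var mu sync.Mutex // guards failed
	var wg sync.WaitGroup

	for _, l := range layers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` restores are in flight
		go func(layer string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := restore(layer); err != nil {
				mu.Lock()
				failed[layer] = err
				mu.Unlock()
			}
		}(l)
	}
	wg.Wait()
	return failed
}
```

Returning per-layer errors instead of aborting on the first failure lets healthy layers come back immediately, while a failed layer can fall back to a slower path such as refetching its metadata.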

### 3.4 go-fuse Support

(Upstream Linux kernel support is in place)

- Adds `fuse_resend` functionality.
- Gets the `/dev/fuse` file descriptor from the FUSE server.
- Restores mountpoints from a specified `/dev/fuse` file descriptor.
- Supports cleanup of all passthrough backend IDs (https://lore.kernel.org/linux-fsdevel/20260119083750.2055-1-luochunsheng@ustc.edu/).

## 4 Validation

We conducted full functional and performance testing, confirming:

1. Crash Recovery Transparency: Containers continue to access the filesystem normally after snapshotter restart.
2. Performance Improvement: Recovery time is reduced from minutes to milliseconds.

Demo:

https://asciinema.org/a/Dq2yiUL1Y0KfmnME
<img src="https://github.com/user-attachments/assets/18446bf3-7eda-4668-945c-3711ae170222"  width="60%" />


## 5 Modification Scope

### 5.1 stargz-snapshotter Modifications

- Files modified: 23 files, with 2,859 insertions and 257 deletions
```shell
 cache/cache.go                          |    6 +-
 cmd/containerd-stargz-grpc/db/db.go     |    8 +
 cmd/containerd-stargz-grpc/db/reader.go |   47 +++++
 cmd/containerd-stargz-grpc/main.go      |    1 +
 cmd/go.mod                              |    1 +
 estargz/estargz.go                      |   89 +++++++-
 estargz/restore.go                      |  206 ++++++++++++++++++
 fs/config/config.go                     |    4 +
 fs/fdstore/fdstore.go                   |  299 +++++++++++++++++++++++++++
 fs/fs.go                                |  226 ++++++++++++++++++--
 fs/layer/layer.go                       |   70 ++++++-
 fs/layer/node.go                        |  175 +++-------------
 fs/layer/node_lowlevel.go               | 1627 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/layer/testutil.go                    |    1 +
 fs/reader/reader.go                     |   33 ++-
 fs/remote/blob.go                       |  106 +++++++++-
 fs/remote/resolver.go                   |   46 ++++-
 fusemanager/client.go                   |    4 +
 go.mod                                  |    1 +
 metadata/memory/reader.go               |   54 ++++-
 metadata/metadata.go                    |   10 +
 snapshot/snapshot.go                    |   99 +++++----
 23 files changed, 2859 insertions(+), 257 deletions(-)
```

### 5.2 go-fuse Modifications

- Files modified: 3 files, with 143 insertions and 3 deletions
```shell
 fuse/opcode.go            |   4 +++-
 fuse/passthrough_linux.go |  19 +++++++++++++++++--
 fuse/server.go            | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+), 3 deletions(-)
```

## 6 Contribution Plan

We welcome community feedback on this proposal and plan to contribute code incrementally based on discussion outcomes:

1. Submit the design document for community review.
2. Contribute code implementation in modular phases.

## 7 Conclusion

This crash recovery mechanism significantly enhances stargz-snapshotter's reliability in production, solving long-standing FUSE stability and hot-upgrade challenges. We believe this improvement will facilitate broader adoption of stargz-snapshotter in production environments and provide a more robust foundation for container startup acceleration.

We welcome community feedback and discussion to refine this proposal and collaborate on integrating it into the upstream project.
