fix: atomic writes for userdata to prevent data loss on crash #12987

Merged
comfyanonymous merged 1 commit into master from fix/atomic-userdata-writes
Mar 17, 2026
@christian-byrne
Contributor

Summary

The POST /userdata/{file} endpoint opens the target path with "wb" (truncating it to zero bytes immediately) then writes the body. If the process crashes between truncation and write completion, the file is left as a zero-byte file and the workflow is lost.
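
The truncation window described above can be reproduced in isolation. This is a minimal sketch, not the code from user_manager.py; the directory and the `workflow.json` file name are hypothetical stand-ins for a saved workflow.

```python
import os
import tempfile

# Hypothetical file standing in for a previously saved workflow.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "workflow.json")
with open(path, "wb") as f:
    f.write(b'{"nodes": [1, 2, 3]}')

size_before = os.path.getsize(path)

# Opening with "wb" truncates the file immediately, before any new bytes
# are written. A crash at this point leaves a zero-byte file on disk.
f = open(path, "wb")
size_during = os.path.getsize(path)
f.write(b'{"nodes": [1, 2, 3, 4]}')
f.close()

size_after = os.path.getsize(path)
```

Here `size_during` is the size a restarted process would observe if the server died between the `open` and the `write`.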

This changes the write to use tempfile.mkstemp in the same directory followed by os.replace(), so either the old file remains intact or the new file is fully written — never a zero-byte intermediate state.
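
The pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the helper name `atomic_write_bytes` and the `.tmp` suffix are hypothetical, and only `tempfile.mkstemp`, `os.fdopen`, `os.replace`, and `os.unlink` are taken from the standard library.

```python
import os
import tempfile


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write data so that path holds either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so both are on the same
    # filesystem; os.replace cannot rename across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Swap the fully written temp file onto the target path.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup; never let it hide the original error.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Usage would look like `atomic_write_bytes("user/default/workflows/a.json", body)`, where the path is again only an example.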

Fixes #11298

Tradeoffs

| Concern | Assessment |
| --- | --- |
| Extra syscalls | One additional mkstemp + rename per save. Negligible vs. the HTTP round-trip + JSON serialization already happening. |
| Temp file cleanup on crash | If the process dies between mkstemp and os.replace, an orphaned temp file is left in the directory. This is strictly better than the current behavior (losing the workflow entirely). |
| Windows os.replace atomicity | os.replace is not truly atomic on NTFS but is the best available primitive. A concurrent process holding a handle (antivirus, file indexer) could cause a PermissionError, but this is the same failure mode as the current direct open("wb"), so there is no regression. |
| Custom node ecosystem | No backend hooks or file watchers exist on the user/ directory. Custom nodes reading via GET /userdata are unaffected. Nodes writing to the same path concurrently already have no coordination; atomic writes actually improve this by preventing partial reads. |
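
To make the "temp file cleanup on crash" tradeoff concrete, the sketch below simulates a process that dies after mkstemp but before os.replace. The crash is only simulated by never calling os.replace, and the file names are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "workflow.json")
with open(target, "wb") as f:
    f.write(b"old contents")

# Begin an atomic save, then "crash": os.replace is never reached.
fd, tmp_path = tempfile.mkstemp(dir=workdir)
with os.fdopen(fd, "wb") as f:
    f.write(b"new cont")  # partial write; the process dies here

# What a restarted process would find on disk.
with open(target, "rb") as f:
    surviving = f.read()
leftovers = sorted(set(os.listdir(workdir)) - {"workflow.json"})
```

The target still holds its previous contents, and the only side effect is one orphaned temp file in the same directory.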

Why this is not a performance concern

  • Autosave is off by default. When enabled, it is debounced at a minimum of 1000ms with an in-flight guard that serializes writes — this path cannot fire more than ~1x/sec regardless of edit rate.
  • Manual saves are human-rate-limited (Ctrl+S).
  • The only non-debounced writes through this path are bookmark toggles (.index.json), which are infrequent user actions.
  • The assets system (app/assets/) already uses this same os.replace pattern in ingest.py for asset uploads with no reported performance issues.

Write to a temp file in the same directory then os.replace() onto the
target path.  If the process crashes mid-write, the original file is
left intact instead of being truncated to zero bytes.

Fixes #11298
@coderabbitai

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

The post_userdata function in user_manager.py has been refactored to implement atomic file writing. The implementation creates a temporary file in the target directory, writes data to it, and then atomically replaces the original file using os.replace. The temporary file is explicitly cleaned up on operation failure. This change adds the tempfile module import and modifies the write operation logic without altering any public function signatures.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: implementing atomic writes using a temporary file pattern to prevent data loss during crashes. |
| Description check | ✅ Passed | The description clearly explains the problem (truncation + crash = zero-byte file), the solution (atomic writes via tempfile + os.replace), and addresses tradeoffs comprehensively. |
| Linked Issues check | ✅ Passed | The PR implements the exact solution requested in #11298: writing to a temporary file first, then atomically replacing the original, preventing zero-byte file loss on crash. |
| Out of Scope Changes check | ✅ Passed | All changes are in-scope: the modified post_userdata function in app/user_manager.py implements atomic write handling as specified in #11298 with no extraneous changes. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
app/user_manager.py (1)

387-389: Cleanup failure can mask the original exception.

If os.unlink(tmp_path) raises (e.g., permissions issue or race condition), the original exception that triggered the cleanup is lost—line 389's raise is never reached. Additionally, a bare except: catches KeyboardInterrupt/SystemExit.

Wrap the cleanup in its own try-except to ensure the original error propagates:

♻️ Proposed fix for robust cleanup
-            except:
-                os.unlink(tmp_path)
-                raise
+            except BaseException:
+                try:
+                    os.unlink(tmp_path)
+                except OSError:
+                    pass  # Cleanup failed; still re-raise original
+                raise
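
The masking behavior can be checked with a small standalone sketch. The function names here are hypothetical and `failing_cleanup` merely stands in for an `os.unlink` call that raises. Note that in Python 3 the original exception is still reachable through `__context__` on the new one, but the exception that propagates to the caller changes type.

```python
def failing_cleanup():
    # Stand-in for os.unlink(tmp_path) failing, e.g. on a locked file.
    raise PermissionError("cannot remove temp file")


def unguarded():
    try:
        raise ValueError("write failed")
    except BaseException:
        failing_cleanup()  # raises, so the bare raise below never runs
        raise


def guarded():
    try:
        raise ValueError("write failed")
    except BaseException:
        try:
            failing_cleanup()
        except OSError:
            pass  # cleanup failed; fall through to re-raise the original
        raise


def outcome(fn):
    """Return the type name of the exception that escapes fn."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return None
```

Calling `outcome(unguarded)` reports the cleanup error, while `outcome(guarded)` reports the original write error.
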
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/user_manager.py` around lines 387 - 389, The current bare `except:` block
around the failing operation allows `KeyboardInterrupt`/`SystemExit` to be
caught and also can lose the original exception if `os.unlink(tmp_path)` raises;
change the handler to `except Exception as err:` (preserving the original
exception in `err`), then perform cleanup in its own try/except: `try:
os.unlink(tmp_path)`; `except Exception as cleanup_err:` log or swallow
`cleanup_err` but do not replace `err`; finally re-raise the original `err`
(e.g., `raise`) so the original exception from the protected block (not any
unlink failure) always propagates; references: `tmp_path`, `os.unlink`, and the
bare `except:` in the current handler.
ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 593be20 and 499abac.

📒 Files selected for processing (1)
  • app/user_manager.py

@comfyanonymous comfyanonymous merged commit 9a870b5 into master Mar 17, 2026
15 checks passed
@comfyanonymous comfyanonymous deleted the fix/atomic-userdata-writes branch March 17, 2026 01:56
Development

Successfully merging this pull request may close these issues.

Lost workflow when saving

4 participants