Skip to content

Conversation

AhmedSoliman
Copy link
Contributor

@AhmedSoliman AhmedSoliman commented Jul 30, 2025

This allows the worker to flush rocksdb as soon as the worker role is stopped and before the rest of the system is shut down. In order to achieve this, datafusion queries will always use the remote scanner even on worker nodes. This adds serialization/memory cost that would have been otherwise avoided but potentially can be optimized in future PRs.


Stack created with Sapling. Best reviewed with ReviewStack.

Copy link

github-actions bot commented Jul 30, 2025

Test Results

  7 files  ±0    7 suites  ±0   4m 51s ⏱️ + 1m 48s
 54 tests ±0   53 ✅ ±0  1 💤 ±0  0 ❌ ±0 
223 runs  ±0  220 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit e713752. ± Comparison against base commit 1ef58b3.

♻️ This comment has been updated with latest results.

This introduces a few crucial changes to how we handle panics in restate. Prior to this change, we would abort the process at panic time without considering a clean shutdown nor rocksdb wal fsync.

The summary of changes is as follows:
- We now always unwind the stack on panics. TaskCenter is designed to catch panics of important tasks and trigger a clean shutdown and reports a non-zero exit code.
- Ensure that on graceful shutdown timeout that we attempt to cleanly flush/shutdown rocksdb manager. This is important to avoid massive backfills of lost memtables on unclean shutdown.
- Catch panics at top-level task-center runtime control loop and trigger an emergency rocksdb WAL fsync to ensure that we flush the WAL to avoid loss of in-memory WAL buffer if/when we add support to manual wal flushing in the future.
- Makes sure that panics from network connection tasks do not trigger a system shutdown, instead, they are caught and properly logged. This avoids a situation where a network bad request/handler can cause the entire node to panic.
- In situations where tracing might have been lost/dropped, ensure that we also log critical information on stderr.

This hardens restate against unclean crashes and ensures we perform a clean handoff to other cluster members in case of an unrecoverable crash.
This allows the worker to flush rocksdb as soon as the worker role is stopped and before the rest of the system is shut down. In order to achieve this, datafusion queries will always use the remote scanner even on worker nodes. This adds serialization/memory cost that would have been otherwise avoided but potentially can be optimized in future PRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant