Skip to content

client application process kills and consistency/correctness #308

@nurturenature

Description

@nurturenature

Hi,

When testing client application process kills, e.g.:

  • someone's phone running out of battery
  • a user turning off their laptop without shutting down
  • force quitting an app
  • etc

PowerSync is usually quite resilient with a full recovery after client is restarted and reconnects to the sync service:

  • quickly/fully downloads to catch up with the other clients
  • resumes uploading, including any pending pre-kill transactions
    • even for transactions that were in progress, only partially uploaded, when client was killed
  • continues with normal operation as if nothing happened

Test choreography:

  • repeatedly
    • 1 to a majority of clients are randomly killed
    • wait a random interval, 0-10s
    • restart/reconnect clients that were killed reusing existing SQLite3 files
    • wait a random interval, 0-10s

At the end of the test:

  • insure all clients are started/connected
  • quiesce for 3s, i.e. no further transactions
  • wait for db.uploadQueueStats.count: 0 for all clients
  • wait for db.SyncStatus.downloading: false for all clients
  • do a final read on all clients and check for Strong Convergence

Without killing, all test runs are successful.

With process kills, occasionally, ~0.5%, the test will end with:

  • db.currentStatus.downloading staying indefinitely true
  • db.currentStatus.progress counting to #/# but never nulling

and local transactions committed after the kill not being uploaded to the sync service.

There's no obvious differences between the Dart and Rust sync implementations.

There's a lot going on in the logs during a kill/restart/reconnect sequence, but here's a commented snippet from the end of a test that illustrates the issue:

# last ~20s of test

# local writes of 4461 to several keys in a transaction which are uploaded

[2025-07-20 23:53:44.467176] [main] [FINE] SQL txn: response: {clientType: ps, f: txn, type: ok, value: [{f: readAll, k: null, v: {0: 4423, ...}}, {f: writeSome, k: null, v: {88: 4461, 0: 4461, 31: 4461, 93: 4461}}]}
2025-07-20T23:53:44.464781  0:00:00.002482 POST    [200] /sql-txn
[2025-07-20 23:53:44.468805] [main] [FINER] uploadData: call: 311 txn: 610 begin CrudTransaction<610, [CrudEntry<610/2437 PATCH mww/88 {v: 4461}>, CrudEntry<610/2438 PATCH mww/0 {v: 4461}>, CrudEntry<610/2439 PATCH mww/31 {v: 4461}>, CrudEntry<610/2440 PATCH mww/93 {v: 4461}>]>
...
[2025-07-20 23:53:44.474808] [main] [FINER] uploadData: call: 311 txn: 610 committed
[2025-07-20 23:53:44.475026] [main] [FINER] uploadData: call: 311 end with UploadQueueStats.count: 0

# upload queue is now empty

# several sequences of normal downloading
# downloads fully complete until progress is null

[2025-07-20 23:53:44.480877] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 0 / 4) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
...
[2025-07-20 23:53:44.555293] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 28 / 28) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:44.555937] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
...

# local writes of 4472 to several keys in a transaction
# client will be killed before they can be successfully uploaded

[2025-07-20 23:53:44.674403] [main] [FINE] SQL txn: response: {clientType: ps, f: txn, type: ok, value: [{f: readAll, k: null, v: {0: 4461, ...}}, {f: writeSome, k: null, v: {12: 4472, 88: 4472, 87: 4472, 60: 4472}}]}

# start upload of transaction that wrote 4472
# client will be killed mid upload

[2025-07-20 23:53:44.683554] [main] [FINER] uploadData: call: 312 begin with UploadQueueStats.count: 4
[2025-07-20 23:53:44.683750] [main] [FINER] uploadData: call: 312 txn: 611 begin CrudTransaction<611, [CrudEntry<611/2441 PATCH mww/12 {v: 4472}>, CrudEntry<611/2442 PATCH mww/88 {v: 4472}>, CrudEntry<611/2443 PATCH mww/87 {v: 4472}>, CrudEntry<611/2444 PATCH mww/60 {v: 4472}>]>

# local writes of 4473 to several keys in a transaction
# client will be killed before the transaction is uploaded

[2025-07-20 23:53:44.690599] [main] [FINE] SQL txn: response: {clientType: ps, f: txn, type: ok, value: [{f: readAll, k: null, v: {0: 4461, ...}}, {f: writeSome, k: null, v: {17: 4473, 24: 4473, 64: 4473, 8: 4473}}]}

# still uploading transaction that wrote 4472
# client will be killed before upload is committed 

[2025-07-20 23:53:44.700366] [main] [FINER] uploadData: call: 312 txn: 611 patch: {12: 4472} 
[2025-07-20 23:53:44.701982] [main] [FINER] uploadData: call: 312 txn: 611 patch: {88: 4472} 
[2025-07-20 23:53:44.714241] [main] [FINER] uploadData: call: 312 txn: 611 patch: {87: 4472} 
[2025-07-20 23:53:44.719150] [main] [FINER] uploadData: call: 312 txn: 611 patch: {60: 4472} 

#
# CLIENT PROCESS KILLED
#

# client remains killed for ~7s, other clients continue with their local transactions

# client application is restarted

2025-07-20 23:53:51 Jepsen starting  /jepsen/jepsen-powersync/powersync_endpoint/powersync_http --endpoint powersync --clientImpl rust

# existing SQLite3 file is preserved

[2025-07-20 23:53:51.477284] [main] [INFO] db: init: preexisting SQLite3 file: /jepsen/jepsen-powersync/powersync_endpoint/http.sqlite3
[2025-07-20 23:53:51.477302] [main] [INFO] 	preexisting file preserved

# normal API calls for db creation, initialization, connect, etc.

# auth token is requested/created/etc

[2025-07-20 23:53:51.992961] [main] [FINER] auth: created token w/payload: {iss: https://github.com/nurturenature/jepsen-powersync, sub: userId, aud: [powersync-dev, powersync], iat: 1753055631.977, exp: 1753055931.977, parameters: {sync_filter: *}}

# upload data is called to upload the 2 transactions in the upload queue, i.e. not uploaded before the process kill
# note: BackendConnector is max write wins so greater values were written during the time client was killed

[2025-07-20 23:53:51.998536] [main] [FINER] uploadData: call: 1 begin with UploadQueueStats.count: 8
[2025-07-20 23:53:52.000342] [main] [FINER] uploadData: call: 1 txn: 611 begin CrudTransaction<611, [CrudEntry<611/2441 PATCH mww/12 {v: 4472}>, CrudEntry<611/2442 PATCH mww/88 {v: 4472}>, CrudEntry<611/2443 PATCH mww/87 {v: 4472}>, CrudEntry<611/2444 PATCH mww/60 {v: 4472}>]>
...
[2025-07-20 23:53:52.052889] [main] [FINER] uploadData: call: 1 txn: 611 committed
[2025-07-20 23:53:52.053189] [main] [FINER] uploadData: call: 1 txn: 612 begin CrudTransaction<612, [CrudEntry<612/2445 PATCH mww/17 {v: 4473}>, CrudEntry<612/2446 PATCH mww/24 {v: 4473}>, CrudEntry<612/2447 PATCH mww/64 {v: 4473}>, CrudEntry<612/2448 PATCH mww/8 {v: 4473}>]>
...
[2025-07-20 23:53:52.065657] [main] [FINER] uploadData: call: 1 txn: 612 committed
[2025-07-20 23:53:52.066858] [main] [FINER] uploadData: call: 1 end with UploadQueueStats.count: 0

# transactions writing 4472 and 4473, the transactions in the upload queue at the time of the kill
# have been upload and the queue is empty

# the following 2 new transactions, after the kill and before the end of the test, with local writes will never be uploaded

[2025-07-20 23:53:52.115743] [main] [FINE] SQL txn: response: {clientType: ps, f: txn, type: ok, value: [{f: readAll, k: null, v: {0: 4461, }}, {f: writeSome, k: null, v: {90: 4842, 80: 4842, 5: 4842, 7: 4842}}]}
[2025-07-20 23:53:52.143933] [main] [FINE] SQL txn: response: {clientType: ps, f: txn, type: ok, value: [{f: readAll, k: null, v: {0: 4461, }}, {f: writeSome, k: null, v: {65: 4843, 94: 4843, 43: 4843, 49: 4843}}]}

# client is still downloading all of the transactions from the other clients that happened while it was killed

[2025-07-20 23:53:52.160452] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 592 / 592) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
...
[2025-07-20 23:53:52.341508] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 644 / 644) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>

# it's the end of the test
# test begins waiting for the upload queue to be empty

[2025-07-20 23:53:52.372307] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:52.372329] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 644 / 644) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>

# these will be the last SyncStatus messages received

[2025-07-20 23:53:52.380560] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 644 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:52.384780] [main] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>

# test continues to wait for the upload queue to be empty, which never happens
# test will wait for 11 seconds before failing although it can wait indefinitely
# note that downloading will stay true and progress stays 656/656 and never nulls

[2025-07-20 23:53:53.373404] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:53.373433] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:54.374406] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:54.374436] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:55.375396] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:55.375433] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:56.376429] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:56.376482] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:57.377454] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:57.377504] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:58.378385] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:58.378416] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:53:59.379378] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:53:59.379412] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:54:00.380425] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:54:00.380460] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:54:01.381362] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:54:01.381392] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:54:02.382556] [main] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 8 ...
[2025-07-20 23:54:02.382598] [main] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:54:03.383374] [main] [SEVERE] UploadQueueStats.count appears to be stuck at 8 after waiting for 11s
[2025-07-20 23:54:03.383401] [main] [SEVERE] 	db.closed: false
[2025-07-20 23:54:03.383408] [main] [SEVERE] 	db.connected: true
[2025-07-20 23:54:03.383417] [main] [SEVERE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true (progress: for total: 656 / 656) uploading: false lastSyncedAt: 2025-07-20 23:53:44.000, hasSynced: true, error: null>
[2025-07-20 23:54:03.383544] [main] [SEVERE] 	db.getUploadQueueStats(): UploadQueueStats<count: 8 size: 0.15625kB>
[2025-07-20 23:54:03.383566] [main] [SEVERE] Error exit: reason: uploadQueueStatsCount, code: 12

Full client log - killed-client.zip.

As expected, network errors exist in sync service logs:

$ find . -name 'client.log' -exec grep -H -P 'error: (?!null)' {} \;
$ grep -i 'err' powersync.log 
{"cause":{"code":"ECONNREFUSED","message":"request to http://localhost:8089/api/auth/keys failed, reason: connect ECONNREFUSED 127.0.0.1:8089","name":"FetchError"},...}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions