-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Hi,
Transient network partitions can disrupt replication.
Most commonly, the client stops uploading, just indefinitely queuing local writes.
In other cases, less frequently, the tests complete without any indication of errors, local writes are uploaded to PostgreSQL, but only replicated to 0 or a subset of clients.
There's no obvious differences between the Dart and Rust implementations.
Partition behavior:
- repeatedly
- 0 to all client hosts are randomly partitioned from the PowerSync sync service host
- the client network failure is randomly selected to be none, inbound, outbound, or all traffic
- wait a random interval, 0-10s
- heal network partition
- wait a random interval, 0-10s
- 0 to all client hosts are randomly partitioned from the PowerSync sync service host
At the end of the test:
- insure network is healed for all clients
- quiesce for 3s, i.e. no further transactions
- wait for
db.uploadQueueStats.count: 0
for all clients - wait for
db.SyncStatus.downloading: false
for all clients - do a final read on all clients and check for Strong Convergence
Without partitioning, all test runs are successful.
Successful Tests
During successful runs you can see:
# partition clients
[2025-07-20 02:18:23.222360] [main] [INFO] nemesis: partition: starting: outbound
[2025-07-20 02:18:23.236820] [main] [INFO] nemesis: partition: current: outbound, hosts: {powersync}
# clients continue with local transactions
[2025-07-20 02:18:23.254476] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 15, type: ok, value: [{f: readAll, v: {0: -1, 1: 42, 2: 21, ...}}, {f: writeSome, v: {16: 62, 53: 62, 50: 62, 33: 62}}]}
# partition is healed
[2025-07-20 02:18:31.337347] [main] [INFO] nemesis: partition: starting: none
[2025-07-20 02:18:31.363529] [main] [INFO] nemesis: partition: current: none, hosts: {}
# clients catchup with downloads from sync service
[2025-07-20 02:18:36.427705] [ps-1] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 4 / 8) uploading: false lastSyncedAt: 2025-07-20 02:18:23.000, hasSynced: true, error: null>
...
[2025-07-20 02:18:36.439118] [ps-1] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 20 / 20) uploading: true lastSyncedAt: 2025-07-20 02:18:23.000, hasSynced: true, error: null>
# clients upload to sync service
[2025-07-20 02:18:36.439257] [ps-1] [FINER] uploadData: call: 17 begin with UploadQueueStats.count: 108
# local transactions, downloading, uploading go on as normal
...
# at the end of the test
[2025-07-20 02:20:09.373311] [main] [INFO] wait for upload queue to be empty in clients
[2025-07-20 02:20:09.375017] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 0, type: ok, value: {f: uploadQueueCount, v: {db.uploadQueueStats.count: 0}}}
...
[2025-07-20 02:20:09.379592] [main] [INFO] wait for downloading to be false in clients
[2025-07-20 02:20:09.379765] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 2, type: ok, value: {f: downloadingWait, v: {db.SyncStatus.downloading:: false}}}
...
[2025-07-20 02:20:09.380379] [main] [INFO] check for strong convergence in final reads
[2025-07-20 02:20:09.505932] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 3, type: ok, value: {f: selectAll, v: {0: 938, 1: 964, 2: 989, ...}}}
...
[2025-07-20 02:20:09.528708] [main] [INFO] Success!
[2025-07-20 02:20:09.528714] [main] [INFO] Final reads had Strong Convergence
[2025-07-20 02:20:09.528717] [main] [INFO] All transactions were Atomic and Causally Consistent
As expected, network errors exist in sync service logs:
$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log
{"cause":{"code":"ECONNREFUSED","message":"request to http://localhost:8089/api/auth/keys failed, reason: connect ECONNREFUSED 127.0.0.1:8089","name":"FetchError"},...}
Wedged/Stalled Client Uploads
At the end of the test with:
- healed partitions
- no transactions
uploading remains true
, but the upload queue never clears:
- note lack of updated
SyncStatus
, or other messages db.UploadQueueStats.count
is wedged/stuck, never changes- we wait 11 seconds with no upload activity before ending the test, but the wait can be indefinite. i.e. client never recovers
[2025-07-15 13:27:10.486902] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:10.486917] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:11.487930] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:11.487946] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:12.488792] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:12.488834] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:13.489750] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:13.489785] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:14.490856] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:14.490984] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:15.491745] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:15.491817] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:16.492777] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:16.492787] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:17.493804] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:17.493867] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:18.494741] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:18.494804] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:19.495766] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:19.495905] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:20.496789] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:20.496848] [ps-1] [FINE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:21.497830] [ps-1] [SEVERE] UploadQueueStats.count appears to be stuck at 200 after waiting for 11s
[2025-07-15 13:27:21.497843] [ps-1] [SEVERE] db.closed: false
[2025-07-15 13:27:21.497901] [ps-1] [SEVERE] db.connected: true
[2025-07-15 13:27:21.497911] [ps-1] [SEVERE] db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:21.498085] [ps-1] [SEVERE] db.getUploadQueueStats(): UploadQueueStats<count: 200 size: 3.90625kB>
[2025-07-15 13:27:21.498140] [ps-1] [SEVERE] Error exit: reason: uploadQueueStatsCount, code: 12
Unexpectedly, there are no network errors in the logs:
$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log
Divergent Final Reads
At the end of the test with:
- healed partitions
- no transactions
db.uploadQueueStats.count: 0
for all clientsdb.SyncStatus.downloading: false
for all clients- no errors
there can be divergent final reads:
# client 5 writes {0: 992}
[2025-07-17 04:07:26.390143] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 183, type: ok, value: [{f: readAll, v: {0: 487,...}}, {f: writeSome, v: {0: 992, ...}}]}
# client 5 reads {0: 992}
[2025-07-17 04:07:26.589324] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 184, type: ok, value: [{f: readAll, v: {0: 992, ...}}, {f: writeSome, v: {...}}]}
# client 5 uploads {0: 992}
[2025-07-17 04:07:36.402584] [ps-5] [FINER] uploadData: call: 48 txn: 184 patch: {0: 992}
# note: no other client ever observes {0: 992}
# insure network is healed
[2025-07-17 04:07:30.684918] [main] [INFO] nemesis: partition: current: none, hosts: {}
# quiet time, no more transactions
[2025-07-17 04:07:30.684950] [main] [INFO] quiesce for 3 seconds...
# wait for `db.uploadQueueStats.count: 0` for all clients
[2025-07-17 04:07:33.685707] [main] [INFO] wait for upload queue to be empty in clients
[2025-07-17 04:07:33.687003] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 0, type: ok, value: {f: uploadQueueCount, v: {db.uploadQueueStats.count: 0}}}
...
# wait for `db.SyncStatus.downloading: false` for all clients
[2025-07-17 04:07:36.693082] [main] [INFO] wait for downloading to be false in clients
[2025-07-17 04:07:36.693400] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 2, type: ok, value: {f: downloadingWait, v: {db.SyncStatus.downloading:: false}}}
# client 5 still reads {0: 992} for its final read
[2025-07-17 04:07:36.811193] [ps-5] [FINE] database api: response: {clientNum: 5, clientType: ps, f: api, id: 3, type: ok, value: {f: selectAll, v: {0: 992, ...}}}
# only client 5 and PostgreSQL read {0: 992} during final reads
# clients 1-4 all read a different value for key 0
[2025-07-17 04:07:36.811322] [main] [SEVERE] Divergent final reads!:
[2025-07-17 04:07:36.811342] [main] [SEVERE] PostgreSQL {k: v} for client diversions:
[2025-07-17 04:07:36.811409] [main] [SEVERE] pg: {0: 992, ...}
[2025-07-17 04:07:36.811441] [main] [SEVERE] ps-1 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811556] [main] [SEVERE] ps-1 {0: 777, ...}
[2025-07-17 04:07:36.811575] [main] [SEVERE] ps-2 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811619] [main] [SEVERE] ps-2 {0: 975, ...}
[2025-07-17 04:07:36.811625] [main] [SEVERE] ps-3 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811651] [main] [SEVERE] ps-3 {0: 929, ...}
[2025-07-17 04:07:36.811656] [main] [SEVERE] ps-4 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811683] [main] [SEVERE] ps-4 {0: 951, ...}
[2025-07-17 04:07:36.811722] [main] [SEVERE] :(
[2025-07-17 04:07:36.811736] [main] [SEVERE] Error exit: reason: strongConvergence, code: 1
Unexpectedly, there are no network errors in the logs:
$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log
Fuzzing with a test matrix of:
- 5 | 10 clients
- 10 | 20 | 30 | 40 transactions/second
- 100 | 200 | 300 second run times
will generate errors in ~30-40% of the tests.
Let me know if there's anything that can be changed in the test environment to provide more relevant or helpful information, or if the the tests can be simplified to be more usable.