Skip to content

network partitioning and consistency/correctness #307

@nurturenature

Description

@nurturenature

Hi,

Transient network partitions can disrupt replication.

Most commonly, the client stops uploading, just indefinitely queuing local writes.

In other cases, less frequently, the tests complete without any indication of errors, local writes are uploaded to PostgreSQL, but only replicated to 0 or a subset of clients.

There's no obvious differences between the Dart and Rust implementations.

Partition behavior:

  • repeatedly
    • 0 to all client hosts are randomly partitioned from the PowerSync sync service host
      • the client network failure is randomly selected to be none, inbound, outbound, or all traffic
    • wait a random interval, 0-10s
    • heal network partition
    • wait a random interval, 0-10s

At the end of the test:

  • insure network is healed for all clients
  • quiesce for 3s, i.e. no further transactions
  • wait for db.uploadQueueStats.count: 0 for all clients
  • wait for db.SyncStatus.downloading: false for all clients
  • do a final read on all clients and check for Strong Convergence

Without partitioning, all test runs are successful.


Successful Tests

During successful runs you can see:

# partition clients
[2025-07-20 02:18:23.222360] [main] [INFO] nemesis: partition: starting: outbound
[2025-07-20 02:18:23.236820] [main] [INFO] nemesis: partition: current: outbound, hosts: {powersync}

# clients continue with local transactions
[2025-07-20 02:18:23.254476] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 15, type: ok, value: [{f: readAll, v: {0: -1, 1: 42, 2: 21, ...}}, {f: writeSome, v: {16: 62, 53: 62, 50: 62, 33: 62}}]}

# partition is healed
[2025-07-20 02:18:31.337347] [main] [INFO] nemesis: partition: starting: none
[2025-07-20 02:18:31.363529] [main] [INFO] nemesis: partition: current: none, hosts: {}

# clients catchup with downloads from sync service
[2025-07-20 02:18:36.427705] [ps-1] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 4 / 8) uploading: false lastSyncedAt: 2025-07-20 02:18:23.000, hasSynced: true, error: null>
...
[2025-07-20 02:18:36.439118] [ps-1] [FINEST] SyncStatus<connected: true connecting: false downloading: true (progress: for total: 20 / 20) uploading: true lastSyncedAt: 2025-07-20 02:18:23.000, hasSynced: true, error: null>

# clients upload to sync service
[2025-07-20 02:18:36.439257] [ps-1] [FINER] uploadData: call: 17 begin with UploadQueueStats.count: 108

# local transactions, downloading, uploading go on as normal
...

# at the end of the test
[2025-07-20 02:20:09.373311] [main] [INFO] wait for upload queue to be empty in clients
[2025-07-20 02:20:09.375017] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 0, type: ok, value: {f: uploadQueueCount, v: {db.uploadQueueStats.count: 0}}}
...

[2025-07-20 02:20:09.379592] [main] [INFO] wait for downloading to be false in clients
[2025-07-20 02:20:09.379765] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 2, type: ok, value: {f: downloadingWait, v: {db.SyncStatus.downloading:: false}}}
...

[2025-07-20 02:20:09.380379] [main] [INFO] check for strong convergence in final reads
[2025-07-20 02:20:09.505932] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 3, type: ok, value: {f: selectAll, v: {0: 938, 1: 964, 2: 989, ...}}}
...

[2025-07-20 02:20:09.528708] [main] [INFO] Success!
[2025-07-20 02:20:09.528714] [main] [INFO] Final reads had Strong Convergence
[2025-07-20 02:20:09.528717] [main] [INFO] All transactions were Atomic and Causally Consistent

As expected, network errors exist in sync service logs:

$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log 
{"cause":{"code":"ECONNREFUSED","message":"request to http://localhost:8089/api/auth/keys failed, reason: connect ECONNREFUSED 127.0.0.1:8089","name":"FetchError"},...}

Wedged/Stalled Client Uploads

At the end of the test with:

  • healed partitions
  • no transactions

uploading remains true, but the upload queue never clears:

  • note lack of updated SyncStatus, or other messages
  • db.UploadQueueStats.count is wedged/stuck, never changes
  • we wait 11 seconds with no upload activity before ending the test, but the wait can be indefinite. i.e. client never recovers
[2025-07-15 13:27:10.486902] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:10.486917] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:11.487930] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:11.487946] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:12.488792] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:12.488834] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:13.489750] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:13.489785] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:14.490856] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:14.490984] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:15.491745] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:15.491817] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:16.492777] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:16.492787] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:17.493804] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:17.493867] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:18.494741] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:18.494804] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:19.495766] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:19.495905] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:20.496789] [ps-1] [FINE] database api: uploadQueueWait: waiting on UploadQueueStats.count: 200 ...
[2025-07-15 13:27:20.496848] [ps-1] [FINE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:21.497830] [ps-1] [SEVERE] UploadQueueStats.count appears to be stuck at 200 after waiting for 11s
[2025-07-15 13:27:21.497843] [ps-1] [SEVERE] 	db.closed: false
[2025-07-15 13:27:21.497901] [ps-1] [SEVERE] 	db.connected: true
[2025-07-15 13:27:21.497911] [ps-1] [SEVERE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: false (progress: null) uploading: true lastSyncedAt: 2025-07-15 13:26:32.000, hasSynced: true, error: null>
[2025-07-15 13:27:21.498085] [ps-1] [SEVERE] 	db.getUploadQueueStats(): UploadQueueStats<count: 200 size: 3.90625kB>
[2025-07-15 13:27:21.498140] [ps-1] [SEVERE] Error exit: reason: uploadQueueStatsCount, code: 12

Unexpectedly, there are no network errors in the logs:

$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log

Divergent Final Reads

At the end of the test with:

  • healed partitions
  • no transactions
  • db.uploadQueueStats.count: 0 for all clients
  • db.SyncStatus.downloading: false for all clients
  • no errors

there can be divergent final reads:

# client 5 writes {0: 992}
[2025-07-17 04:07:26.390143] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 183, type: ok, value: [{f: readAll, v: {0: 487,...}}, {f: writeSome, v: {0: 992, ...}}]}

# client 5 reads {0: 992}
[2025-07-17 04:07:26.589324] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 184, type: ok, value: [{f: readAll, v: {0: 992, ...}}, {f: writeSome, v: {...}}]}

# client 5 uploads {0: 992}
[2025-07-17 04:07:36.402584] [ps-5] [FINER] uploadData: call: 48 txn: 184 patch: {0: 992} 

# note: no other client ever observes {0: 992}

# insure network is healed
[2025-07-17 04:07:30.684918] [main] [INFO] nemesis: partition: current: none, hosts: {}

# quiet time, no more transactions
[2025-07-17 04:07:30.684950] [main] [INFO] quiesce for 3 seconds...

# wait for `db.uploadQueueStats.count: 0` for all clients
[2025-07-17 04:07:33.685707] [main] [INFO] wait for upload queue to be empty in clients
[2025-07-17 04:07:33.687003] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 0, type: ok, value: {f: uploadQueueCount, v: {db.uploadQueueStats.count: 0}}}
...

# wait for `db.SyncStatus.downloading: false` for all clients
[2025-07-17 04:07:36.693082] [main] [INFO] wait for downloading to be false in clients
[2025-07-17 04:07:36.693400] [ps-1] [FINE] database api: response: {clientNum: 1, clientType: ps, f: api, id: 2, type: ok, value: {f: downloadingWait, v: {db.SyncStatus.downloading:: false}}}

# client 5 still reads {0: 992} for its final read
[2025-07-17 04:07:36.811193] [ps-5] [FINE] database api: response: {clientNum: 5, clientType: ps, f: api, id: 3, type: ok, value: {f: selectAll, v: {0: 992, ...}}}

# only client 5 and PostgreSQL read {0: 992} during final reads
# clients 1-4 all read a different value for key 0
[2025-07-17 04:07:36.811322] [main] [SEVERE] Divergent final reads!:
[2025-07-17 04:07:36.811342] [main] [SEVERE] PostgreSQL {k: v} for client diversions:
[2025-07-17 04:07:36.811409] [main] [SEVERE] pg: {0: 992, ...}
[2025-07-17 04:07:36.811441] [main] [SEVERE] ps-1 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811556] [main] [SEVERE] ps-1 {0: 777, ...}
[2025-07-17 04:07:36.811575] [main] [SEVERE] ps-2 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811619] [main] [SEVERE] ps-2 {0: 975, ...}
[2025-07-17 04:07:36.811625] [main] [SEVERE] ps-3 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811651] [main] [SEVERE] ps-3 {0: 929, ...}
[2025-07-17 04:07:36.811656] [main] [SEVERE] ps-4 {k: v} that diverged from PostgreSQL
[2025-07-17 04:07:36.811683] [main] [SEVERE] ps-4 {0: 951, ...}
[2025-07-17 04:07:36.811722] [main] [SEVERE] :(
[2025-07-17 04:07:36.811736] [main] [SEVERE] Error exit: reason: strongConvergence, code: 1

Unexpectedly, there are no network errors in the logs:

$ grep -P 'error: (?!null)' powersync_fuzz.log
$ grep -i 'err' powersync.log

Fuzzing with a test matrix of:

  • 5 | 10 clients
  • 10 | 20 | 30 | 40 transactions/second
  • 100 | 200 | 300 second run times

will generate errors in ~30-40% of the tests.

Let me know if there's anything that can be changed in the test environment to provide more relevant or helpful information, or if the the tests can be simplified to be more usable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions