[MongoDB Replication] Fix resumeTokens going back in time on busy change streams #301
Due to a bug in the NodeJS driver, in some cases the change stream may go back in time. In most cases it would only go back a short period and quickly catch up again, resulting in almost no symptoms. But if the change stream is never idle for more than 10 seconds at a time, it could go back multiple hours, resulting in significant replication lag while it attempts to catch up.
The symptoms are (1) replication lag increasing significantly despite no recent sync rule deploys, often in a big jump, and (2) repeatedly logging
Re-applied transaction ... - skipping checkpoint
until replication has caught up.

The issue is triggered by a "ResumableChangeStreamError". This could come from a replication fail-over event, a temporary network issue, or other scenarios; when it happens, the driver restarts the change stream with the wrong resume token. See the upstream driver issue for details: https://jira.mongodb.org/browse/NODE-7042
The workaround here is to detect when the issue happens by comparing the LSNs. When the issue is detected, we just throw an error, causing replication to be restarted with a new change stream starting at the last good resume token.
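A minimal sketch of that detection, assuming the LSN is derived from the change event's `clusterTime` and that the replication loop persists `stream.resumeToken` as the last good token; the names (`ChangeStreamWentBackInTimeError`, `watchWithBackwardsCheck`) are illustrative, not the actual PowerSync implementation:

```ts
import { MongoClient, Timestamp, ChangeStreamDocument } from 'mongodb';

// Hypothetical error type used to force the replication loop to tear down
// the current change stream and restart from the last good resume token.
class ChangeStreamWentBackInTimeError extends Error {}

async function watchWithBackwardsCheck(client: MongoClient, startAfter?: unknown) {
  const stream = client.db('mydb').watch([], { startAfter });

  // Highest clusterTime processed so far; stands in for the LSN comparison
  // described above (illustrative only).
  let lastClusterTime: Timestamp | undefined;

  try {
    for await (const event of stream as AsyncIterable<ChangeStreamDocument>) {
      const clusterTime = event.clusterTime;
      if (clusterTime != null) {
        if (lastClusterTime != null && clusterTime.lessThan(lastClusterTime)) {
          // The driver resumed from a stale token (NODE-7042). Throw so the
          // caller restarts replication with a new change stream from the
          // last good resume token it persisted.
          throw new ChangeStreamWentBackInTimeError(
            `change stream went back in time: ${clusterTime} < ${lastClusterTime}`
          );
        }
        lastClusterTime = clusterTime;
      }
      // ... apply the event, then persist stream.resumeToken as the
      // last good resume token ...
    }
  } finally {
    await stream.close();
  }
}
```

The caller is expected to catch the error and re-enter replication from the persisted resume token, rather than trusting the token the driver resumed with.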
To reproduce the issue, see the upstream bug report above. To reproduce it within PowerSync, run the same script while running PowerSync replication against the same source database. With this fix in place, it now results in logs like this: