fix(db-server): Database replica provisioning from a server snapshot #6741
Open
saurabh6790 wants to merge 4 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #6741 +/- ##
============================================
- Coverage 62.93% 50.74% -12.19%
============================================
Files 117 994 +877
Lines 18112 83827 +65715
Branches 527 526 -1
============================================
+ Hits 11398 42541 +31143
- Misses 6681 41253 +34572
Partials 33 33
…replica
Setting up a database replica from a server snapshot fails at "Restart
MariaDB Service After Adding skip-slave-start" with:
Unable to start service mariadb: Failed to start mariadb.service:
Unit mariadb.service is masked.
The replica's root volume comes from the standard database VMI, where
mariadb.service is masked. On a normal provision the masked unit gets
unmasked as a side effect of `apt install mariadb-server` in the MariaDB
upgrade role. But when the snapshot is already at the target version
(11.8), the whole mariadb_10_6_to_11_8 role is skipped by its `< 11.8.0`
guard, so the unit is never unmasked. The data volume was also mounted
with start_mariadb_after_mount=False (the replica must be prepared before
starting), so the first task that touches the service is the skip-slave-start
restart — which uses `name: mariadb` and hits the masked unit.
Add `masked: false` to the three `name: mariadb` restart tasks in the
mariadb_prepare_replica.yml chain (add_skip_slave_start, prepare_replica,
remove_skip_slave_start). The systemd module unmasks before restarting, so
each task is self-sufficient regardless of role order or whether the upgrade
role ran. The first restart now unmasks, starts and enables the unit, so the
"Wait For MariaDB To Be Ready" check in prepare_replica passes.
This is the second "skipped when the snapshot is already current" gap in the
Create Server replica flow; the first was the stale data-volume mount id
fixed in 115818b.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…plica
Provisioning a database replica from a server snapshot failed at the
"Configure Mariadb Replica" step with:
ValidationError: Cannot enable binlog auto purge for replication
configured servers
on_update forbids auto_purge_binlog_based_on_size on a replication
configured server. But every new Database Server defaults that flag on
(before_insert sets it in both branches, and Cluster.create_server sets it
too), and nothing clears it when the server later becomes a replica. So when
configure_replication() flips is_replication_setup to True and saves, the
guard rejects the save and provisioning breaks.
Clear auto_purge_binlog_based_on_size at the two points where a server
transitions to replication configured — configure_replication() and
_setup_secondary() — so the invariant the guard protects holds before the
save. The guard itself is left intact, so a user enabling auto purge on an
already-replication server is still rejected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…plication
configure_replication() assumed the preceding "Prepare Mariadb Replica"
press job step left MariaDB running. That holds on a clean run, but the
provisioning steps execute as separate jobs — retrying "Configure Mariadb
Replica" in isolation (or any race in the Prepare -> Configure handoff) can
hit a stopped server, surfacing as an opaque connection-refused deep inside
the agent:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server ...
[Errno 111] Connection refused")
Restart MariaDB before issuing replication commands by reusing
_restart_mariadb() (restart_mysql.yml). MariaDB's systemd unit is
Type=notify, so the play returns only once the server accepts connections,
making the Configure step self-sufficient regardless of prior state.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
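The stopped-server condition behind the `Connection refused` error above can be checked with a plain TCP probe before retrying the step. This is a diagnostic sketch only and not part of the change: the helper name `port_accepts_connections` is hypothetical, and the host and port to probe are assumptions to adapt to the deployment.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False
```

A `False` result on the database port means MariaDB is not accepting connections, which is the state the restart in `configure_replication()` is meant to rule out.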
Summary
Provisioning a database replica from a Consistent server snapshot
(`setup_db_replication`) failed at three successive steps of the
`Create Server` press job. Each failure shares one root theme: the create-server
flow assumes the upgrade / mount / default-config steps did setup work that gets
skipped, or conflicts, when the snapshot is already at the current version.
Fixing each unblocked the next, so all three are needed for the flow to complete.
1. `Upgrade Mariadb`: mount failed
2. `Restart MariaDB … skip-slave-start`: `mariadb.service` masked
3. `Configure Mariadb Replica`: `auto_purge_binlog` on a replica

Fix 1: Stale data-volume mount id (`Upgrade Mariadb`)

Symptom: the `Mount Volumes` playbook couldn't mount the data disk. The
`Database Server`'s mount pointed at `vol-0843daa3133…`, but the VM only had
`vol-046afa9acba5e91aa` (data) + root.

Cause: a snapshot-based server first boots with the VMI's default data
volume. During early provisioning, `VirtualMachine.sync()` →
`update_servers()` → `server.save()` → `validate()` → `validate_mounts()`
seeds the mount off that default volume and computes its AWS by-id source path.
`create_volume_from_snapshot` then deletes that volume and creates a fresh one
from the snapshot. The later `Sync Attached Volumes` step calls
`validate_mounts` again to reseed, but its `not self.mounts` guard makes it a
no-op, so the mount keeps the deleted volume id and its dead device path.

Change: in `validate_mounts`, skip seeding while a snapshot swap is pending
(`machine.data_disk_snapshot and not machine.data_disk_snapshot_attached`). The
default volume is about to be deleted; `Sync Attached Volumes` seeds correctly
once the snapshot volume is attached. Unit-tested.
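The pending-swap guard can be sketched in isolation. This is a simplified model under stated assumptions, not the actual press code: the `Machine` and `Server` classes, the `volume_ids` field, and the volume and snapshot ids are hypothetical stand-ins. Only `validate_mounts`, `mounts`, `data_disk_snapshot`, and `data_disk_snapshot_attached` are names taken from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Machine:
    # Hypothetical stand-in for the VirtualMachine document.
    volume_ids: List[str]
    data_disk_snapshot: Optional[str] = None
    data_disk_snapshot_attached: bool = False


@dataclass
class Server:
    # Hypothetical stand-in for the Database Server document.
    machine: Machine
    mounts: List[str] = field(default_factory=list)

    def validate_mounts(self) -> None:
        if self.mounts:
            return  # existing guard: never reseed once a mount exists
        m = self.machine
        if m.data_disk_snapshot and not m.data_disk_snapshot_attached:
            return  # swap pending: the current volume is about to be deleted
        if m.volume_ids:
            self.mounts.append(m.volume_ids[0])


# Early sync is skipped; the sync after attachment seeds the new volume.
machine = Machine(volume_ids=["vol-default"], data_disk_snapshot="snap-1")
server = Server(machine)
server.validate_mounts()
print(server.mounts)  # []

machine.volume_ids = ["vol-from-snapshot"]
machine.data_disk_snapshot_attached = True
server.validate_mounts()
print(server.mounts)  # ['vol-from-snapshot']
```

Because the early call leaves `mounts` empty, the existing `not self.mounts` guard no longer blocks the later reseed.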
Fix 2: Masked mariadb.service (`Restart MariaDB … skip-slave-start`)

Symptom: `Unit mariadb.service is masked.`

Cause: the replica's root volume comes from the standard database VMI, where
`mariadb.service` is masked. On a normal provision it's unmasked as a side effect
of `apt install mariadb-server` in the MariaDB upgrade role, but that whole role
is skipped when the snapshot is already at the target version (11.8). The data
volume was also mounted with `start_mariadb_after_mount=False` (the replica must
be prepared before starting), so the first task to touch the service is the
skip-slave-start restart, which uses `name: mariadb` and hits the masked unit.

Change: add `masked: false` to the three `name: mariadb` restart tasks in the
`mariadb_prepare_replica.yml` chain (`add_skip_slave_start`, `prepare_replica`,
`remove_skip_slave_start`). The systemd module unmasks before restarting, so each
task is self-sufficient regardless of role order or whether the upgrade role ran.
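The difference between the old and new restart tasks can be illustrated with a toy model. This is not the systemd or Ansible API: `FakeUnit` and `MaskedUnitError` are hypothetical names, and the model encodes only the behaviour stated above, namely that an explicit `masked: false` unmasks the unit before the restart runs.

```python
from typing import Optional


class MaskedUnitError(RuntimeError):
    """Raised when a masked unit is asked to start (toy model only)."""


class FakeUnit:
    """Toy model of a systemd unit; hypothetical, for illustration."""

    def __init__(self, masked: bool) -> None:
        self.masked = masked
        self.running = False

    def restart(self, masked: Optional[bool] = None) -> None:
        # An explicit masked=False unmasks first, then the restart proceeds.
        if masked is False:
            self.masked = False
        if self.masked:
            raise MaskedUnitError("Unit mariadb.service is masked.")
        self.running = True
```

Calling `FakeUnit(masked=True).restart()` raises, which mirrors the old tasks, while `restart(masked=False)` succeeds whether or not an earlier role unmasked the unit.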
Fix 3: Binlog auto purge on a replica (`Configure Mariadb Replica`)

Symptom: `ValidationError: Cannot enable binlog auto purge for replication configured servers`.

Cause: `on_update` forbids `auto_purge_binlog_based_on_size` on a replication
configured server, but every new Database Server defaults that flag on
(`before_insert` sets it in both branches, and `Cluster.create_server` too), and
nothing clears it when the server becomes a replica. When
`configure_replication()` flips `is_replication_setup` to True and saves, the
guard rejects the save.

Change: clear `auto_purge_binlog_based_on_size` at the two points where a
server transitions to replication configured (`configure_replication` and
`_setup_secondary`), so the invariant holds before the save. The guard is left
intact: a user enabling auto purge on an existing replica is still rejected.
Unit-tested.
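The interaction between the guard and the new clearing step can be sketched as follows. This is a minimal model, assuming a plain class in place of the real Frappe document: `DatabaseServer.save` and the local `ValidationError` are simplified stand-ins, and only the field and method names mentioned in the description come from the change.

```python
class ValidationError(Exception):
    """Local stand-in for the framework's validation error."""


class DatabaseServer:
    """Simplified stand-in for the Database Server document."""

    def __init__(self) -> None:
        self.auto_purge_binlog_based_on_size = True  # defaults on for new servers
        self.is_replication_setup = False

    def on_update(self) -> None:
        # The guard, which the change leaves intact.
        if self.is_replication_setup and self.auto_purge_binlog_based_on_size:
            raise ValidationError(
                "Cannot enable binlog auto purge for replication configured servers"
            )

    def save(self) -> None:
        self.on_update()

    def configure_replication(self) -> None:
        # Clear the flag before the transition so the guard's invariant holds.
        self.auto_purge_binlog_based_on_size = False
        self.is_replication_setup = True
        self.save()
```

Without the clearing line, `configure_replication()` would raise on a freshly created server. With it, the save passes, and re-enabling the flag on the resulting replica is still rejected by `on_update`.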
Testing
- `test_validate_mounts_seeds_snapshot_volume_not_doomed_default_volume`
- `test_configure_replication_disables_binlog_auto_purge_on_replica`
- `server` and `database_server` test modules green.
- The three restart tasks resolve to `masked=False`. Verify end-to-end by provisioning a replica from an already-11.8 snapshot.