fix(db-server): Database replica provisioning from a server snapshot #6741

Open
saurabh6790 wants to merge 4 commits into frappe:develop from saurabh6790:develop

saurabh6790 (Member) commented Jun 18, 2026

Summary

Provisioning a database replica from a Consistent server snapshot
(setup_db_replication) failed at three successive steps of the Create Server
press job. All three failures share one root theme: the create-server flow
assumes that the upgrade, mount, and default-config steps performed setup work,
but that work is skipped (or conflicts) when the snapshot is already at the
current version. Fixing each failure unblocked the next, so all three fixes are
needed for the flow to complete.

| Step that failed | Cause | Fix |
|---|---|---|
| Upgrade Mariadb (mount failed) | Mount kept a deleted volume id | Refresh mount after snapshot swap |
| Restart MariaDB … skip-slave-start | mariadb.service masked | Unmask before restart |
| Configure Mariadb Replica | auto_purge_binlog on a replica | Clear flag when configuring replica |

Fix 1 — Stale data-volume mount id (Upgrade Mariadb)

Symptom: the Mount Volumes playbook couldn't mount the data disk. The
Database Server's mount pointed at vol-0843daa3133…, but the VM only had
vol-046afa9acba5e91aa (data) + root.

Cause: a snapshot-based server first boots with the VMI's default data
volume. During early provisioning, the call chain `VirtualMachine.sync()` →
`update_servers()` → `server.save()` → `validate()` → `validate_mounts()` seeds
the mount off that default volume and computes its AWS by-id source path.
`create_volume_from_snapshot` then deletes that volume and creates a fresh one
from the snapshot. The later Sync Attached Volumes step calls `validate_mounts`
again to reseed, but its `not self.mounts` guard makes it a no-op, so the mount
keeps the deleted volume id and its dead device path.

Change: in validate_mounts, skip seeding while a snapshot swap is pending
(machine.data_disk_snapshot and not machine.data_disk_snapshot_attached). The
default volume is about to be deleted; Sync Attached Volumes seeds correctly
once the snapshot volume is attached. Unit-tested.
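
A minimal sketch of the Fix 1 guard, for reviewers who want to see the two branches side by side. `Machine`, `Server`, and `volume_ids` are hypothetical stand-ins, not the real press doctypes; only `mounts`, `data_disk_snapshot`, and `data_disk_snapshot_attached` come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Machine:
    """Hypothetical stand-in for the VirtualMachine document."""
    data_disk_snapshot: Optional[str] = None
    data_disk_snapshot_attached: bool = False
    volume_ids: List[str] = field(default_factory=list)  # assumed field name


@dataclass
class Server:
    """Hypothetical stand-in for the server document."""
    machine: Machine
    mounts: List[str] = field(default_factory=list)

    def validate_mounts(self) -> None:
        # Existing behaviour: only seed when no mount has been recorded yet.
        if self.mounts:
            return
        machine = self.machine
        # New guard: the default volume is about to be replaced, so do not
        # seed a mount that would point at a soon-to-be-deleted volume.
        if machine.data_disk_snapshot and not machine.data_disk_snapshot_attached:
            return
        if machine.volume_ids:
            self.mounts.append(machine.volume_ids[0])
```

With the guard in place, a call made while the swap is pending leaves `mounts` empty, so the later call (after the snapshot volume is attached) is no longer blocked by the `not self.mounts` check.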

Fix 2 — Masked mariadb.service (Restart MariaDB … skip-slave-start)

Symptom: Unit mariadb.service is masked.

Cause: the replica's root volume comes from the standard database VMI, where
mariadb.service is masked. On a normal provision it's unmasked as a side effect
of apt install mariadb-server in the MariaDB upgrade role — but that whole role
is skipped when the snapshot is already at the target version (11.8). The data
volume was also mounted with start_mariadb_after_mount=False (the replica must
be prepared before starting), so the first task to touch the service is the
skip-slave-start restart, which uses name: mariadb and hits the masked unit.

Change: add masked: false to the three name: mariadb restart tasks in the
mariadb_prepare_replica.yml chain (add_skip_slave_start, prepare_replica,
remove_skip_slave_start). The systemd module unmasks before restarting, so each
task is self-sufficient regardless of role order or whether the upgrade role ran.

Fix 3 — Binlog auto purge on a replica (Configure Mariadb Replica)

Symptom: ValidationError: Cannot enable binlog auto purge for replication configured servers.

Cause: on_update forbids auto_purge_binlog_based_on_size on a replication
configured server, but every new Database Server defaults that flag on
(before_insert sets it in both branches, and Cluster.create_server too), and
nothing clears it when the server becomes a replica. When configure_replication()
flips is_replication_setup to True and saves, the guard rejects the save.

Change: clear auto_purge_binlog_based_on_size at the two points where a
server transitions to replication configured (configure_replication and
_setup_secondary), so the invariant holds before the save. The guard is left
intact — a user enabling auto purge on an existing replica is still rejected.
Unit-tested.
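
The interaction between the guard and the new clearing step can be sketched roughly as below. This is a simplified model, not the actual `DatabaseServer` code: the class, `save()`, and the local `ValidationError` are stand-ins, and only the two field names and the error message are taken from the description above.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Stand-in for the framework's validation error."""


@dataclass
class DatabaseServer:
    """Simplified model; the real doctype has many more fields."""
    is_replication_setup: bool = False
    auto_purge_binlog_based_on_size: bool = True  # new servers default this on

    def on_update(self) -> None:
        # The guard is unchanged: the two flags must never both be set.
        if self.is_replication_setup and self.auto_purge_binlog_based_on_size:
            raise ValidationError(
                "Cannot enable binlog auto purge for replication configured servers"
            )

    def save(self) -> None:
        self.on_update()

    def configure_replication(self) -> None:
        self.is_replication_setup = True
        # The fix: clear the flag at the transition so the guard passes.
        self.auto_purge_binlog_based_on_size = False
        self.save()
```

Calling `configure_replication()` on a fresh instance now saves cleanly, while re-enabling the flag on the resulting replica and saving still raises.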

Testing

  • Fix 1: test_validate_mounts_seeds_snapshot_volume_not_doomed_default_volume
  • Fix 3: test_configure_replication_disables_binlog_auto_purge_on_replica
  • server and database_server test modules green.
  • Fix 2 is an Ansible playbook change (no Python test layer); YAML validated, all
    three restart tasks resolve to masked=False. Verify end-to-end by provisioning
    a replica from an already-11.8 snapshot.
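
For the Fix 2 check, something along these lines could confirm that every restart task resolves to masked=False once the role files are parsed. It is only a sketch: it assumes the tasks are already loaded into Python dicts and that the module key is spelled `systemd` (it may be `ansible.builtin.systemd` in the actual roles), and the second task name is made up for illustration.

```python
def restart_tasks_missing_unmask(tasks, service="mariadb"):
    """Return names of restart tasks for `service` that do not set masked: false."""
    missing = []
    for task in tasks:
        systemd = task.get("systemd")  # assumed module key
        if not isinstance(systemd, dict):
            continue
        is_restart = systemd.get("name") == service and systemd.get("state") == "restarted"
        if is_restart and systemd.get("masked") is not False:
            missing.append(task.get("name", "<unnamed>"))
    return missing


# Hypothetical parsed tasks; YAML `masked: false` loads as Python False.
tasks = [
    {
        "name": "Restart MariaDB Service After Adding skip-slave-start",
        "systemd": {"name": "mariadb", "state": "restarted", "masked": False},
    },
    {
        "name": "Restart MariaDB (hypothetical task without the fix)",
        "systemd": {"name": "mariadb", "state": "restarted"},
    },
]
```

An empty result from `restart_tasks_missing_unmask` would mean every matching restart task carries the unmask setting.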

saurabh6790 requested a review from adityahase as a code owner June 18, 2026 12:22
greptile-apps Bot (Contributor) commented Jun 18, 2026

Confidence Score: 5/5

All three changes are narrowly scoped to the snapshot-based replica provisioning path; no existing production flows are affected.

Each fix addresses a well-understood, deterministic failure in the provisioning chain, is covered by unit tests, and the Ansible changes are purely additive (masked: false is a no-op when the unit is already unmasked). No security-sensitive code is touched and no ORM/SQL patterns are changed.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| press/press/doctype/server/server.py | Adds an early return to validate_mounts when a snapshot swap is in-flight; logically correct and well-tested. |
| press/press/doctype/database_server/database_server.py | Clears auto_purge_binlog_based_on_size at both replication-setup transition points and adds a defensive MariaDB restart before the agent call in configure_replication. |
| press/playbooks/roles/mariadb_add_skip_slave_start/tasks/main.yml | Adds masked: false to the MariaDB restart task; this is the first role in the chain so it does the actual unmask work for the whole sequence. |
| press/press/doctype/database_server/test_database_server.py | Adds two unit tests covering Fix 3: binlog purge flag cleared and MariaDB restart happens before agent call. |
| press/press/doctype/server/test_server.py | Adds a unit test for Fix 1 covering both the 'no seeding while pending' and 'correct seeding after attach' branches. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Job as Create Server Job
    participant VM as VirtualMachine
    participant Server as BaseServer.validate_mounts
    participant Ansible as Ansible Roles
    participant DB as DatabaseServer

    Note over Job,DB: Snapshot swap pending (data_disk_snapshot set, not attached)
    Job->>VM: create_volume_from_snapshot (deletes default vol)
    VM-->>Job: "snapshot volume attached, data_disk_snapshot_attached=True"
    Job->>Server: sync_attached_volumes
    Server->>Server: validate_mounts (now seeds correctly)

    Note over Ansible: mariadb.service may be masked on snapshot-provisioned server
    Job->>Ansible: mariadb_add_skip_slave_start
    Ansible->>Ansible: "systemd masked=false, state=restarted"
    Job->>Ansible: mariadb_prepare_replica
    Ansible->>Ansible: wait_for port 3306 (succeeds, already running)
    Ansible->>Ansible: "systemd masked=false, state=restarted"
    Job->>Ansible: mariadb_remove_skip_slave_start
    Ansible->>Ansible: "systemd masked=false, state=restarted"

    Note over DB: configure_replication
    Job->>DB: configure_replication()
    DB->>Ansible: _restart_mariadb() ensure running before agent call
    DB->>DB: agent.configure_replication(...)
    DB->>DB: "is_replication_setup=True, auto_purge_binlog_based_on_size=False"
    DB->>DB: save() on_update guard passes

Reviews (4): Last reviewed commit: "fix(database-server): Ensure MariaDB is ..."

codecov-commenter commented Jun 18, 2026

Codecov Report

❌ Patch coverage is 98.18182% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 50.74%. Comparing base (ab72943) to head (d4bfc36).
⚠️ Report is 8 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/press/doctype/database_server/database_server.py | 66.66% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #6741       +/-   ##
============================================
- Coverage    62.93%   50.74%   -12.19%     
============================================
  Files          117      994      +877     
  Lines        18112    83827    +65715     
  Branches       527      526        -1     
============================================
+ Hits         11398    42541    +31143     
- Misses        6681    41253    +34572     
  Partials        33       33               
| Flag | Coverage Δ | |
|---|---|---|
| dashboard | 62.90% <ø> (-0.03%) | ⬇️ |

…replica

Setting up a database replica from a server snapshot fails at "Restart
MariaDB Service After Adding skip-slave-start" with:

    Unable to start service mariadb: Failed to start mariadb.service:
    Unit mariadb.service is masked.

The replica's root volume comes from the standard database VMI, where
mariadb.service is masked. On a normal provision the masked unit gets
unmasked as a side effect of `apt install mariadb-server` in the MariaDB
upgrade role. But when the snapshot is already at the target version
(11.8), the whole mariadb_10_6_to_11_8 role is skipped by its `< 11.8.0`
guard, so the unit is never unmasked. The data volume was also mounted
with start_mariadb_after_mount=False (the replica must be prepared before
starting), so the first task that touches the service is the skip-slave-start
restart — which uses `name: mariadb` and hits the masked unit.

Add `masked: false` to the three `name: mariadb` restart tasks in the
mariadb_prepare_replica.yml chain (add_skip_slave_start, prepare_replica,
remove_skip_slave_start). The systemd module unmasks before restarting, so
each task is self-sufficient regardless of role order or whether the upgrade
role ran. The first restart now unmasks, starts and enables the unit, so the
"Wait For MariaDB To Be Ready" check in prepare_replica passes.

This is the second "skipped when the snapshot is already current" gap in the
Create Server replica flow; the first was the stale data-volume mount id
fixed in 115818b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…plica

Provisioning a database replica from a server snapshot failed at the
"Configure Mariadb Replica" step with:

    ValidationError: Cannot enable binlog auto purge for replication
    configured servers

on_update forbids auto_purge_binlog_based_on_size on a replication
configured server. But every new Database Server defaults that flag on
(before_insert sets it in both branches, and Cluster.create_server sets it
too), and nothing clears it when the server later becomes a replica. So when
configure_replication() flips is_replication_setup to True and saves, the
guard rejects the save and provisioning breaks.

Clear auto_purge_binlog_based_on_size at the two points where a server
transitions to replication configured — configure_replication() and
_setup_secondary() — so the invariant the guard protects holds before the
save. The guard itself is left intact, so a user enabling auto purge on an
already-replication server is still rejected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
saurabh6790 changed the title from "fix(server): Refresh data volume mount after snapshot swap during server creation" to "fix(database-server): Fix database replica provisioning from a server snapshot" Jun 18, 2026
saurabh6790 requested a review from tanmoysrt as a code owner June 18, 2026 12:47
…plication

configure_replication() assumed the preceding "Prepare Mariadb Replica"
press job step left MariaDB running. That holds on a clean run, but the
provisioning steps execute as separate jobs — retrying "Configure Mariadb
Replica" in isolation (or any race in the Prepare -> Configure handoff) can
hit a stopped server, surfacing as an opaque connection-refused deep inside
the agent:

    pymysql.err.OperationalError: (2003, "Can't connect to MySQL server ...
    [Errno 111] Connection refused")

Restart MariaDB before issuing replication commands by reusing
_restart_mariadb() (restart_mysql.yml). MariaDB's systemd unit is
Type=notify, so the play returns only once the server accepts connections,
making the Configure step self-sufficient regardless of prior state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
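
The ordering the commit message above describes (restart first, then the agent call) can be illustrated with a small mock-based sketch. The class below is a stand-in, not the real `DatabaseServer`; the injected `agent` and `restart_mariadb` callables are assumptions made so the call order can be recorded.

```python
from unittest.mock import Mock, call


class DatabaseServer:
    """Stand-in that models only the call order inside configure_replication."""

    def __init__(self, agent, restart_mariadb):
        self.agent = agent
        self._restart_mariadb = restart_mariadb
        self.is_replication_setup = False

    def configure_replication(self):
        # Make sure MariaDB is up before any replication command is issued.
        self._restart_mariadb()
        self.agent.configure_replication()
        self.is_replication_setup = True


# A single parent Mock records calls on its children in order.
recorder = Mock()
server = DatabaseServer(agent=recorder.agent, restart_mariadb=recorder.restart_mariadb)
server.configure_replication()
expected_order = [call.restart_mariadb(), call.agent.configure_replication()]
```

Comparing `recorder.mock_calls` with `expected_order` is one way a unit test could pin the restart-before-agent ordering.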
saurabh6790 changed the title from "fix(database-server): Fix database replica provisioning from a server snapshot" to "fix(database-server): Database replica provisioning from a server snapshot" Jun 18, 2026
saurabh6790 changed the title from "fix(database-server): Database replica provisioning from a server snapshot" to "fix(db-server): Database replica provisioning from a server snapshot" Jun 19, 2026