fix(site-update): Make site update recovery resilient and recoverable#6729
fix(site-update): Make site update recovery resilient and recoverable#6729balamurali27 wants to merge 33 commits into
Conversation
When a recovery job (Recover Failed Site Update/Pull/Migrate) fails due to a transient database error like "MySQL server has gone away" or "Lost connection to MySQL server", the site update was immediately marked Fatal. Retry the recovery job up to MAX_RECOVERY_RETRIES times before giving up so that a transient blip doesn't strand the site in a Fatal state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When an operator manually runs Restore Tables to recover a site stuck in a Fatal site update, a successful restore now clears the site's fatal_site_update and marks the associated Site Update as Recovered, instead of leaving it Fatal forever. Also make restore_tables idempotent with respect to status_before_update so repeated attempts don't overwrite the originally captured status (e.g. with Broken), which would prevent the site from being reactivated on success. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restoring site tables during recovery runs heavy queries that can exceed the database server's max_statement_time on large sites, killing the restore with "max_statement_time exceeded" and leaving the site Broken. When a Restore Site Tables job fails for that reason, double the max_statement_time variable on the database server — a dynamic variable, so it applies without a restart — and retry the restore, up to MAX_STATEMENT_TIMEOUT_RETRIES times. Each additional attempt is recorded as a comment on the fatal Site Update being recovered so the escalation is auditable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A site update run with backups skipped can fail mid-migration, leaving the database partially migrated. There is no backup to restore from, so the site cannot be recovered automatically — the user has to fix it over SSH. Detect this case when building the agent job failure notification and replace the generic message with an actionable one that tells the user to connect to the bench over SSH and fix the site manually. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A recovery migrate runs heavy restore and migrate queries that can exceed the database server's max_statement_time on large sites and get killed, sending the update Fatal. Proactively increase max_statement_time by an hour before triggering the recovery migrate job so the recovery isn't timed out mid-query in the first place. Extract the bump into a shared Site.increase_max_statement_time() helper (incrementing by an hour rather than doubling) and reuse it from both the proactive path and the restore-tables timeout retry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The transient-error check only looked at Agent Job Step output, but the "MySQL server has gone away" / "Lost connection" message also surfaces in the recover job's own output/traceback. Check those too so a retryable recovery isn't sent straight to Fatal. Also mark the Site Update (and site) as Recovering when scheduling a retry, so the in-progress state is observable instead of lingering on the prior Failure status until the new recover job starts running. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
STATEMENT_TIME_INCREMENT is defined at the bottom of the module, but using it as a default argument value evaluates it when the class body runs at import time — before the constant exists — raising NameError on import. Default to None and resolve inside the method instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cover what runs when an update fails: automatic recovery to the source bench, the proactive max_statement_time bump before a migrate recovery, retrying on transient database errors, restoring tables after a fatal update (with the statement-timeout retry), and the skipped-backups case that requires manual SSH intervention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tests for the site update recovery behaviour: - recovery retries on transient DB errors and goes Fatal after max retries - restore_tables success reactivates the site and marks the update Recovered - increase_max_statement_time bumps the database server variable by an hour - a Restore Site Tables statement timeout retries after bumping the variable - a skipped-backups update failure surfaces an actionable SSH notification Add an ignore_validate flag to create_test_site_update so record-only fixtures don't need a full destination-bench setup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aed472a to
1a04a2d
Compare
Confidence Score: 3/5The recovery flow has correctness gaps that could leave site state inconsistent after a fallback restore; hold for fixes before merging. In press/press/doctype/site/site.py (process_restore_tables_job_update) and press/press/doctype/site_update/site_update.py (restore_tables_after_failed_recovery)
|
| Filename | Overview |
|---|---|
| press/press/doctype/site_update/site_update.py | Core recovery logic: adds max_statement_time bump/restore, transient-error fallback to Restore Site Tables, and skipped-backups fatal path. Several correctness edge cases around exception ordering and fallback triggering remain. |
| press/press/doctype/site/site.py | Adds database_size property, increase/set_max_statement_time helpers, restore_tables return value, and updated process_restore_tables_job_update. Removes the unconditional fatal_site_update clear, introducing an inconsistency if cleanup raises after site is Active. |
| press/press/doctype/agent_job/agent_job_notifications.py | Adds skipped-backups failure notification with SSH link; reformats existing message strings to single-line implicit concatenation. Logic is correct. |
| press/press/doctype/site_update/site_update.json | Adds previous_max_statement_time Int field; the field definition includes hide_days/hide_seconds which are Duration-field properties and have no effect on an Int field. |
| press/press/doctype/site_update/test_site_update.py | Comprehensive test coverage for all recovery paths: transient vs. non-transient, move-site guard, skipped-backups, max_statement_time stash/restore, and successful recovery. Well structured. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Update Site Migrate] -->|Failure| B{Skipped backups?}
B -->|Yes| C[Fatal — SSH notification]
B -->|No| D[Recover Failed Site Migrate]
D -->|Success| E[Recovered]
E --> F[restore_max_statement_time]
D -->|Failure| G{Move Site step succeeded?}
G -->|No| H[Fatal — no fallback]
G -->|Yes| I{Transient DB error?}
I -->|No| H
I -->|Yes| J[Restore Site Tables]
J -->|Success| K[Site Active, update stays Fatal]
J -->|Failure| L[Site Broken, fatal_site_update set]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[Update Site Migrate] -->|Failure| B{Skipped backups?}
B -->|Yes| C[Fatal — SSH notification]
B -->|No| D[Recover Failed Site Migrate]
D -->|Success| E[Recovered]
E --> F[restore_max_statement_time]
D -->|Failure| G{Move Site step succeeded?}
G -->|No| H[Fatal — no fallback]
G -->|Yes| I{Transient DB error?}
I -->|No| H
I -->|Yes| J[Restore Site Tables]
J -->|Success| K[Site Active, update stays Fatal]
J -->|Failure| L[Site Broken, fatal_site_update set]
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
press/press/doctype/site/site.py:5029-5038
**Unconditional `fatal_site_update` clear replaced by error-prone conditional**
The old code always ran `frappe.db.set_value("Site", job.site, "fatal_site_update", None)` after `reset_previous_status`. Now the clear depends on `set_cause_of_failure_is_resolved()` completing successfully. If `frappe.get_doc("Site Update", fatal_update)` raises (e.g. the SiteUpdate was deleted) or `restore_max_statement_time()` raises (e.g. the database server is unreachable), the site will be left `Active` (status already persisted by `reset_previous_status` → `self.save()`) with `fatal_site_update` still set and `cause_of_failure_is_resolved` still `0`. Adding a `try/finally` that calls `frappe.db.set_value("Site", job.site, "fatal_site_update", None)` — mirroring the old unconditional clear — would prevent this inconsistency.
Reviews (10): Last reviewed commit: "test(site-update): A transient update-jo..." | Re-trigger Greptile
| details["title"] = "Site update failed and cannot be recovered automatically" | ||
| details["message"] = f"""<p>The update for site <b>{job.site}</b> failed.</p> | ||
| <p>Because this update was run with backups skipped, there is no backup to restore from and the | ||
| site cannot be recovered automatically. The database may be left in a partially migrated state.</p> | ||
| <p>Please connect to the bench over SSH and fix the site manually.</p> | ||
| """ |
There was a problem hiding this comment.
User-visible strings not wrapped with
_() for i18n. details["title"] and all literal text in details["message"] should be translated per the project's internationalisation rules.
| details["title"] = "Site update failed and cannot be recovered automatically" | |
| details["message"] = f"""<p>The update for site <b>{job.site}</b> failed.</p> | |
| <p>Because this update was run with backups skipped, there is no backup to restore from and the | |
| site cannot be recovered automatically. The database may be left in a partially migrated state.</p> | |
| <p>Please connect to the bench over SSH and fix the site manually.</p> | |
| """ | |
| details["title"] = _("Site update failed and cannot be recovered automatically") | |
| details["message"] = f"""<p>{_("The update for site")} <b>{job.site}</b> {_("failed.")}</p> | |
| <p>{_("Because this update was run with backups skipped, there is no backup to restore from and the site cannot be recovered automatically. The database may be left in a partially migrated state.")}</p> | |
| <p>{_("Please connect to the bench over SSH and fix the site manually.")}</p> | |
| """ |
Context Used: This is a Frappe Framework application (Python bac... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/agent_job/agent_job_notifications.py
Line: 203-208
Comment:
User-visible strings not wrapped with `_()` for i18n. `details["title"]` and all literal text in `details["message"]` should be translated per the project's internationalisation rules.
```suggestion
details["title"] = _("Site update failed and cannot be recovered automatically")
details["message"] = f"""<p>{_("The update for site")} <b>{job.site}</b> {_("failed.")}</p>
<p>{_("Because this update was run with backups skipped, there is no backup to restore from and the site cannot be recovered automatically. The database may be left in a partially migrated state.")}</p>
<p>{_("Please connect to the bench over SSH and fix the site manually.")}</p>
"""
```
**Context Used:** This is a Frappe Framework application (Python bac... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| "creation": (">", fatal_update_creation), | ||
| }, | ||
| ) | ||
| if failed_attempts > MAX_STATEMENT_TIMEOUT_RETRIES: |
There was a problem hiding this comment.
Inconsistent retry-limit semantics vs
MAX_RECOVERY_RETRIES
> MAX_STATEMENT_TIMEOUT_RETRIES (strict greater-than) lets the function retry when failed_attempts is 1, 2, or 3, giving 3 retries and 4 total attempts before stopping. The symmetrical guard in should_retry_recovery uses < MAX_RECOVERY_RETRIES (strict less-than), which allows only 2 retries and 3 total attempts before going Fatal. Both constants equal 3, but the effective limits differ by one attempt. The docs for this method say "up to MAX_STATEMENT_TIMEOUT_RETRIES times" (i.e. 3 retries = 4 total), so the > comparison is internally correct — but future readers will naturally read both guards as symmetric and may accidentally normalise one to match the other, silently changing the intended limit.
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/site/site.py
Line: 1224
Comment:
**Inconsistent retry-limit semantics vs `MAX_RECOVERY_RETRIES`**
`> MAX_STATEMENT_TIMEOUT_RETRIES` (strict greater-than) lets the function retry when `failed_attempts` is 1, 2, **or 3**, giving 3 retries and 4 total attempts before stopping. The symmetrical guard in `should_retry_recovery` uses `< MAX_RECOVERY_RETRIES` (strict less-than), which allows only 2 retries and 3 total attempts before going Fatal. Both constants equal 3, but the effective limits differ by one attempt. The docs for this method say "up to `MAX_STATEMENT_TIMEOUT_RETRIES` times" (i.e. 3 retries = 4 total), so the `>` comparison is internally correct — but future readers will naturally read both guards as symmetric and may accidentally normalise one to match the other, silently changing the intended limit.
How can I resolve this? If you propose a fix, please make it concise.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #6729 +/- ##
============================================
- Coverage 90.90% 56.91% -34.00%
============================================
Files 117 994 +877
Lines 18093 83995 +65902
Branches 678 682 +4
============================================
+ Hits 16448 47803 +31355
- Misses 1610 36159 +34549
+ Partials 35 33 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
The proactive max_statement_time bump before a recovery migrate is only needed for large sites — small databases finish well within the statement timeout, so bumping the DB server variable for them is pointless churn. Gate the bump on Site.database_size, a new property exposing the latest Site Usage database size. Site Usage stores sizes in MB (not bytes, despite the unitless Int field), so LARGE_DATABASE_SIZE is 1024, not 1024**3. The dashboard confirms the unit: SiteOverview renders these via $format.bytes(v, 2, 2), where current=2 shifts the scale to start at MB. Added a Site Usage README documenting the MB-not-bytes gotcha. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Retrying the Restore Site Tables job on a statement timeout is enough on its own — the heavy queries that timed out get another full max_statement_time window each attempt. Bumping the DB server variable on top of that was redundant churn, so drop it and just retry. The proactive bump before a recovery migrate (Site.increase_max_statement_time) is unchanged; it's the only remaining caller. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The one-hour max_statement_time bump done proactively before a recovery migrate already gives heavy queries enough headroom, so the separate retry-on-timeout loop around Restore Site Tables was redundant. Remove it along with its helper, constants, and the call in process_restore_tables_job_update — on a restore failure the site simply stays Broken. Removes retry_restore_tables_after_statement_timeout, restore_tables_failed_due_to_statement_timeout, STATEMENT_TIMEOUT_ERROR, and MAX_STATEMENT_TIMEOUT_RETRIES. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The docs/code/testing set (index, mocking, best-practices) is a cleaner, reorganized version that covers everything guide-to-testing.md did, namely prerequisites, test-site setup, running tests, mocking, and rerunnability. This removes the duplicate and points AGENTS.md at the new docs. CLAUDE.md picks it up transitively via @AGENTS.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The recover_job link is overwritten with the latest recover job by trigger_recovery_job, so when recovery is retried on a transient DB error the previously failed recover jobs leave no trace on the Site Update — they were only findable by digging through the Agent Job list. Now retry_recovery adds a comment linking the failed recover job and noting the transient error before re-triggering, giving operators a visible trail of how many times recovery was retried. The recover_job field is cleared via db_set rather than frappe.db.set_value so the in-memory doc is updated too. Otherwise the recover_job guard in trigger_recovery_job short-circuits on the stale value and no fresh recover job gets created.
When a "Recover Failed Site Migrate" job fails, the site has already been moved back to the source bench but its tables are left half-restored. Re-running the recovery would fail at "Move Site" (the site directory is no longer on the destination bench, and the agent's move_site isn't idempotent), so instead trigger a single "Restore Site Tables" job to bring the site back up from its backup. Restore Site Tables keeps its existing behaviour and callback; the Site Update only carries a reference comment to the triggered job. On a successful restore the update stays Fatal but with cause_of_failure_is_resolved set (the update itself failed for good, but the site was recovered) — it is not flipped to Recovered. This also clears the site's fatal_site_update. Replaces the earlier 3x transient-error retry loop, which couldn't actually re-run the table restore. The max_statement_time bump for large databases is retained.
The skipped-backups update-failure notification told the user to fix the site over SSH but gave no pointer on how. Link the same SSH docs page the dashboard's SiteUpdateDialog uses (docs.frappe.io/cloud/benches/ssh), with the `underline` class so it renders as a visible link. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Drop the removed transient-retry section and MAX_RECOVERY_RETRIES, describe the one-shot table-restore fallback, and redraw the status lifecycle. Also fix a stale inline comment claiming the fallback marks the update Recovered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
A skipped-backups update has no backup to roll back to, so a failed update must go straight to Fatal with no recover or Restore Site Tables job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
A body should give context, not retell the diff. A couple of sentences on the why is enough. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
The set_cause_of_failure_is_resolved method already clears fatal_site_update, and the else branch only ran when it was already empty — a no-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Site("Site", name) over frappe.get_doc("Site", name), especially when the
controller lives in the same file — shorter and concretely typed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Per taste.md, prefer Site("Site", name) over frappe.get_doc in the
process_*_job_update callbacks, where the controller is in the same file.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
A comment that needs more is a sign the code should be clearer or the explanation belongs in a doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
The Restore Site Tables fallback assumes the failed migrate recovery already moved the site to the source bench. Guard on the recover job's Move Site step succeeding, so a recovery that fails at/before Move Site doesn't restore tables onto the wrong bench. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
| return ( | ||
| frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or 0 | ||
| ) |
There was a problem hiding this comment.
database_size silently returns 0 when no Site Usage row exists yet, causing the > LARGE_DATABASE_SIZE guard in trigger_recovery_job to be False for any site without usage history. A large site whose usage was never recorded will have max_statement_time skipped, re-exposing the exact timeout risk this feature aims to prevent. Returning None and handling the sentinel explicitly in the caller makes the "no data" case visible rather than silently treating it as "small".
| return ( | |
| frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or 0 | |
| ) | |
| return ( | |
| frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or None | |
| ) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/site/site.py
Line: 1184-1186
Comment:
`database_size` silently returns `0` when no Site Usage row exists yet, causing the `> LARGE_DATABASE_SIZE` guard in `trigger_recovery_job` to be `False` for any site without usage history. A large site whose usage was never recorded will have `max_statement_time` skipped, re-exposing the exact timeout risk this feature aims to prevent. Returning `None` and handling the sentinel explicitly in the caller makes the "no data" case visible rather than silently treating it as "small".
```suggestion
return (
frappe.db.get_value("Site Usage", {"site": self.name}, "database", order_by="creation desc") or None
)
```
How can I resolve this? If you propose a fix, please make it concise.The trigger_recovery_job method is already long; move the recovery-migrate timeout bump into bump_max_statement_time_before_recovery. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
The fix dialog renders messages with whitespace-pre-wrap, so source indentation and mid-paragraph line breaks showed as literal whitespace. Build every message via implicit string concatenation — one clean line per paragraph, matching the existing redis-unpack message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
e6456c5 to
8f309af
Compare
| def restore_tables_after_failed_recovery(failed_job: "AgentJob", site_update_name: str) -> None: | ||
| # The failed recovery already moved the site back, so re-running it would fail at the | ||
| # non-idempotent "Move Site"; just re-issue the leftover table restore (linked below). | ||
| site_update = frappe.get_doc("Site Update", site_update_name) | ||
| restore_job = frappe.get_doc("Site", site_update.site).restore_tables() | ||
| site_update.add_comment( | ||
| text=( | ||
| f"Recover job <a href='/app/agent-job/{failed_job.name}'>{failed_job.name}</a> failed; " | ||
| f"triggered <a href='/app/agent-job/{restore_job}'>Restore Site Tables</a> to recover the site." | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Silent exception swallows the restore-tables fallback
If restore_tables() raises (e.g. the agent is unreachable), the exception propagates out of process_update_site_recover_job_update. The site is already marked Broken with fatal_site_update set (the two frappe.db.set_value calls on lines 1214-1215 run before this function), but no Restore Site Tables job is created and no comment is added. The fallback silently disappears: the operator sees a broken site with no indication that an automatic restore was even attempted or failed. Wrapping the call in a try/except and logging (or adding a comment on failure) would make the failure visible and leave the site in a clearly actionable state.
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/site_update/site_update.py
Line: 1172-1182
Comment:
**Silent exception swallows the restore-tables fallback**
If `restore_tables()` raises (e.g. the agent is unreachable), the exception propagates out of `process_update_site_recover_job_update`. The site is already marked `Broken` with `fatal_site_update` set (the two `frappe.db.set_value` calls on lines 1214-1215 run before this function), but no `Restore Site Tables` job is created and no comment is added. The fallback silently disappears: the operator sees a broken site with no indication that an automatic restore was even attempted or failed. Wrapping the call in a try/except and logging (or adding a comment on failure) would make the failure visible and leave the site in a clearly actionable state.
How can I resolve this? If you propose a fix, please make it concise.Two related hardenings of the failed-migrate recovery path: - Only re-issue Restore Site Tables when the recovery failed due to a transient DB error (connection dropped mid-restore); other failures need manual attention, so leave the site Fatal. - The recovery-migrate max_statement_time bump was never undone, so it grew an hour on the database server every recovery. Stash the pre-bump value on the Site Update and restore it once recovery finishes — at the recover job's terminal state, or, when the fallback table restore runs, at its callback. Also extracts Site.set_max_statement_time and makes increase_max_statement_time take a plain default increment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Only bump max_statement_time before a recovery migrate for databases over 2 GB (was 1 GB); smaller databases finish well within the timeout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Update the recovery runbook and the Site Update README for the transient-DB- error condition on the table-restore fallback, the max_statement_time revert, and the 2 GB threshold. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
Load the Site Update found by recover_job as a full doc named site_update (lowercase doctype slug) instead of a separate `recovery` var, and rename the `restored` flag to `fallback_triggered`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
The early commits of this PR (roughly They built a full retry mechanism around recovery: up to So most of those starting commits are effectively superseded. The parts that carried through unchanged are the tests (the scaffolding/helpers in |
| # A failed recovery fallback ends here; put back any bumped max_statement_time. | ||
| if job.status == "Failure" and ( | ||
| fatal_update := frappe.db.get_value("Site", job.site, "fatal_site_update") | ||
| ): | ||
| frappe.get_doc("Site Update", fatal_update).restore_max_statement_time() |
There was a problem hiding this comment.
"Delivery Failure" skips restore_max_statement_time
The guard job.status == "Failure" misses "Delivery Failure". The pre-existing updated_status dict also lacks a "Delivery Failure" key (so it would KeyError before reaching this line today), but once that pre-existing mapping gap is fixed, restore_max_statement_time would still be silently skipped — leaving max_statement_time permanently bumped on the database server. Every analogous handler in this file (process_update_site_recover_job_update, etc.) handles "Delivery Failure" explicitly. Change to job.status in ("Failure", "Delivery Failure").
| # A failed recovery fallback ends here; put back any bumped max_statement_time. | |
| if job.status == "Failure" and ( | |
| fatal_update := frappe.db.get_value("Site", job.site, "fatal_site_update") | |
| ): | |
| frappe.get_doc("Site Update", fatal_update).restore_max_statement_time() | |
| # A failed recovery fallback ends here; put back any bumped max_statement_time. | |
| if job.status in ("Failure", "Delivery Failure") and ( | |
| fatal_update := frappe.db.get_value("Site", job.site, "fatal_site_update") | |
| ): | |
| frappe.get_doc("Site Update", fatal_update).restore_max_statement_time() |
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/site/site.py
Line: 5043-5047
Comment:
**`"Delivery Failure"` skips `restore_max_statement_time`**
The guard `job.status == "Failure"` misses `"Delivery Failure"`. The pre-existing `updated_status` dict also lacks a `"Delivery Failure"` key (so it would `KeyError` before reaching this line today), but once that pre-existing mapping gap is fixed, `restore_max_statement_time` would still be silently skipped — leaving `max_statement_time` permanently bumped on the database server. Every analogous handler in this file (`process_update_site_recover_job_update`, etc.) handles `"Delivery Failure"` explicitly. Change to `job.status in ("Failure", "Delivery Failure")`.
```suggestion
# A failed recovery fallback ends here; put back any bumped max_statement_time.
if job.status in ("Failure", "Delivery Failure") and (
fatal_update := frappe.db.get_value("Site", job.site, "fatal_site_update")
):
frappe.get_doc("Site Update", fatal_update).restore_max_statement_time()
```
How can I resolve this? If you propose a fix, please make it concise.The transient-error check keys off the recover job, not the failed update job. Cover that a transient error during the update, with a non-transient recovery failure, leaves the site Fatal without a Restore Site Tables job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RMz9YYdEHMUreMzDkRZDJp
| if updated_status == "Active": | ||
| frappe.get_doc("Site", job.site).reset_previous_status(fix_broken=True) | ||
| frappe.db.set_value("Site", job.site, "fatal_site_update", None) | ||
| site = Site("Site", job.site) | ||
| fatal_update = site.fatal_site_update | ||
| site.reset_previous_status(fix_broken=True) | ||
| if fatal_update: | ||
| # The site is back up, but the update itself failed for good. Keep it Fatal and | ||
| # just mark the cause resolved (this also clears the site's fatal_site_update). | ||
| site_update = frappe.get_doc("Site Update", fatal_update) | ||
| site_update.restore_max_statement_time() | ||
| site_update.set_cause_of_failure_is_resolved() |
There was a problem hiding this comment.
Unconditional
fatal_site_update clear replaced by error-prone conditional
The old code always ran frappe.db.set_value("Site", job.site, "fatal_site_update", None) after reset_previous_status. Now the clear depends on set_cause_of_failure_is_resolved() completing successfully. If frappe.get_doc("Site Update", fatal_update) raises (e.g. the SiteUpdate was deleted) or restore_max_statement_time() raises (e.g. the database server is unreachable), the site will be left Active (status already persisted by reset_previous_status → self.save()) with fatal_site_update still set and cause_of_failure_is_resolved still 0. Adding a try/finally that calls frappe.db.set_value("Site", job.site, "fatal_site_update", None) — mirroring the old unconditional clear — would prevent this inconsistency.
Prompt To Fix With AI
This is a comment left during a code review.
Path: press/press/doctype/site/site.py
Line: 5029-5038
Comment:
**Unconditional `fatal_site_update` clear replaced by error-prone conditional**
The old code always ran `frappe.db.set_value("Site", job.site, "fatal_site_update", None)` after `reset_previous_status`. Now the clear depends on `set_cause_of_failure_is_resolved()` completing successfully. If `frappe.get_doc("Site Update", fatal_update)` raises (e.g. the SiteUpdate was deleted) or `restore_max_statement_time()` raises (e.g. the database server is unreachable), the site will be left `Active` (status already persisted by `reset_previous_status` → `self.save()`) with `fatal_site_update` still set and `cause_of_failure_is_resolved` still `0`. Adding a `try/finally` that calls `frappe.db.set_value("Site", job.site, "fatal_site_update", None)` — mirroring the old unconditional clear — would prevent this inconsistency.
How can I resolve this? If you propose a fix, please make it concise.
Problem
When a site update (
Pull/Migrateonto a new bench) fails, recovery could end upFatalwith no automatic path back and no clear signal to the user:max_statement_timeand get killed mid-query, turning a recoverable failure into a fatal one.Brokeneven though its tables could still be restored from the backup. Re-running the full recovery doesn't work: the agent'smove_siteis not idempotent (the site has already been moved back), so it fails at "Move Site".Changes
max_statement_timebump before a recovery migrate (and revert after)LARGE_DATABASE_SIZE, 2 GB), bumpmax_statement_timeby an hour (dynamic MariaDB variable, no restart), recorded as a comment on the Site Update. Smaller databases finish well within the timeout and are skipped.previous_max_statement_time) and restored once recovery finishes — at the recover job's terminal state, or, when the fallback table restore runs, at that job's callback. Without this the value ratcheted up an hour on every recovery.One-shot
Restore Site Tablesfallback when a migrate recovery fails — but only when safePress makes one automatic attempt to bring the site back up, only when all hold:
Recover Failed Site Migrate;MySQL server has gone away/Lost connection to MySQL server, detected from the job output/traceback and step output). Other failures are genuine problems leftFatalfor manual attention.Since the failed recovery has already moved the site back, only the table restore is left undone, so Press re-issues just that job (linked on the Site Update as a comment for traceability).
On a successful fallback
Restore Site Tables, the update staysFatalThe site becomes
Activeagain and the fatal update's cause of failure is marked resolved (set_cause_of_failure_is_resolved, which also clears the site'sfatal_site_update). The update itself still failed and is recorded as such, rather than being silently flipped toRecovered.Actionable notification for skipped-backups failures
Replaces the generic message with one telling the user the site can't be recovered automatically, linking to the SSH docs to fix it manually. (Skipped-backups failures go straight to
Fatal— no recovery, no table restore, since there's no backup.)Notes
Site Usagestores sizes in MB, not bytes (despite unitlessIntfields) — so the 2 GB threshold is2048, not2048**3. Documented in a newSite UsageREADME.While here, all agent-job failure notification messages were flattened to single lines via implicit string concatenation — the fix dialog renders them with
whitespace-pre-wrap, so source indentation and mid-paragraph line breaks were showing as literal whitespace.Docs
docs/code/site-update/index.md— the recovery flow, status lifecycle, the transient-only restore fallback, themax_statement_timebump/revert, and the constants.press/press/doctype/site_update/README.md— Fatal-state behavior with the transient-error fallback.press/press/doctype/site_usage/README.md— the MB-not-bytes gotcha.guide-to-testing.mdin favour ofdocs/code/testing/.Tests
bench --site <site> run-tests --app press --module press.press.doctype.site_update.test_site_update— passing, including:Restore Site Tablesand, on success, leaves the updateFatalwith its cause of failure resolved and the siteActiveFatalwithfatal_site_updatesetFatal)Recovered)Fatalwith no recovery/restore, and surfaces the actionable SSH notificationmax_statement_timebump is stashed and restored to its pre-bump valueincrease_max_statement_timebumps the DB server variable by an hour;Site.database_sizereturns the latest Site Usage value (in MB)Also verified end-to-end on a local Vagrant Frappe Cloud: real
Update Site Migrate→Recover Failed Site Migrate(killed mid-restore) →Restore Site Tablessuccess, endingFatalwith cause resolved and the siteActive.🤖 Generated with Claude Code