Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments#2562
danielaskdd merged 3 commits into HKUDS:main
- Increase default retries and backoff
- Raise connection retry parameter caps
- Update env example with HA defaults
- Extend frontend timeouts for updates
- Update integration test limits
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
      "connection_retry_attempts": min(
-         10, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
+         100, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
      ),
      "connection_retry_backoff": min(
-         5.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))
+         300.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))
Sync integration test defaults with retry config
The db_config fixture still seeds connection retries with the old defaults (3 attempts with 0.5s backoff) even though ClientManager.get_config now defaults to 10 attempts with a 3s backoff for HA tolerance. Running the integration suite without overriding env vars will continue exercising the legacy, non-HA settings and won’t detect regressions in the new default behavior described in this commit.
- Increase retry count to 10
- Raise initial backoff to 3.0s
- Raise max backoff to 30.0s
- Remove obsolete test_env fixture
- Align tests with HA config
@codex review
Codex Review: Didn't find any major issues. 🚀
Cherry-picked from HKUDS/LightRAG PR HKUDS#2562.
Backend changes:
- connection_retry_attempts: max 10 → 100, default 3 → 10
- connection_retry_backoff: max 5s → 300s, default 0.5s → 3s
- connection_retry_backoff_max: max 60s → 600s, default 5s → 30s
Frontend changes:
- Context-aware timeouts: default 30s, uploads 120s, scans 90s
This enables seamless recovery during ~30s PostgreSQL HA switchovers without manual intervention.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
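The caps and defaults listed above can be illustrated with a small sketch that follows the `min(cap, parse(os.getenv(...)))` pattern visible in the reviewed diff. This is not the actual `ClientManager.get_config()` code: the function name `read_retry_config` and the `env` parameter are hypothetical, the clamping of `POSTGRES_CONNECTION_RETRY_BACKOFF_MAX` is assumed to follow the same pattern as the other two settings, and any lower bounds the real code may enforce are omitted.

```python
import os
from typing import Mapping, Optional


def read_retry_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read retry settings and clamp each one to its upper cap.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return {
        # Default 10 attempts, capped at 100.
        "connection_retry_attempts": min(
            100, int(env.get("POSTGRES_CONNECTION_RETRIES", "10"))
        ),
        # Default initial backoff 3.0s, capped at 300.0s.
        "connection_retry_backoff": min(
            300.0, float(env.get("POSTGRES_CONNECTION_RETRY_BACKOFF", "3.0"))
        ),
        # Default maximum backoff 30.0s, capped at 600.0s (assumed same pattern).
        "connection_retry_backoff_max": min(
            600.0, float(env.get("POSTGRES_CONNECTION_RETRY_BACKOFF_MAX", "30.0"))
        ),
    }


if __name__ == "__main__":
    # With no overrides, the HA-oriented defaults apply.
    print(read_retry_config({}))
    # A value above the cap is clamped down to the cap.
    print(read_retry_config({"POSTGRES_CONNECTION_RETRIES": "500"}))
```

Passing a plain dictionary instead of the process environment keeps the sketch easy to exercise in a test without mutating `os.environ`.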
Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments
Problem Statement
When deploying LightRAG with PostgreSQL in Kubernetes using high availability (HA) configurations (primary-replica replication), database switchovers caused service disruptions:
Root Cause: The backend retry configuration had artificial caps (max 60s backoff, max 10 attempts) that prevented proper reconnection during 30-second database switchovers. The frontend timeout (30s) was also too aggressive for recovery scenarios.
Solution
Backend Changes (lightrag/kg/postgres_impl.py)
Raised the artificial caps on retry configuration in ClientManager.get_config():
- connection_retry_attempts: Max increased from 10 → 100
- connection_retry_backoff: Max increased from 5.0s → 300.0s (5 minutes)
- connection_retry_backoff_max: Max increased from 60.0s → 600.0s (10 minutes)
Updated Default Values (lightrag/kg/postgres_impl.py)
- POSTGRES_CONNECTION_RETRIES: 3 → 10
- POSTGRES_CONNECTION_RETRY_BACKOFF: 0.5s → 3.0s
- POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 5.0s → 30.0s
New behavior: 10 retries with exponential backoff (3s, 6s, 12s, 24s, 30s×6) = ~225s total ✅
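The ~225s figure can be checked with a short calculation. The sketch below assumes the delay doubles on each retry and is clamped to the configured maximum, with one delay per retry, which is what the schedule quoted in this description implies. The function name `backoff_schedule` is hypothetical and the code is not taken from `postgres_impl.py`.

```python
from typing import List


def backoff_schedule(retries: int, backoff: float, backoff_max: float) -> List[float]:
    """Return one delay per retry: backoff * 2**i, clamped to backoff_max."""
    return [min(backoff * 2 ** i, backoff_max) for i in range(retries)]


if __name__ == "__main__":
    # New defaults: 10 retries, 3.0s initial backoff, 30.0s maximum.
    new = backoff_schedule(10, 3.0, 30.0)
    print(new, sum(new))  # 3 + 6 + 12 + 24 + 6 * 30 = 225.0 seconds

    # Old defaults: 3 retries, 0.5s initial backoff, 5.0s maximum.
    old = backoff_schedule(3, 0.5, 5.0)
    print(old, sum(old))  # 0.5 + 1 + 2 = 3.5 seconds
```

Under the same assumptions, the old defaults gave up after about 3.5s of cumulative waiting, well short of a ~30s switchover, while the new defaults keep retrying for several minutes.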
Frontend Changes (lightrag_webui/src/features/DocumentManager.tsx)
Implemented a context-aware timeout strategy (default 30s, uploads 120s, scans 90s).
Implementation: Enhanced handleIntelligentRefresh() with an optional customTimeout parameter, allowing different operations to specify appropriate timeout values while keeping default operations fast.
Documentation (env.example)
🔧 How to Use
Option A: Use New Defaults (Recommended for most HA setups)
# No configuration needed - defaults now support ~30s switchover
Option B: Custom Configuration (Your specific case)
📊 Expected Behavior
Before Fix:
After Fix:
🎯 Key Benefits
Breaking Changes
None. All changes are backward compatible with existing configurations.