Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments #2562

Merged
danielaskdd merged 3 commits into HKUDS:main from danielaskdd:fix-pg-timetout on Dec 31, 2025

Conversation

@danielaskdd
Collaborator

Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments

Problem Statement

When deploying LightRAG with PostgreSQL in Kubernetes using high availability (HA) configurations (primary-replica replication), database switchovers caused service disruptions:

  • Large file uploads (>10MB) failed during PostgreSQL primary-replica switchover (~30 seconds)
  • UI displayed "Document fetch timeout" errors and required pod restarts to recover
  • Retry mechanism stuck at (1/3) because retry backoff caps were too restrictive for switchover duration
  • Files remained stuck in processing status after switchover events

Root Cause: Backend retry configuration had artificial caps (max 60s backoff, max 10 attempts) that prevented proper reconnection during 30-second database switchovers. Frontend timeout (30s) was also too aggressive for recovery scenarios.

Solution

Backend Changes (lightrag/kg/postgres_impl.py)

Removed artificial caps on retry configuration in ClientManager.get_config():

  • connection_retry_attempts: Max increased from 10 → 100
  • connection_retry_backoff: Max increased from 5.0s → 300.0s (5 minutes)
  • connection_retry_backoff_max: Max increased from 60.0s → 600.0s (10 minutes)

Updated Default Values (lightrag/kg/postgres_impl.py)

  • POSTGRES_CONNECTION_RETRIES: 3 → 10
  • POSTGRES_CONNECTION_RETRY_BACKOFF: 0.5s → 3.0s
  • POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 5.0s → 30.0s

New behavior: 10 retries with exponential backoff (3s, 6s, 12s, 24s, 30s×6) = ~225s total
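
To sanity-check that ~225s figure, here is a small illustrative Python snippet (a sketch of the arithmetic only, not the actual retry loop in lightrag/kg/postgres_impl.py, which may differ in detail):

# Illustrative only: reproduce the default HA retry schedule described above.
def backoff_schedule(retries: int, backoff: float, backoff_max: float) -> list[float]:
    # Exponential backoff: double the wait before each attempt, capped at backoff_max.
    return [min(backoff * (2 ** attempt), backoff_max) for attempt in range(retries)]

delays = backoff_schedule(retries=10, backoff=3.0, backoff_max=30.0)
print(delays)       # [3.0, 6.0, 12.0, 24.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0]
print(sum(delays))  # 225.0 -> roughly 225s of tolerated downtime

Assuming the same doubling schedule, the Option B values further down (backoff 15.0s, cap 60.0s, 10 retries) would tolerate roughly 525s of downtime.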

Frontend Changes (lightrag_webui/src/features/DocumentManager.tsx)

Implemented context-aware timeout strategy:

  • Default timeout: Remains 30s for normal operations (fast failure detection)
  • File upload: Uses 120s timeout (tolerates large file processing + DB switchover)
  • Document scan: Uses 90s timeout (tolerates backend processing + DB switchover)
  • Maintains intelligent polling and circuit breaker for error handling

Implementation: Enhanced handleIntelligentRefresh() with optional customTimeout parameter, allowing different operations to specify appropriate timeout values while keeping default operations fast.

Documentation (env.example)

  • Explained that the new defaults are optimized for HA deployments
  • Provided clear guidance for different scenarios
  • Maintained backward compatibility

🔧 How to Use

Option A: Use New Defaults (Recommended for most HA setups)

# No configuration needed - defaults now support ~30s switchover

Option B: Custom Configuration (Your specific case)

# k8s-deploy/lightrag/values.yaml
env:
  POSTGRES_CONNECTION_RETRIES: 10              # Matches the new default
  POSTGRES_CONNECTION_RETRY_BACKOFF: 15.0      # Longer initial wait between retries
  POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 60.0  # If switchover takes longer than ~30s

📊 Expected Behavior

Before Fix:

  • Backend retries are exhausted in ~3.5s (too fast for a ~30s HA switchover)
  • Frontend timeout after 30s regardless of operation type
  • Documents stuck in "processing" after switchover
  • Manual pod restart required
  • UI shows "Document fetch timeout" errors

After Fix:

  • Backend: System tolerates up to ~225s of downtime with intelligent retry (default)
  • Frontend: Context-aware timeouts
    • Normal operations: 30s (fast failure detection)
    • File uploads: 120s (tolerates processing + switchover)
    • Document scans: 90s (tolerates processing + switchover)
  • Automatic reconnection after switchover completes
  • Documents resume processing seamlessly
  • No manual intervention needed

🎯 Key Benefits

  1. Out-of-the-box HA support: No configuration needed for standard HA deployments
  2. Intelligent timeout strategy: Context-aware timeouts balance fast failure detection (30s) with operational tolerance (90-120s for critical operations)
  3. Backward compatible: Existing configurations keep working; non-HA deployments only see slower hard-failure reporting when the database is genuinely down (up to ~225s of retries instead of ~3.5s), an acceptable trade-off
  4. Production-ready defaults: Aligns with Kubernetes best practices
  5. User-configurable: Can still override for extreme cases
  6. Optimized user experience: Fast feedback for normal operations, seamless recovery for critical operations

Breaking Changes

None. All changes are backward compatible with existing configurations.

- Increase default retries and backoff
- Raise connection retry parameter caps
- Update env example with HA defaults
- Extend frontend timeouts for updates
- Update integration test limits
@danielaskdd
Collaborator Author

@codex review


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 44 to 48
"connection_retry_attempts": min(
10, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
100, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
),
"connection_retry_backoff": min(
5.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))
300.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))


P2: Sync integration test defaults with retry config

The db_config fixture still seeds connection retries with the old defaults (3 attempts with 0.5s backoff) even though ClientManager.get_config now defaults to 10 attempts with a 3s backoff for HA tolerance. Running the integration suite without overriding env vars will continue exercising the legacy, non-HA settings and won’t detect regressions in the new default behavior described in this commit.
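
For illustration only, here is a minimal sketch of what aligning such a fixture could look like. The fixture name db_config is taken from the comment above; the keys and structure are assumptions, not the repository's actual test code:

# Hypothetical sketch: seed the integration-test retry settings from the same env
# vars and HA defaults now used by ClientManager.get_config() (10 attempts, 3s
# initial backoff, 30s cap).
import os

import pytest


@pytest.fixture
def db_config():
    return {
        "connection_retry_attempts": int(os.getenv("POSTGRES_CONNECTION_RETRIES", "10")),
        "connection_retry_backoff": float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "3.0")),
        "connection_retry_backoff_max": float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF_MAX", "30.0")),
    }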


- Increase retry count to 10
- Raise initial backoff to 3.0s
- Raise max backoff to 30.0s
- Remove obsolete test_env fixture
- Align tests with HA config
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🚀


danielaskdd merged commit 5590a20 into HKUDS:main on Dec 31, 2025
3 checks passed
danielaskdd deleted the fix-pg-timetout branch on January 15, 2026 11:49
cleo-ia added a commit to cleo-intelligence/LightRAG-MT that referenced this pull request Jan 16, 2026
Cherry-picked from HKUDS/LightRAG PR HKUDS#2562.

Backend changes:
- connection_retry_attempts: max 10 → 100, default 3 → 10
- connection_retry_backoff: max 5s → 300s, default 0.5s → 3s
- connection_retry_backoff_max: max 60s → 600s, default 5s → 30s

Frontend changes:
- Context-aware timeouts: default 30s, uploads 120s, scans 90s

This enables seamless recovery during ~30s PostgreSQL HA switchovers
without manual intervention.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>