Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments #2562

Merged
danielaskdd merged 3 commits into HKUDS:main from danielaskdd:fix-pg-timetout on Dec 31, 2025

Conversation

@danielaskdd
Collaborator

Fix: Enhance PostgreSQL Reconnection Tolerance for HA Deployments

Problem Statement

When deploying LightRAG with PostgreSQL in Kubernetes using high availability (HA) configurations (primary-replica replication), database switchovers caused service disruptions:

  • Large file uploads (>10MB) failed during PostgreSQL primary-replica switchover (~30 seconds)
  • UI displayed "Document fetch timeout" errors and required pod restarts to recover
  • Retry mechanism stuck at (1/3) because retry backoff caps were too restrictive for switchover duration
  • Files remained stuck in processing status after switchover events

Root Cause: Backend retry configuration had artificial caps (max 60s backoff, max 10 attempts) that prevented proper reconnection during 30-second database switchovers. Frontend timeout (30s) was also too aggressive for recovery scenarios.

Solution

Backend Changes (lightrag/kg/postgres_impl.py)

Removed artificial caps on retry configuration in ClientManager.get_config():

  • connection_retry_attempts: Max increased from 10 → 100
  • connection_retry_backoff: Max increased from 5.0s → 300.0s (5 minutes)
  • connection_retry_backoff_max: Max increased from 60.0s → 600.0s (10 minutes)

Updated Default Values (lightrag/kg/postgres_impl.py)

  • POSTGRES_CONNECTION_RETRIES: 3 → 10
  • POSTGRES_CONNECTION_RETRY_BACKOFF: 0.5s → 3.0s
  • POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 5.0s → 30.0s

New behavior: 10 retries with exponential backoff (3s, 6s, 12s, 24s, 30s×6) = ~225s total
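
To sanity-check that ~225s figure, here is a small illustrative Python snippet (a sketch of the arithmetic only, not the actual retry loop in lightrag/kg/postgres_impl.py, which may differ in detail):

# Illustrative only: reproduce the default HA retry schedule described above.
def backoff_schedule(retries: int, backoff: float, backoff_max: float) -> list[float]:
    # Exponential backoff: double the wait before each attempt, capped at backoff_max.
    return [min(backoff * (2 ** attempt), backoff_max) for attempt in range(retries)]

delays = backoff_schedule(retries=10, backoff=3.0, backoff_max=30.0)
print(delays)       # [3.0, 6.0, 12.0, 24.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0]
print(sum(delays))  # 225.0 -> roughly 225s of tolerated downtime

Assuming the same doubling schedule, the Option B values further down (backoff 15.0s, cap 60.0s, 10 retries) would tolerate roughly 525s of downtime.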

Frontend Changes (lightrag_webui/src/features/DocumentManager.tsx)

Implemented context-aware timeout strategy:

  • Default timeout: Remains 30s for normal operations (fast failure detection)
  • File upload: Uses 120s timeout (tolerates large file processing + DB switchover)
  • Document scan: Uses 90s timeout (tolerates backend processing + DB switchover)
  • Maintains intelligent polling and circuit breaker for error handling

Implementation: Enhanced handleIntelligentRefresh() with optional customTimeout parameter, allowing different operations to specify appropriate timeout values while keeping default operations fast.

Documentation (env.example)

  • Explained that the new defaults are optimized for HA deployments
  • Provided clear guidance for different scenarios
  • Maintained backward compatibility

🔧 How to Use

Option A: Use New Defaults (Recommended for most HA setups)

# No configuration needed - defaults now support ~30s switchover

Option B: Custom Configuration (Your specific case)

# k8s-deploy/lightrag/values.yaml
env:
  POSTGRES_CONNECTION_RETRIES: 10              # Matches the new default
  POSTGRES_CONNECTION_RETRY_BACKOFF: 15.0      # Longer initial wait between retries
  POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 60.0  # If switchover takes longer than ~30s

📊 Expected Behavior

Before Fix:

  • Backend retries are exhausted in ~3.5s (too fast for a ~30s HA switchover)
  • Frontend timeout after 30s regardless of operation type
  • Documents stuck in "processing" after switchover
  • Manual pod restart required
  • UI shows "Document fetch timeout" errors

After Fix:

  • Backend: System tolerates up to ~225s of downtime with intelligent retry (default)
  • Frontend: Context-aware timeouts
    • Normal operations: 30s (fast failure detection)
    • File uploads: 120s (tolerates processing + switchover)
    • Document scans: 90s (tolerates processing + switchover)
  • Automatic reconnection after switchover completes
  • Documents resume processing seamlessly
  • No manual intervention needed

🎯 Key Benefits

  1. Out-of-the-box HA support: No configuration needed for standard HA deployments
  2. Intelligent timeout strategy: Context-aware timeouts balance fast failure detection (30s) with operational tolerance (90-120s for critical operations)
  3. Backward compatible: Existing configurations keep working; non-HA deployments only see slower hard-failure reporting when the database is genuinely down (up to ~225s of retries instead of ~3.5s), an acceptable trade-off
  4. Production-ready defaults: Aligns with Kubernetes best practices
  5. User-configurable: Can still override for extreme cases
  6. Optimized user experience: Fast feedback for normal operations, seamless recovery for critical operations

Breaking Changes

None. All changes are backward compatible with existing configurations.

- Increase default retries and backoff
- Raise connection retry parameter caps
- Update env example with HA defaults
- Extend frontend timeouts for updates
- Update integration test limits
@danielaskdd
Collaborator Author

@codex review


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 44 to 48
"connection_retry_attempts": min(
10, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
100, int(os.getenv("POSTGRES_CONNECTION_RETRIES", "3"))
),
"connection_retry_backoff": min(
5.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))
300.0, float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "0.5"))


P2: Sync integration test defaults with retry config

The db_config fixture still seeds connection retries with the old defaults (3 attempts with 0.5s backoff) even though ClientManager.get_config now defaults to 10 attempts with a 3s backoff for HA tolerance. Running the integration suite without overriding env vars will continue exercising the legacy, non-HA settings and won’t detect regressions in the new default behavior described in this commit.
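
For illustration only, here is a minimal sketch of what aligning such a fixture could look like. The fixture name db_config is taken from the comment above; the keys and structure are assumptions, not the repository's actual test code:

# Hypothetical sketch: seed the integration-test retry settings from the same env
# vars and HA defaults now used by ClientManager.get_config() (10 attempts, 3s
# initial backoff, 30s cap).
import os

import pytest


@pytest.fixture
def db_config():
    return {
        "connection_retry_attempts": int(os.getenv("POSTGRES_CONNECTION_RETRIES", "10")),
        "connection_retry_backoff": float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF", "3.0")),
        "connection_retry_backoff_max": float(os.getenv("POSTGRES_CONNECTION_RETRY_BACKOFF_MAX", "30.0")),
    }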


- Increase retry count to 10
- Raise initial backoff to 3.0s
- Raise max backoff to 30.0s
- Remove obsolete test_env fixture
- Align tests with HA config
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🚀


danielaskdd merged commit 5590a20 into HKUDS:main on Dec 31, 2025
3 checks passed
danielaskdd deleted the fix-pg-timetout branch on January 15, 2026 11:49
cleo-ia added a commit to cleo-intelligence/LightRAG-MT that referenced this pull request Jan 16, 2026
Cherry-picked from HKUDS/LightRAG PR HKUDS#2562.

Backend changes:
- connection_retry_attempts: max 10 → 100, default 3 → 10
- connection_retry_backoff: max 5s → 300s, default 0.5s → 3s
- connection_retry_backoff_max: max 60s → 600s, default 5s → 30s

Frontend changes:
- Context-aware timeouts: default 30s, uploads 120s, scans 90s

This enables seamless recovery during ~30s PostgreSQL HA switchovers
without manual intervention.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>