[Bug]: PostgreSQL is not stable when uploading large documents in a k8s deployment #2561

@chikenGhost

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

I deployed Neo4j and PostgreSQL as databases using the files in ./k8s-deploy/. The PostgreSQL configuration is as follows:

# k8s-deploy/databases/postgresql/values.yaml
version: 16.4.0
mode: replication
replicas: 2
cpu: 4
memory: 8
storage: 5
terminationPolicy: Delete
componentSpecs:
  - componentName: postgresql
    probes:
      readinessProbe:
        failureThreshold: 10
        periodSeconds: 10
        timeoutSeconds: 5
    configuration:
      idle_in_transaction_session_timeout: "0" 
      max_connections: "200"
      statement_timeout: "0"

These parameters work fine during normal operation. However, when uploading larger files (>10MB), PG disconnects after a period of processing. The main UI then fails to display any file information and pops up the error "Failed to load documents Document fetch timeout". Once this happens, the only way to reconnect is a rollout restart of the pods. Furthermore, the interrupted file remains stuck in the processing status. This issue has occurred multiple times.

Monitoring shows that pod resources are more than sufficient, consuming less than 30m CPU and less than 1GB of memory. I have also checked both the PG logs and LightRAG pod logs; they indicate that the PG cluster underwent a successful primary-replica switchover before each disconnection. However, the PostgreSQL implementation in LightRAG does not seem to be able to reconnect correctly after the switchover.

I tried modifying the following parameters in values.yaml to increase tolerance for disconnections (in my case, the PG switchover takes about 30 seconds). However, the logs show that all retries remain stuck at (1/3). It seems the retry mechanism is not taking effect, or perhaps the reconnection was successful but the files are simply not displaying on the UI?

# k8s-deploy/lightrag/values.yaml

...

POSTGRES_CONNECTION_RETRIES: 3
POSTGRES_CONNECTION_RETRY_BACKOFF: 30
POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 30
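
To make the expected behavior of these three settings concrete, here is a minimal sketch of a capped exponential backoff retry loop. It is not LightRAG's actual implementation: the names `backoff_delays` and `run_with_retry` are hypothetical, and the assumption that the backoff doubles per attempt and is capped by the max value is mine, not taken from the LightRAG source.

```python
import asyncio


def backoff_delays(retries: int, backoff: float, backoff_max: float) -> list[float]:
    """Delay to wait after each failed attempt except the last (doubling, capped)."""
    return [min(backoff * (2 ** i), backoff_max) for i in range(retries - 1)]


async def run_with_retry(operation, retries, backoff, backoff_max,
                         transient=(OSError,), sleep=asyncio.sleep):
    """Await operation(); on a transient error, wait and try again.

    `operation` is any zero-argument coroutine function (hypothetical stand-in
    for a single database call). The last failure is re-raised unchanged.
    """
    delays = backoff_delays(retries, backoff, backoff_max)
    for attempt in range(1, retries + 1):
        try:
            return await operation()
        except transient as exc:
            if attempt == retries:
                raise
            print(f"transient connection issue on attempt {attempt}/{retries}: {exc!r}")
            await sleep(delays[attempt - 1])


# With the values from this report (assumed semantics): two waits of 30 s each.
print(backoff_delays(3, 30, 30))
```

Under these assumptions, a loop like this would log attempt 2/3 roughly 30 seconds after attempt 1/3 if the second attempt also failed. One possible reading of several 1/3 warnings close together is that they come from separate concurrent operations, each on its first attempt, but I have not confirmed that against the code.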

Incidentally, the Knowledge Graph page still displays correctly while PG is offline.

Steps to reproduce

After deploying LightRAG, PostgreSQL, and Neo4j to Kubernetes according to the documentation, upload large files >10MB.

Expected Behavior

Processing large files should not cause disconnections, or LightRAG should reconnect automatically after a disconnection.

LightRAG Config Used

# k8s-deploy/databases/postgresql/values.yaml
version: 16.4.0
mode: replication
replicas: 2
cpu: 4
memory: 8
storage: 5
terminationPolicy: Delete
componentSpecs:
  - componentName: postgresql
    probes:
      readinessProbe:
        failureThreshold: 10
        periodSeconds: 10
        timeoutSeconds: 5
    configuration:
      idle_in_transaction_session_timeout: "0" 
      max_connections: "200"
      statement_timeout: "0"

# k8s-deploy/lightrag/values.yaml
replicaCount: 1

image:
  repository: ghcr.io/hkuds/lightrag
  tag: latest
  imagePullSecrets: []
updateStrategy:
  type: Recreate

service:
  type: ClusterIP
  port: 9621

resources:
  requests:
    cpu: 2000m
    memory: 8Gi
  limits:
    cpu: 2000m
    memory: 8Gi

persistence:
  enabled: true
  ragStorage:
    size: 15Gi
  inputs:
    size: 5Gi

envFrom:
  configmaps: []
  secrets: []

env:
  HOST: 0.0.0.0
  PORT: 9621
  WEBUI_TITLE: Light RAG - test
  WEBUI_DESCRIPTION: Simple and Fast Graph Based RAG System
  WORKERS: 4
  SUMMARY_LANGUAGE: Traditional Chinese
  LLM_BINDING: openai
  LLM_MODEL: gpt-oss-120b
  EMBEDDING_BINDING: openai
  EMBEDDING_MODEL: jina-embeddings-v3
  EMBEDDING_DIM: 1024
  EMBEDDING_BINDING_HOST:
  EMBEDDING_BINDING_API_KEY:
  RERANK_MODEL: BGE-Reranker-V2-M3
  RERANK_BINDING: cohere
  RERANK_BINDING_HOST:
  RERANK_BINDING_API_KEY:
  LIGHTRAG_KV_STORAGE: PGKVStorage
  LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
  LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
  LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage

  POSTGRES_HOST: pg-cluster-postgresql-postgresql
  POSTGRES_PORT: 5432
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD:
  POSTGRES_DATABASE: postgres
  POSTGRES_WORKSPACE: default
  POSTGRES_CONNECTION_RETRIES: 3
  POSTGRES_CONNECTION_RETRY_BACKOFF: 30
  POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 30
  NEO4J_URI: neo4j://neo4j-cluster-neo4j:7687
  NEO4J_USERNAME: neo4j
  NEO4J_PASSWORD:

Logs and screenshots

Image
2025-12-31T11:16:25.228207108+08:00 INFO:  == LLM cache == saving: default:extract:87e47fdc0de8c8217b88e550e2ab8ea3
2025-12-31T11:16:35.834567870+08:00 INFO:  == LLM cache == saving: default:extract:5230eca2f38d5f0b5985a87a88b1e728
2025-12-31T11:16:35.939793474+08:00 INFO: Chunk 27 of 149 extracted 30 Ent + 21 Rel chunk-e277f5544c185b83295c49aeaee0e951
2025-12-31T11:16:58.853712145+08:00 INFO:  == LLM cache == saving: default:extract:b35f27bbae8e468b471068f48dca6c91
2025-12-31T11:17:05.751044513+08:00 INFO: Chunk 28 of 149 extracted 50 Ent + 38 Rel chunk-529fe60247bbb1766ddbe1f090053460
2025-12-31T11:17:14.506551293+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:14.509317848+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:15.215915547+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:19:51.997848913+08:00 INFO: 10.0.4.219:60882 - "GET /docs HTTP/1.1" 200
2025-12-31T11:19:52.109178765+08:00 INFO: 10.0.4.219:60902 - "GET /openapi.json HTTP/1.1" 200

PermissionError is only one of the errors observed; other error types also occur.
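
Since the exception type varies, a retry layer probably has to classify errors by family rather than by one exact class. The sketch below is only an illustration of that idea: `is_transient` and `TRANSIENT_ERRNOS` are hypothetical names, and the choice of errno values is my assumption (EPERM is included only because the log above labels it a transient connection issue).

```python
import errno

# Assumed set of OS-level error codes worth retrying during a switchover.
TRANSIENT_ERRNOS = {
    errno.EPERM,         # seen in the log above as PermissionError(1, ...)
    errno.ECONNRESET,
    errno.ECONNREFUSED,
    errno.EPIPE,
    errno.ETIMEDOUT,
    errno.EHOSTUNREACH,
}


def is_transient(exc: BaseException) -> bool:
    """Return True if exc looks like a connection problem that may clear on retry."""
    # Built-in connection and timeout families are retried regardless of errno.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    # Other OSError subclasses (PermissionError included) are checked by errno.
    if isinstance(exc, OSError):
        return exc.errno in TRANSIENT_ERRNOS
    return False


print(is_transient(PermissionError(1, "Operation not permitted")))
print(is_transient(ValueError("bad input")))
```

Driver-specific exceptions (for example, those raised by the PostgreSQL client library) are deliberately left out here, because I am not certain which ones appear during a switchover.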

Additional Information

  • LightRAG Version: v1.4.9.8/0251
  • Operating System: Ubuntu 22.04.5 LTS / k8s v1.33.6
  • Python Version: as shipped in the LightRAG container image
  • Related Issues:
