Description
Do you need to file an issue?
- I have searched the existing issues and this bug is not already filed.
- I believe this is a legitimate bug, not just a question or feature request.
Describe the bug
I deployed Neo4j and PostgreSQL as databases using the files in ./k8s-deploy/. The PostgreSQL configuration is as follows:
# k8s-deploy/databases/postgresql/values.yaml
version: 16.4.0
mode: replication
replicas: 2
cpu: 4
memory: 8
storage: 5
terminationPolicy: Delete
componentSpecs:
- componentName: postgresql
probes:
readinessProbe:
failureThreshold: 10
periodSeconds: 10
timeoutSeconds: 5
configuration:
idle_in_transaction_session_timeout: "0"
max_connections: "200"
statement_timeout: "0"
These parameters seem to work fine during normal operation. However, when I upload larger files (>10MB), PG disconnects after a period of processing. The main UI then fails to display any file information and shows "Error: Failed to load documents Document fetch timeout". Once this happens, the only way to reconnect is a rollout restart of the pods. Furthermore, the status of the interrupted file remains stuck in processing. This issue has occurred multiple times.
Monitoring shows that pod resources are more than sufficient, consuming less than 30m CPU and less than 1GB of memory. I have also checked both the PG logs and LightRAG pod logs; they indicate that the PG cluster underwent a successful primary-replica switchover before each disconnection. However, the PostgreSQL implementation in LightRAG does not seem to be able to reconnect correctly after the switchover.
I tried modifying the following parameters in values.yaml to increase tolerance for disconnections (in my case, the PG switchover takes about 30 seconds). However, every retry warning in the logs shows attempt 1/3; attempts 2/3 and 3/3 never appear. It seems the retry mechanism is not taking effect, or perhaps the reconnection succeeded but the files are simply not displayed in the UI?
# k8s-deploy/lightrag/values.yaml
...
POSTGRES_CONNECTION_RETRIES: 3
POSTGRES_CONNECTION_RETRY_BACKOFF: 30
POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 30
Incidentally, the Knowledge Graph page still displays correctly when PG goes offline.
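For reference, the retry budget implied by these three settings can be worked out with a short sketch. It assumes an exponential backoff that doubles from the base delay and is capped at the maximum, and that the retry count means total attempts. Both points are assumptions about how the settings are interpreted, not confirmed LightRAG behaviour, and `backoff_schedule` is a hypothetical helper.

```python
def backoff_schedule(retries: int, base: float, cap: float) -> list[float]:
    """Delays slept between attempts: base, 2*base, 4*base, ... capped at `cap`.

    Assumption: `retries` counts total attempts, so there are retries - 1 sleeps.
    """
    return [min(base * (2 ** i), cap) for i in range(max(retries - 1, 0))]


# Values taken from k8s-deploy/lightrag/values.yaml.
delays = backoff_schedule(retries=3, base=30, cap=30)
print(delays, sum(delays))  # prints: [30, 30] 60
```

Under those assumptions the settings allow two sleeps of 30 seconds each, which should outlast a switchover of about 30 seconds, provided attempts 2 and 3 are actually reached.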
Steps to reproduce
After deploying LightRAG, PostgreSQL, and Neo4j to Kubernetes according to the documentation, upload a large file (>10MB).
Expected Behavior
Processing large files should not cause disconnections. If a disconnection does happen, LightRAG should reconnect to PostgreSQL automatically.
LightRAG Config Used
# k8s-deploy/databases/postgresql/values.yaml
version: 16.4.0
mode: replication
replicas: 2
cpu: 4
memory: 8
storage: 5
terminationPolicy: Delete
componentSpecs:
- componentName: postgresql
probes:
readinessProbe:
failureThreshold: 10
periodSeconds: 10
timeoutSeconds: 5
configuration:
idle_in_transaction_session_timeout: "0"
max_connections: "200"
statement_timeout: "0"
# k8s-deploy/lightrag/values.yaml
replicaCount: 1
image:
  repository: ghcr.io/hkuds/lightrag
  tag: latest
imagePullSecrets: []
updateStrategy:
  type: Recreate
service:
  type: ClusterIP
  port: 9621
resources:
  requests:
    cpu: 2000m
    memory: 8Gi
  limits:
    cpu: 2000m
    memory: 8Gi
persistence:
  enabled: true
  ragStorage:
    size: 15Gi
  inputs:
    size: 5Gi
envFrom:
  configmaps: []
  secrets: []
env:
  HOST: 0.0.0.0
  PORT: 9621
  WEBUI_TITLE: Light RAG - test
  WEBUI_DESCRIPTION: Simple and Fast Graph Based RAG System
  WORKERS: 4
  SUMMARY_LANGUAGE: Traditional Chinese
  LLM_BINDING: openai
  LLM_MODEL: gpt-oss-120b
  EMBEDDING_BINDING: openai
  EMBEDDING_MODEL: jina-embeddings-v3
  EMBEDDING_DIM: 1024
  EMBEDDING_BINDING_HOST:
  EMBEDDING_BINDING_API_KEY:
  RERANK_MODEL: BGE-Reranker-V2-M3
  RERANK_BINDING: cohere
  RERANK_BINDING_HOST:
  RERANK_BINDING_API_KEY:
  LIGHTRAG_KV_STORAGE: PGKVStorage
  LIGHTRAG_VECTOR_STORAGE: PGVectorStorage
  LIGHTRAG_GRAPH_STORAGE: Neo4JStorage
  LIGHTRAG_DOC_STATUS_STORAGE: PGDocStatusStorage
  POSTGRES_HOST: pg-cluster-postgresql-postgresql
  POSTGRES_PORT: 5432
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD:
  POSTGRES_DATABASE: postgres
  POSTGRES_WORKSPACE: default
  POSTGRES_CONNECTION_RETRIES: 3
  POSTGRES_CONNECTION_RETRY_BACKOFF: 30
  POSTGRES_CONNECTION_RETRY_BACKOFF_MAX: 30
  NEO4J_URI: neo4j://neo4j-cluster-neo4j:7687
  NEO4J_USERNAME: neo4j
  NEO4J_PASSWORD:
Logs and screenshots
2025-12-31T11:16:25.228207108+08:00 INFO: == LLM cache == saving: default:extract:87e47fdc0de8c8217b88e550e2ab8ea3
2025-12-31T11:16:35.834567870+08:00 INFO: == LLM cache == saving: default:extract:5230eca2f38d5f0b5985a87a88b1e728
2025-12-31T11:16:35.939793474+08:00 INFO: Chunk 27 of 149 extracted 30 Ent + 21 Rel chunk-e277f5544c185b83295c49aeaee0e951
2025-12-31T11:16:58.853712145+08:00 INFO: == LLM cache == saving: default:extract:b35f27bbae8e468b471068f48dca6c91
2025-12-31T11:17:05.751044513+08:00 INFO: Chunk 28 of 149 extracted 50 Ent + 38 Rel chunk-529fe60247bbb1766ddbe1f090053460
2025-12-31T11:17:14.506551293+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:14.509317848+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:15.215915547+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:19:51.997848913+08:00 INFO: 10.0.4.219:60882 - "GET /docs HTTP/1.1" 200
2025-12-31T11:19:52.109178765+08:00 INFO: 10.0.4.219:60902 - "GET /openapi.json HTTP/1.1" 200
PermissionError is just one type of error; other error types may also occur.
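To check whether any retry ever gets past the first attempt, the warnings can be tallied with a small script. This is only a sketch: `summarize_retries` is a hypothetical helper, and the regular expression assumes the warning format shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches e.g. "... on attempt 1/3: PermissionError(1, 'Operation not permitted')"
ATTEMPT_RE = re.compile(
    r"PostgreSQL transient connection issue on attempt (\d+)/(\d+): (\w+)"
)


def summarize_retries(log_text: str) -> Counter:
    """Count (attempt, max_attempts, error_type) triples found in the log text."""
    counts: Counter = Counter()
    for match in ATTEMPT_RE.finditer(log_text):
        attempt, total, error = match.groups()
        counts[(int(attempt), int(total), error)] += 1
    return counts


sample = """\
2025-12-31T11:17:14.506551293+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:14.509317848+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
2025-12-31T11:17:15.215915547+08:00 WARNING: PostgreSQL transient connection issue on attempt 1/3: PermissionError(1, 'Operation not permitted')
"""

print(summarize_retries(sample))
# prints: Counter({(1, 3, 'PermissionError'): 3})
```

For the excerpt above, all three warnings are attempt 1 of 3, with no entry for attempt 2 or 3. That matches the observation that the retries never advance.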
Additional Information
- LightRAG Version: v1.4.9.8/0251
- Operating System: Ubuntu 22.04.5 LTS / k8s v1.33.6
- Python Version: as shipped in the container image
- Related Issues: