Refact: Limit Vector Database Metadata Size #2240

Merged
danielaskdd merged 29 commits into HKUDS:main from danielaskdd:limit-vdb-metadata-size
Oct 21, 2025

@danielaskdd (Collaborator) commented Oct 21, 2025

Refact: Limit Vector Database Metadata Size

Problem Statement

In production deployments, entity and relation metadata can grow without bound as documents are continuously ingested. The source_id (chunk IDs) and file_path fields of entities and relations can accumulate thousands of entries, leading to:

  • Performance degradation in vector database operations
  • Increased storage costs
  • Memory pressure during query operations
  • Slower merge operations when processing new documents

Solution Overview

This PR implements a configurable metadata size control system with two key features:

  1. Source ID limiting: Controls the maximum number of chunk IDs stored per entity/relation
  2. File path limiting: Controls the maximum number of file paths displayed in metadata (display-only, doesn't affect query performance)

Both features support two strategies:

  • FIFO (First In First Out): Removes the oldest entries when the limit is reached. Best for evolving knowledge bases, because it keeps the most recent information.
  • KEEP: Keeps the oldest entries and skips new ones when the limit is reached. Best for stable knowledge bases, and faster because it needs fewer merge operations.
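
To make the difference between the two strategies concrete, here is a minimal sketch of how a capped merge could behave. The function name `apply_source_id_limit` and its signature are hypothetical and are not taken from this PR's code; it only illustrates the FIFO/KEEP semantics described above.

```python
from typing import List


def apply_source_id_limit(
    existing: List[str],
    incoming: List[str],
    limit: int,
    method: str = "FIFO",
) -> List[str]:
    """Merge chunk IDs in arrival order, drop duplicates, then cap the list.

    Hypothetical helper for illustration only (not the PR's implementation).
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")

    # dict.fromkeys preserves first-seen order while removing duplicates.
    merged = list(dict.fromkeys(existing + incoming))
    if len(merged) <= limit:
        return merged

    if method == "FIFO":
        # Drop the oldest entries, keep the most recent `limit` IDs.
        return merged[-limit:]
    if method == "KEEP":
        # Keep the oldest entries, ignore anything past the limit.
        return merged[:limit]
    raise ValueError(f"unknown limit method: {method!r}")
```

With a limit of 3, merging `["c1", "c2", "c3"]` with `["c4"]` yields `["c2", "c3", "c4"]` under FIFO and `["c1", "c2", "c3"]` under KEEP.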

Key Features

1. Chunk Tracking System

  • New storage fields: entity_chunks and relation_chunks
  • Tracks which chunks reference each entity/relation
  • Enables accurate deduplication during merging
  • Supports data migration from existing deployments
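
As a rough illustration of what chunk tracking buys during merging, the sketch below keeps an ordered, duplicate-free list of chunk IDs per entity. The class `ChunkTracker` and its methods are assumptions made for this example; the PR stores this data in the entity_chunks and relation_chunks KV storages rather than in an in-memory dict.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class ChunkTracker:
    """In-memory stand-in for an entity_chunks-style store (hypothetical)."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = defaultdict(list)

    def record(self, name: str, chunk_ids: Iterable[str]) -> int:
        """Add chunk IDs for `name`; return how many were actually new."""
        known = self._chunks[name]
        seen = set(known)
        added = 0
        for chunk_id in chunk_ids:
            if chunk_id not in seen:
                known.append(chunk_id)
                seen.add(chunk_id)
                added += 1
        return added

    def chunks_for(self, name: str) -> List[str]:
        """Return a copy of the tracked chunk IDs, oldest first."""
        return list(self._chunks.get(name, []))
```

Because `record` reports how many IDs were new, a caller can tell when an incoming document adds nothing to an entity.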

2. Configurable Limits

New environment variables:

# Source ID limits (affects query performance)
MAX_SOURCE_IDS_PER_ENTITY=300
MAX_SOURCE_IDS_PER_RELATION=300
SOURCE_IDS_LIMIT_METHOD=FIFO

# File path limits (display only)
MAX_FILE_PATHS=100

3. Visual Indicators

  • Truncation indicator (†) in graph UI properties view
  • Tooltip showing truncation method (FIFO/KEEP)

4. Storage Backend Support

Implemented across all KV storage backends, each with data migration and backward compatibility:

  • PostgreSQL
  • MongoDB
  • Redis
  • JSON (default)

5. Enhanced Logging

  • Shows source ID ratios when skipping entities/edges
  • Informative messages about limit methods during rebuild
  • Better debugging information for production deployments

Configuration Details

Default Values

DEFAULT_MAX_SOURCE_IDS_PER_ENTITY = 300
DEFAULT_MAX_SOURCE_IDS_PER_RELATION = 300
DEFAULT_SOURCE_IDS_LIMIT_METHOD = "FIFO"
DEFAULT_MAX_FILE_PATHS = 100
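
For reference, one way these defaults could be combined with the environment variables listed earlier is sketched below. The `MetadataLimits` dataclass and `load_limits` function are hypothetical names used only for this example; the actual configuration loading in the PR may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_MAX_SOURCE_IDS_PER_ENTITY = 300
DEFAULT_MAX_SOURCE_IDS_PER_RELATION = 300
DEFAULT_SOURCE_IDS_LIMIT_METHOD = "FIFO"
DEFAULT_MAX_FILE_PATHS = 100


@dataclass(frozen=True)
class MetadataLimits:
    """Resolved limit settings (hypothetical container for illustration)."""

    max_source_ids_per_entity: int
    max_source_ids_per_relation: int
    source_ids_limit_method: str
    max_file_paths: int


def load_limits(env: Mapping[str, str] = os.environ) -> MetadataLimits:
    """Read the limit settings from `env`, falling back to the defaults."""
    method = env.get("SOURCE_IDS_LIMIT_METHOD", DEFAULT_SOURCE_IDS_LIMIT_METHOD).upper()
    if method not in {"FIFO", "KEEP"}:
        raise ValueError(f"SOURCE_IDS_LIMIT_METHOD must be FIFO or KEEP, got {method!r}")
    return MetadataLimits(
        max_source_ids_per_entity=int(
            env.get("MAX_SOURCE_IDS_PER_ENTITY", DEFAULT_MAX_SOURCE_IDS_PER_ENTITY)
        ),
        max_source_ids_per_relation=int(
            env.get("MAX_SOURCE_IDS_PER_RELATION", DEFAULT_MAX_SOURCE_IDS_PER_RELATION)
        ),
        source_ids_limit_method=method,
        max_file_paths=int(env.get("MAX_FILE_PATHS", DEFAULT_MAX_FILE_PATHS)),
    )
```

Passing a plain dict instead of `os.environ` makes the resolution easy to check in isolation, for example `load_limits({"MAX_FILE_PATHS": "5"})`.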

Limit Strategies

  • FIFO: Best for evolving knowledge bases, keeps most recent information
  • KEEP: Best for stable knowledge bases, faster (fewer merge operations)

Breaking Changes

None. The changes are fully backward compatible:

  • Existing data is automatically migrated on first access
  • Default limits are high enough to not affect small deployments
  • Old behavior can be approximated by setting very high limits

Performance Impact

Positive impacts:

  • Reduced vector database query overhead
  • Lower memory usage during entity/relation merging
  • Faster metadata deserialization

Considerations:

  • Initial migration may take time on large existing databases
  • FIFO strategy requires more merge operations than KEEP

Migration Path

For existing deployments:

  1. Update code to this branch
  2. Update .env with new configuration variables (optional)
  3. Restart LightRAG service
  4. Storage backends will auto-migrate on first access
  5. Monitor logs for migration status
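
The auto-migration in step 4 can be pictured roughly as follows. This is only a sketch under stated assumptions: it assumes legacy records hold their chunk IDs in a single delimiter-joined source_id string, and both the `"<SEP>"` delimiter default and the function name `migrate_chunk_tracking` are assumptions for illustration, not details confirmed by this PR.

```python
from typing import Dict, List


def migrate_chunk_tracking(
    records: Dict[str, dict],
    sep: str = "<SEP>",  # assumed delimiter; adjust to the deployment's actual one
) -> Dict[str, List[str]]:
    """Build a chunk-tracking mapping from legacy records (hypothetical).

    Each record is expected to carry a delimiter-joined `source_id` string.
    Records without one get an empty chunk list.
    """
    chunk_map: Dict[str, List[str]] = {}
    for name, record in records.items():
        raw = record.get("source_id") or ""
        ids = [part for part in raw.split(sep) if part]
        # Remove duplicates while keeping the original order.
        chunk_map[name] = list(dict.fromkeys(ids))
    return chunk_map
```

Running this once over the existing entities (and again over the relations) would produce the initial contents for the new tracking storage; later merges then only need to append.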

divineslight and others added 29 commits October 14, 2025 14:47
- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage
• Add MAX_FILE_PATHS env variable
• Implement file path count limiting
• Support KEEP/FIFO strategies
• Add truncation placeholder
• Remove old build_file_path function
• Add truncate tooltip to source_id field
• Add visual truncation indicator (†)
• Bump API version to 0242
• Add has_placeholder tracking variable
• Detect placeholder patterns in paths
• Show + sign for truncated counts
…unctions

• Move VDB upserts into merge functions
• Fix early return data structure issues
• Update status messages (IGNORE_NEW → KEEP)
• Consolidate error handling paths
• Improve relationship content format
• Entity source IDs: 3 → 300
• Relation source IDs: 3 → 300
• File paths: 2 → 30
• Add numbered steps for clarity
• Improve early return handling
• Enhance file path limiting logic
• Standardize FIFO/KEEP truncation labels
• Update UI truncation text format
• Use proper Redis connection context
• Fix namespace pattern for key scanning
• Propagate storage check exceptions
• Remove defensive error swallowing
- Bump DEFAULT_MAX_FILE_PATHS to 100
- Add clarifying comment about display
@danielaskdd (Collaborator, Author) commented:

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@danielaskdd danielaskdd merged commit aee0afd into HKUDS:main Oct 21, 2025
1 check passed
@danielaskdd danielaskdd deleted the limit-vdb-metadata-size branch October 22, 2025 07:46

3 participants