
Refact: Add Embedding Dimension Validation in EmbeddingFunc#2368

Merged
danielaskdd merged 2 commits into HKUDS:main from danielaskdd:milvus-vector-batching on Nov 17, 2025

@danielaskdd (Collaborator)

🎯 Add Embedding Dimension Validation in EmbeddingFunc

Problem Statement

When using custom OpenAI-compatible embedding endpoints (or other embedding providers), dimension mismatches between expected and actual embedding outputs could cause runtime errors at the storage layer. This was particularly problematic with Milvus, where errors like:

MilvusException: (code=65535, message=the length(106496) of float data should divide the dim(3072))

would occur only when data reached the vector database, making debugging difficult and potentially causing data corruption.
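
The numbers in that message can be checked by hand. A short sketch of the arithmetic follows; reading 106496 as 104 vectors of 1024 floats is only one factorization consistent with the message, an assumption for illustration and not something reported in this PR.

```python
# Numbers taken from the Milvus error message above.
total_floats = 106496
expected_dim = 3072

# Milvus needs the flat float buffer to split evenly into vectors of expected_dim.
quotient, remainder = divmod(total_floats, expected_dim)
print(quotient, remainder)  # 34 2048

# Assumption for illustration only: the endpoint returned 1024-dimensional vectors.
assumed_dim = 1024
print(total_floats % assumed_dim, total_floats // assumed_dim)  # 0 104
```

A non-zero remainder is exactly the condition the new check in EmbeddingFunc raises on.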

Root Cause

Embedding providers may return results in varying formats or with incorrect dimensions due to:

  • API misconfiguration
  • Model version mismatches
  • Data format inconsistencies (base64 vs raw arrays)
  • Batch processing errors during concatenation

Previously, dimension validation only happened implicitly at the storage layer, meaning invalid embeddings could propagate through the system before being detected.
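
To illustrate the "base64 vs raw arrays" cause listed above, here is a minimal sketch of how decoding the same bytes with the wrong dtype changes the element count. The payload and the dtypes are hypothetical and are not taken from any specific provider.

```python
import base64

import numpy as np

# Hypothetical payload: one 4-dimensional float32 vector, base64-encoded.
vector = np.arange(4, dtype=np.float32)
payload = base64.b64encode(vector.tobytes())

raw = base64.b64decode(payload)
as_float32 = np.frombuffer(raw, dtype=np.float32)  # matches the encoding
as_float64 = np.frombuffer(raw, dtype=np.float64)  # wrong dtype, half as many elements

print(as_float32.size)  # 4
print(as_float64.size)  # 2
```

Either array would be accepted by code that never looks at the element count, which is why the mismatch used to surface only at the storage layer.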

Solution

This PR implements centralized dimension validation in the EmbeddingFunc class (lightrag/utils.py), ensuring all embedding results are validated immediately after generation, before reaching any storage backend.

Implementation Details

Validation Logic (in EmbeddingFunc.__call__):

# Validate using total element count (efficient O(1) check)
total_elements = result.size
if total_elements % expected_dim != 0:
    raise ValueError(
        f"Embedding dimension mismatch detected: "
        f"total elements ({total_elements}) cannot be evenly divided by "
        f"expected dimension ({expected_dim})."
    )

# Optional: Verify vector count matches input text count
actual_vectors = total_elements // expected_dim
if actual_vectors != expected_vectors:
    raise ValueError(
        f"Vector count mismatch: expected {expected_vectors} vectors "
        f"but got {actual_vectors} vectors."
    )
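
For readers who want to run the two checks in isolation, here is a self-contained sketch. `ValidatedEmbedding`, `fake_embed` and `try_embed` are hypothetical names, not the LightRAG classes, and it assumes that `expected_dim` comes from a configured `embedding_dim` attribute and that `expected_vectors` is the number of input texts.

```python
import asyncio

import numpy as np


class ValidatedEmbedding:
    """Hypothetical stand-in for EmbeddingFunc; not the LightRAG class."""

    def __init__(self, embedding_dim, func):
        self.embedding_dim = embedding_dim
        self.func = func

    async def __call__(self, texts):
        result = await self.func(texts)
        expected_dim = self.embedding_dim  # assumption: configured on the wrapper
        expected_vectors = len(texts)      # assumption: one vector per input text

        total_elements = result.size
        if total_elements % expected_dim != 0:
            raise ValueError(
                f"Embedding dimension mismatch detected: "
                f"total elements ({total_elements}) cannot be evenly divided by "
                f"expected dimension ({expected_dim})."
            )

        actual_vectors = total_elements // expected_dim
        if actual_vectors != expected_vectors:
            raise ValueError(
                f"Vector count mismatch: expected {expected_vectors} vectors "
                f"but got {actual_vectors} vectors."
            )
        return result


async def fake_embed(texts):
    # Hypothetical provider that always returns 3-dimensional vectors.
    return np.zeros((len(texts), 3), dtype=np.float32)


def try_embed(embedding_dim, texts):
    """Return the result shape, or the error message if validation fails."""
    embed = ValidatedEmbedding(embedding_dim, fake_embed)
    try:
        return asyncio.run(embed(texts)).shape
    except ValueError as exc:
        return str(exc)


print(try_embed(3, ["a", "b"]))  # (2, 3)
print(try_embed(4, ["a", "b"]))  # dimension mismatch message
```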

Key Advantages

  1. Early Detection - Catches dimension errors at the source, not at storage time
  2. Universal Coverage - Applies to ALL storage backends (Milvus, PostgreSQL, MongoDB, Neo4j, etc.)
  3. Performance - A single modulo on the total element count (O(1)), with no per-vector shape inspection
  4. Shape Agnostic - Works with any array shape (1D, 2D, flattened)
  5. Clear Error Messages - Provides actionable debugging information
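
A small sketch of the "shape agnostic" point, and of why the vector-count check is still needed alongside the modulo check. The shapes and dimensions are made up for illustration.

```python
import numpy as np

expected_dim = 4

# The same 8 floats laid out as 1D, 2D, or with an extra leading axis.
shapes = [(8,), (2, 4), (1, 2, 4)]
sizes = [np.zeros(shape).size for shape in shapes]
print(sizes)  # [8, 8, 8]

# A case the modulo check alone misses: 2 texts embedded at dimension 8
# while dimension 4 is expected. The total is still divisible by 4.
wrong = np.zeros((2, 8))
modulo = wrong.size % expected_dim
implied_vectors = wrong.size // expected_dim
print(modulo, implied_vectors)  # 0 4
```

Here the implied vector count (4) differs from the 2 input texts, so the second check reports the mismatch.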

Changes Made

Modified Files

  1. lightrag/utils.py

    • Added dimension validation in EmbeddingFunc.__call__ method
    • Validates total elements divisibility by expected dimension
    • Validates vector count matches input text count
  2. lightrag/kg/milvus_impl.py

    • Reverted to original np.concatenate() implementation
    • Removed temporary workarounds that were masking the root cause
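
As a reminder of what plain np.concatenate() does with per-batch results, here is a hedged sketch. The batch arrays are hypothetical; the actual batching code in milvus_impl.py is not reproduced in this description.

```python
import numpy as np

# Hypothetical per-batch embedding results with a consistent dimension of 4.
batch_a = np.zeros((2, 4), dtype=np.float32)
batch_b = np.zeros((3, 4), dtype=np.float32)
merged = np.concatenate([batch_a, batch_b])
print(merged.shape)  # (5, 4)

# A batch with a different trailing dimension cannot be concatenated.
bad_batch = np.zeros((3, 3), dtype=np.float32)
try:
    np.concatenate([batch_a, bad_batch])
    concat_error = None
except ValueError as exc:
    concat_error = str(exc)
print(concat_error is not None)  # True
```

With validation now done in EmbeddingFunc, each batch should already have the expected dimension before it reaches this step.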

Benefits

Data Integrity - Prevents invalid embeddings from entering the system
Better Debugging - Clear error messages at the point of failure
Storage Agnostic - Protects all vector storage implementations
Negligible Performance Impact - Overhead is limited to a modulo check and an integer comparison per call
Backward Compatible - No breaking changes to existing functionality

Testing Recommendations

  • ✅ Test with various embedding providers (OpenAI, custom endpoints, local models)
  • ✅ Verify error messages are clear when dimension mismatches occur
  • ✅ Confirm no performance regression in embedding generation
  • ✅ Test with different batch sizes to ensure validation works correctly
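
The batch-size recommendation could be exercised along these lines. `check_embeddings` and `run_batch_cases` are hypothetical helpers that mirror the two checks described above; they are not part of LightRAG, and the dimensions and batch sizes are arbitrary.

```python
import numpy as np


def check_embeddings(result, expected_dim, expected_vectors):
    """Hypothetical helper mirroring the two checks described in this PR."""
    if result.size % expected_dim != 0:
        raise ValueError("Embedding dimension mismatch detected")
    if result.size // expected_dim != expected_vectors:
        raise ValueError("Vector count mismatch")


def run_batch_cases(expected_dim=8, batch_sizes=(1, 2, 16, 33)):
    """Feed well-formed and off-by-one-dimension batches through the checks."""
    outcomes = {}
    for n in batch_sizes:
        good = np.zeros((n, expected_dim))
        bad = np.zeros((n, expected_dim + 1))  # provider returns one extra dimension
        check_embeddings(good, expected_dim, n)  # must not raise
        try:
            check_embeddings(bad, expected_dim, n)
            outcomes[n] = "missed"
        except ValueError as exc:
            outcomes[n] = str(exc)
    return outcomes


outcomes = run_batch_cases()
for size, outcome in outcomes.items():
    print(size, outcome)
```

With these values, batch size 16 is the interesting case: 16 x 9 = 144 elements is divisible by 8, so only the vector-count check catches it.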

Breaking Changes

None - This change is fully backward compatible. Valid embeddings continue to work as before; only invalid embeddings now fail with clearer error messages.

Migration Guide

No migration required. This change adds validation without modifying any APIs or data formats.


Example Error Output

Before this PR:

MilvusException: (code=65535, message=the length(106496) of float data should divide the dim(3072))

After this PR:

ValueError: Embedding dimension mismatch detected: total elements (106496) cannot be evenly divided by expected dimension (3072).

Much clearer! 🎉


Related Issues: #2365
Type: Bug Fix / Improvement
Component: Core - Embedding System

• Validate total elements divisibility
• Check vector count matches input count
• Raise clear error messages on mismatch
• Ensure embedding output correctness
• Add docstring for EmbeddingFunc class

@danielaskdd (Collaborator, Author)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 👍


@danielaskdd danielaskdd merged commit 8bb5483 into HKUDS:main Nov 17, 2025
1 check passed
@danielaskdd danielaskdd deleted the milvus-vector-batching branch November 17, 2025 06:03