
[Bug]: Chunk size too large causes embedding API call to time out #2387

@EightyOliveira

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

When a document is too large and is not split into appropriately sized chunks, the oversized chunk makes the embedding API call run excessively long and time out.

Steps to reproduce

Reproduction Steps:

  • Create a 300 KB .txt file that does not contain \n\n.
  • Call: await rag.ainsert(f.read(), split_by_character="\n\n", split_by_character_only=True)
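
The steps above can be sketched as a small script. This is a hypothetical reproduction, assuming `rag` is an already-initialized LightRAG instance (storage and embedding configuration omitted); `make_payload` is a helper introduced here just to build the problematic input:

```python
import asyncio

def make_payload(size_kb: int = 300) -> str:
    # Build ~300 KB of text using single newlines only, so that
    # splitting on "\n\n" yields exactly one chunk.
    line = "lorem ipsum dolor sit amet " * 8 + "\n"
    repeats = (size_kb * 1024) // len(line) + 1
    return (line * repeats)[: size_kb * 1024]

async def reproduce(rag):
    content = make_payload()
    assert "\n\n" not in content
    # With split_by_character_only=True, the whole document becomes a
    # single oversized chunk and the embedding call eventually times out.
    await rag.ainsert(content, split_by_character="\n\n",
                      split_by_character_only=True)
```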

Cause:

if split_by_character:
    raw_chunks = content.split(split_by_character)
    new_chunks = []
    if split_by_character_only:
        for chunk in raw_chunks:
            _tokens = tokenizer.encode(chunk)
            new_chunks.append((len(_tokens), chunk))
    else:
        # ... (token-based splitting logic)
        pass

When the function enters the split_by_character_only=True branch, it does not enforce any maximum token limit on individual chunks. As a result, if the input contains no \n\n, the entire document becomes a single chunk—potentially far exceeding the allowed token size. This causes the subsequent embedding API call to fail (e.g., due to timeout or payload size limits).
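The core of the problem can be shown without LightRAG at all: `str.split` on a separator that never occurs returns the entire string as one element, so the branch above produces a single unbounded chunk.

```python
# Stand-in for a large document that contains no "\n\n" separator.
content = "a" * 10_000

raw_chunks = content.split("\n\n")
print(len(raw_chunks))  # 1 -> the whole document is a single chunk
```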

Expected Behavior

Fixes:

  1. Truncate and warn:
    If a single chunk exceeds chunk_token_size, automatically truncate it to the maximum allowed size and output a warning message.

  2. Fail fast:
    Immediately raise an exception to abort the current pipeline when a chunk exceeds the token limit.

Finally, I would like to try fixing this issue myself.

LightRAG Config Used

Paste your config here

Logs and screenshots

INFO: Processing 1 document(s)
INFO: Extracting stage 1/1: unknown_source
INFO: Processing d-id: doc-4a637c82b5089e335b8b003b3abe2faf
WARNING: Embedding func: Worker timeout for task 2031004401728_528076.468 after 60s
ERROR: Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 907, in wait_func
    return await future
           ^^^^^^^^^^^^
lightrag.utils.WorkerTimeoutError: Worker execution timeout after 60s

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\lightrag.py", line 1886, in process_document
    await asyncio.gather(*first_stage_tasks)
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\kg\nano_vector_db_impl.py", line 123, in upsert
    embeddings_list = await asyncio.gather(*embedding_tasks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 933, in wait_func
    raise TimeoutError(f"{queue_name}: {str(e)}")
TimeoutError: Embedding func: Worker execution timeout after 60s

ERROR: Failed to extract document 1/1: unknown_source
INFO: Enqueued document processing pipeline stopped

Additional Information

  • LightRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
