Description
Do you need to file an issue?
- I have searched the existing issues and this bug is not already filed.
- I believe this is a legitimate bug, not just a question or feature request.
Describe the bug
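When a large document is inserted with split_by_character_only=True and the split character never occurs in it, the document is not chunked at all. The whole text is sent to the embedding API as a single oversized chunk, and the call runs excessively long until it times out.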
Steps to reproduce
Reproduction Steps:
- Create a 300 KB .txt file that does not contain \n\n.
- Call:
await rag.ainsert(f.read(), split_by_character="\n\n", split_by_character_only=True)
Cause:
if split_by_character:
    raw_chunks = content.split(split_by_character)
    new_chunks = []
    if split_by_character_only:
        for chunk in raw_chunks:
            _tokens = tokenizer.encode(chunk)
            new_chunks.append((len(_tokens), chunk))
    else:
        # ... (token-based splitting logic)

When the function enters the split_by_character_only=True branch, it does not enforce any maximum token limit on individual chunks. As a result, if the input contains no \n\n, the entire document becomes a single chunk, potentially far exceeding the allowed token size. This causes the subsequent embedding API call to fail (e.g., due to timeout or payload size limits).
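The behavior described above can be reproduced without LightRAG. The following is only a minimal sketch: WhitespaceTokenizer and chunk_by_character_only are hypothetical stand-ins I wrote for illustration (the real tokenizer and chunking function differ), but the splitting logic mirrors the branch quoted above.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


def chunk_by_character_only(content, split_by_character, tokenizer):
    # Mirrors the split_by_character_only=True branch: split on the delimiter,
    # count tokens per piece, and apply no upper bound to any piece.
    raw_chunks = content.split(split_by_character)
    return [(len(tokenizer.encode(chunk)), chunk) for chunk in raw_chunks]


# 60,000 repetitions of "word " is 300,000 characters with no "\n\n" anywhere.
content = "word " * 60_000
chunks = chunk_by_character_only(content, "\n\n", WhitespaceTokenizer())

# A single chunk holding every token of the document.
print(len(chunks), chunks[0][0])  # prints: 1 60000
```

Because str.split returns the whole string as one element when the delimiter is absent, the single chunk carries the full token count of the document.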
Expected Behavior
Fixes:
- Truncate and warn: If a single chunk exceeds chunk_token_size, automatically truncate it to the maximum allowed size and output a warning message.
- Fail fast: Immediately raise an exception to abort the current pipeline when a chunk exceeds the token limit.
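Both options could be sketched roughly as follows. This is not a patch against LightRAG: enforce_chunk_limit, ChunkTooLargeError, the on_oversize parameter, and WhitespaceTokenizer are all hypothetical names used for illustration, and the sketch assumes the tokenizer offers a decode method alongside encode.

```python
import logging

logger = logging.getLogger("chunking_sketch")  # hypothetical logger name


class ChunkTooLargeError(ValueError):
    """Hypothetical exception type for the fail-fast option."""


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def enforce_chunk_limit(chunk, tokenizer, chunk_token_size, on_oversize="truncate"):
    """Return (token_count, text), applying the chosen policy to oversized chunks."""
    tokens = tokenizer.encode(chunk)
    if len(tokens) <= chunk_token_size:
        return len(tokens), chunk
    if on_oversize == "raise":
        # Fail fast: abort before the chunk reaches the embedding call.
        raise ChunkTooLargeError(
            f"chunk has {len(tokens)} tokens, limit is {chunk_token_size}"
        )
    # Truncate and warn: keep only the first chunk_token_size tokens.
    logger.warning(
        "Truncating chunk from %d to %d tokens", len(tokens), chunk_token_size
    )
    kept = tokens[:chunk_token_size]
    return len(kept), tokenizer.decode(kept)


tok = WhitespaceTokenizer()
size, text = enforce_chunk_limit("a b c d e", tok, chunk_token_size=3)
print(size, text)  # prints: 3 a b c
```

In the split_by_character_only branch, a check like this would run on each chunk before it is appended to new_chunks. Truncation silently drops the rest of the document, so failing fast may be the safer default.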
Finally, I would like to try fixing this issue myself.
LightRAG Config Used
Paste your config here
Logs and screenshots
INFO: Processing 1 document(s)
INFO: Extracting stage 1/1: unknown_source
INFO: Processing d-id: doc-4a637c82b5089e335b8b003b3abe2faf
WARNING: Embedding func: Worker timeout for task 2031004401728_528076.468 after 60s
ERROR: Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 907, in wait_func
    return await future
           ^^^^^^^^^^^^
lightrag.utils.WorkerTimeoutError: Worker execution timeout after 60s

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\lightrag.py", line 1886, in process_document
    await asyncio.gather(*first_stage_tasks)
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\kg\nano_vector_db_impl.py", line 123, in upsert
    embeddings_list = await asyncio.gather(*embedding_tasks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 933, in wait_func
    raise TimeoutError(f"{queue_name}: {str(e)}")
TimeoutError: Embedding func: Worker execution timeout after 60s
ERROR: Failed to extract document 1/1: unknown_source
INFO: Enqueued document processing pipeline stopped
Additional Information
- LightRAG Version:
- Operating System:
- Python Version:
- Related Issues: