
[Bug]: Chunk size too large causes embedding API call to time out #2387

@EightyOliveira

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

When a document is too large and is not split into appropriately sized chunks, the oversized chunk makes the embedding API call run excessively long and time out.

Steps to reproduce

Reproduction Steps:

  • Create a 300 KB .txt file that does not contain \n\n.
  • Call: await rag.ainsert(f.read(), split_by_character="\n\n", split_by_character_only=True)
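
The steps above can be sketched as a small script. This is a hypothetical reproduction, assuming `rag` is an already-initialized LightRAG instance (storage and embedding configuration omitted); `make_payload` is a helper introduced here just to build the problematic input:

```python
import asyncio

def make_payload(size_kb: int = 300) -> str:
    # Build ~300 KB of text using single newlines only, so that
    # splitting on "\n\n" yields exactly one chunk.
    line = "lorem ipsum dolor sit amet " * 8 + "\n"
    repeats = (size_kb * 1024) // len(line) + 1
    return (line * repeats)[: size_kb * 1024]

async def reproduce(rag):
    content = make_payload()
    assert "\n\n" not in content
    # With split_by_character_only=True, the whole document becomes a
    # single oversized chunk and the embedding call eventually times out.
    await rag.ainsert(content, split_by_character="\n\n",
                      split_by_character_only=True)
```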

Cause:

if split_by_character:
    raw_chunks = content.split(split_by_character)
    new_chunks = []
    if split_by_character_only:
        for chunk in raw_chunks:
            _tokens = tokenizer.encode(chunk)
            new_chunks.append((len(_tokens), chunk))
    else:
        # ... (token-based splitting logic)
        pass

When the function enters the split_by_character_only=True branch, it does not enforce any maximum token limit on individual chunks. As a result, if the input contains no \n\n, the entire document becomes a single chunk—potentially far exceeding the allowed token size. This causes the subsequent embedding API call to fail (e.g., due to timeout or payload size limits).
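The core of the problem can be shown without LightRAG at all: `str.split` on a separator that never occurs returns the entire string as one element, so the branch above produces a single unbounded chunk.

```python
# Stand-in for a large document that contains no "\n\n" separator.
content = "a" * 10_000

raw_chunks = content.split("\n\n")
print(len(raw_chunks))  # 1 -> the whole document is a single chunk
```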

Expected Behavior

Fixes:

  1. Truncate and warn:
    If a single chunk exceeds chunk_token_size, automatically truncate it to the maximum allowed size and output a warning message.

  2. Fail fast:
    Immediately raise an exception to abort the current pipeline when a chunk exceeds the token limit.

Finally, I would like to try fixing this issue myself.

LightRAG Config Used

Paste your config here

Logs and screenshots

INFO: Processing 1 document(s)
INFO: Extracting stage 1/1: unknown_source
INFO: Processing d-id: doc-4a637c82b5089e335b8b003b3abe2faf
WARNING: Embedding func: Worker timeout for task 2031004401728_528076.468 after 60s
ERROR: Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 907, in wait_func
    return await future
           ^^^^^^^^^^^^
lightrag.utils.WorkerTimeoutError: Worker execution timeout after 60s

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\lightrag.py", line 1886, in process_document
    await asyncio.gather(*first_stage_tasks)
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\kg\nano_vector_db_impl.py", line 123, in upsert
    embeddings_list = await asyncio.gather(*embedding_tasks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonWorkSpace\00code\LightRAG\lightrag\utils.py", line 933, in wait_func
    raise TimeoutError(f"{queue_name}: {str(e)}")
TimeoutError: Embedding func: Worker execution timeout after 60s

ERROR: Failed to extract document 1/1: unknown_source
INFO: Enqueued document processing pipeline stopped

Additional Information

  • LightRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
