Feature Description
I would like LangChain to support approximate token counting for image and multimodal content blocks in the count_tokens_approximately function.
Currently, when count_tokens_approximately encounters a message with a list of content blocks (standard for multimodal models), it falls back to len(repr(message.content)). For messages containing base64-encoded images, this results in counting tens of thousands of characters as if they were text tokens, leading to massive overestimation of the token count.
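To make the overestimate concrete, here is a minimal standalone sketch of the character-based fallback described above. It does not call LangChain. The payload size, the message shape, and the 4.0 characters-per-token ratio are assumptions chosen for illustration.

```python
import base64
import math

# Hypothetical payload: 75,000 zero bytes standing in for a small image file.
fake_image_bytes = bytes(75_000)
b64 = base64.b64encode(fake_image_bytes).decode("ascii")

# A multimodal message body expressed as a list of content blocks.
content = [
    {"type": "text", "text": "What is in this picture?"},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
]

# Assumed ratio for illustration; the real function exposes this as chars_per_token.
CHARS_PER_TOKEN = 4.0

# The fallback described above: count the characters of repr(content).
char_count = len(repr(content))
approx_tokens = math.ceil(char_count / CHARS_PER_TOKEN)

print(f"base64 length: {len(b64)} chars")
print(f"approximate tokens via len(repr(...)): {approx_tokens}")
```

Almost all of the resulting estimate comes from the base64 string, not from the text the user actually wrote.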
This feature would allow users to get a much more accurate (though still approximate) token count for multimodal messages without needing a model-specific tokenizer.
Use Case
I'm trying to build an application that uses trim_messages or other context window management tools with multimodal models.
Currently, I have to work around this by either stripping images from my trimming logic or implementing a custom token counter. If the default "approximate" counter is used, a single base64-encoded image can appear to consume 25,000+ tokens simply because of its string length, causing trim_messages to aggressively discard virtually all other conversation history.
This feature would help users manage context windows for multimodal models more effectively using built-in LangChain utilities.
Proposed Solution
The count_tokens_approximately function in libs/core/langchain_core/messages/utils.py should be updated to:
- Iterate through content blocks when message.content is a list.
- Identify image_url or image blocks (and other multimodal data blocks).
- Apply a fixed "token penalty" for images (e.g., 85 tokens per image, aligned with OpenAI's low-res base penalty) instead of counting characters.
- Sum these penalties along with the character counts of standard text blocks.
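The steps above could look roughly like the following sketch. This is not the actual LangChain implementation: approx_tokens_for_content, IMAGE_BLOCK_TYPES, and IMAGE_TOKEN_PENALTY are hypothetical names, the function handles only a single message's content, and per-message overhead is deliberately left out.

```python
import math

# Hypothetical constants for illustration; 85 is the figure from the proposal.
IMAGE_TOKEN_PENALTY = 85
IMAGE_BLOCK_TYPES = {"image", "image_url"}


def approx_tokens_for_content(content, chars_per_token=4.0,
                              image_penalty=IMAGE_TOKEN_PENALTY):
    """Approximate the token count of one message's content.

    Text is counted by character length; each image block adds a fixed
    penalty instead of being measured by its base64 length.
    """
    if isinstance(content, str):
        return math.ceil(len(content) / chars_per_token)

    text_chars = 0
    image_tokens = 0
    for block in content:
        if isinstance(block, str):
            # Plain strings inside a content list count as text.
            text_chars += len(block)
        elif isinstance(block, dict) and block.get("type") in IMAGE_BLOCK_TYPES:
            # Fixed penalty, independent of the payload size.
            image_tokens += image_penalty
        elif isinstance(block, dict) and block.get("type") == "text":
            text_chars += len(block.get("text", ""))
        else:
            # Unknown block types keep the existing repr-based behaviour.
            text_chars += len(repr(block))

    return math.ceil(text_chars / chars_per_token) + image_tokens


content = [
    {"type": "text", "text": "What is in this picture?"},
    {"type": "image_url",
     "image_url": {"url": "data:image/png;base64," + "A" * 100_000}},
]
print(approx_tokens_for_content(content))  # 6 text tokens + 85 image tokens = 91
```

Falling back to repr for unrecognized block types keeps the change conservative: only blocks positively identified as images are treated differently.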
Alternatives Considered
I've tried using model-specific tokenizers, but these often add unnecessary dependency complexity for a use case where a rough "approximate" count is sufficient.
Alternative approaches considered:
- Ignoring image blocks entirely.
- Modifying chars_per_token specifically for image data.
Neither works well: ignoring image blocks underestimates usage and leads to context window overflows, and adjusting chars_per_token cannot help because base64 string length has no correlation with how models actually tokenize visual data.
Additional Context
I have already implemented this fix in a local fork, verified it with a new test suite handling 100k+ character base64 strings, and ensured zero regressions by running the existing 141 message utility tests. I am ready to submit the PR as soon as this is reviewed.