Skip to content

Support approximate token counting for image blocks in langchain-core #34873

@subhashyadavon

Description

Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-perplexity
  • langchain-prompty
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Feature Description

I would like LangChain to support approximate token counting for image and multimodal content blocks in the
count_tokens_approximately function.

Currently, when count_tokens_approximately encounters a message with a list of content blocks (standard for multimodal models), it falls back to len(repr(message.content)). For messages containing base64-encoded images, this results in counting tens of thousands of characters as if they were text tokens, leading to massive overestimation of the token count.

This feature would allow users to get a much more accurate (though still approximate) token count for multimodal messages without needing a model-specific tokenizer.

Use Case

I'm trying to build an application that uses trim_messages or other context window management tools with multimodal models.

Currently, I have to work around this by either stripping images from my trimming logic or implementing a custom token counter. If the default "approximate" counter is used, a single base64-encoded image can appear to consume 25,000+ tokens simply because of its string length, causing trim_messages to aggressively discard virtually all other conversation history.

This feature would help users manage context windows for multimodal models more effectively using built-in LangChain utilities.

Proposed Solution

The count_tokens_approximately function in libs/core/langchain_core/messages/utils.py should be updated to:

  1. Iterate through content blocks when message.content is a list.
  2. Identify image_url or image blocks (and other multimodal data blocks).
  3. Apply a fixed "token penalty" for images (e.g., 85 tokens per image, aligned with OpenAI's low-res base penalty) instead of counting characters.
  4. Sum these penalties along with the character counts of standard text blocks.

Alternatives Considered

I've tried using model-specific tokenizers, but these often add unnecessary dependency complexity for a use case where a rough "approximate" count is sufficient.

Alternative approaches considered:

  1. Ignoring image blocks entirely.
  2. Modifying chars_per_token specifically for image data.

These don't work well because ignoring them leads to context window overflows, and base64 string length has no correlation with actual model tokenization of visual data.

Additional Context

I have already implemented this fix in a local fork, verified it with a new test suite handling 100k+ character base64 strings, and ensured zero regressions by running the existing 141 message utility tests. I am ready to submit the PR as soon as this is reviewed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    core`langchain-core` package issues & PRsexternalfeature requestRequest for an enhancement / additional functionality
    No fields configured for Feature.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions