feat(core): add multimodal support to count_tokens_approximately #34883
- Add tokens_per_image parameter for fixed image token penalty
- Handle list-based content blocks (text, image, image_url)
- Prevent massive overestimation from base64-encoded images
- Maintain backward compatibility with string content
- Add comprehensive test coverage for multimodal scenarios

Fixes token counting for multimodal messages where base64 images were previously counted as 25,000+ tokens instead of ~85 tokens.
Merging this PR will not alter performance
ccurme (ccurme) left a comment:
Nice, thank you!
Summary
This PR adds multimodal support to count_tokens_approximately so that it handles image content blocks properly. Previously, a base64-encoded image was counted as ~25,000 tokens; it now incurs a fixed penalty of ~85 tokens, which gives a more accurate approximation.

Fixes #34873
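As a rough illustration of where a figure like ~25,000 tokens can come from, the sketch below applies a character-based estimate to a base64 payload. The 4-characters-per-token ratio and the 75 kB image size are assumptions chosen for the arithmetic, not values taken from this PR; only the 85-token penalty comes from the description above.

```python
import base64
import math

CHARS_PER_TOKEN = 4.0   # assumed heuristic: roughly four characters per token
TOKENS_PER_IMAGE = 85   # fixed per-image penalty described in this PR


def naive_token_estimate(text: str) -> int:
    """Character-based estimate that treats everything as plain text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# Stand-in for a 75 kB image; the byte values do not matter, only the length.
image_bytes = bytes(75_000)
encoded = base64.b64encode(image_bytes).decode("ascii")

# Base64 emits 4 characters for every 3 input bytes.
naive = naive_token_estimate(encoded)

print(f"base64 length: {len(encoded)} characters")
print(f"naive estimate: {naive} tokens")
print(f"fixed penalty:  {TOKENS_PER_IMAGE} tokens")
```

Under these assumptions the text-based estimate is about 300 times larger than the fixed penalty, which is enough to exhaust a typical trimming budget with a single image.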
Fixes the issue where trim_messages and other context management tools fail with multimodal content due to massive token overestimation.

Changes
- Add tokens_per_image parameter to count_tokens_approximately (default: 85)

Testing
Ran the following commands from libs/core:
- make format ✅
- make test ✅ (1635 passed, 3 skipped)

Note: make lint shows 1 error in scripts/check_version.py (line too long), but this is a pre-existing issue in a file not modified by this PR.

All 145 tests in test_utils.py pass, including 4 new multimodal tests.

Breaking Changes
None. This change is fully backward compatible.
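The behaviour described in this PR can be sketched with a small stand-alone function. This is not the langchain-core implementation: approx_content_tokens is a hypothetical helper, the 4-characters-per-token ratio is an assumption, and per-message overhead is ignored. It only shows how string content can keep its existing estimate while list-based content is counted block by block.

```python
import math
from typing import Union


def approx_content_tokens(
    content: Union[str, list],
    chars_per_token: float = 4.0,   # assumed ratio, not taken from the PR
    tokens_per_image: int = 85,     # default named in the PR description
) -> int:
    """Hypothetical helper: approximate tokens for one message's content."""
    if isinstance(content, str):
        # Plain string content keeps the character-based estimate.
        return math.ceil(len(content) / chars_per_token)

    chars = 0
    images = 0
    for block in content:
        if isinstance(block, str):
            chars += len(block)
        elif block.get("type") == "text":
            chars += len(block.get("text", ""))
        elif block.get("type") in ("image", "image_url"):
            # Count the image once; never measure its base64 payload.
            images += 1
        else:
            # Unknown block types fall back to their textual representation.
            chars += len(repr(block))
    return math.ceil(chars / chars_per_token) + images * tokens_per_image


prompt = "Describe this image."
blocks = [
    {"type": "text", "text": prompt},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64," + "A" * 100_000}},
]

string_tokens = approx_content_tokens(prompt)
block_tokens = approx_content_tokens(blocks)
print(string_tokens, block_tokens)
```

With this sketch the image adds exactly tokens_per_image to the text-only estimate, regardless of how large the encoded payload is.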
Example