count prompt tokens #366
Conversation
@@ -178,6 +266,82 @@
DOWNSTREAM_REQUEST_TIMEOUT_SECONDS = 5 * 60  # 5 minutes


# Hack to count prompt tokens
tokenizer_cache: Dict[str, AutoTokenizer] = {}
nit: would we want to make this an LRU cache or something? Not sure how big the tokenizers can get.
~3 MB per model. Good call to add an LRU cache.
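As a sketch of what the LRU-cached loader could look like (assuming the transformers AutoTokenizer from the diff; the cache size and helper name here are illustrative, not the PR's actual code):

```python
from functools import lru_cache

from transformers import AutoTokenizer

# Illustrative bound: at ~3 MB per tokenizer, 16 entries stay well under 50 MB.
TOKENIZER_CACHE_SIZE = 16


@lru_cache(maxsize=TOKENIZER_CACHE_SIZE)
def get_tokenizer(model_name: str) -> AutoTokenizer:
    """Load the tokenizer once per model; least-recently-used entries are evicted."""
    return AutoTokenizer.from_pretrained(model_name)
```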
integration_tests/test_endpoints.py
@@ -234,3 +239,10 @@ def test_sync_streaming_model_endpoint(capsys):
        )
    finally:
        delete_model_endpoint(create_endpoint_request["name"], user)


@pytest.mark.skip(reason="test doesn't currently work, needs to figure out s3 fallback")
Could we use moto? https://github.com/getmoto/moto
For the integration tests before deploy, I'd like to check S3 as well. Since this comment, I've changed the skip to a skipif that only skips in the CircleCI env.
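A minimal sketch of that skipif, assuming CircleCI is detected via its built-in CIRCLECI environment variable (the actual condition and test name in the PR may differ):

```python
import os

import pytest


@pytest.mark.skipif(
    os.getenv("CIRCLECI") == "true",
    reason="S3 fallback isn't reachable from the CircleCI environment",
)
def test_completions_prompt_token_count():
    # Hypothetical test name/body; the real integration test lives in
    # integration_tests/test_endpoints.py.
    ...
```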
def mock_boto3_session(fake_files: List[str]):
    mock_session = mock.Mock()
moto could be an option here, though this mocking can be fine too since you're already at the last layer (the S3 artifact gateway), and we want to avoid mocking across multiple layers.
Ah, moto seems cool, useful to know for the future. I wanted to create a custom side effect, and this approach lends itself more easily to that.
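For readers following along, the custom side effect idea looks roughly like this: a fake S3 client whose download_file writes stand-in content for each fake file. The body below is an illustration, not the PR's actual mock_boto3_session:

```python
from typing import List
from unittest import mock


def mock_boto3_session(fake_files: List[str]):
    """Return a mock boto3 Session whose S3 client 'downloads' only the fake files."""
    mock_session = mock.Mock()
    mock_client = mock.Mock()

    def fake_download_file(bucket, key, filename, *args, **kwargs):
        # Custom side effect: succeed only for keys we declared as fake files.
        if key not in fake_files:
            raise FileNotFoundError(f"s3://{bucket}/{key} is not a fake file")
        with open(filename, "w") as f:
            f.write(f"contents of {key}")

    mock_client.download_file.side_effect = fake_download_file
    mock_session.client.return_value = mock_client
    return mock_session
```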
@@ -1471,6 +1508,11 @@ async def execute(
        args["parameters"]["do_sample"] = False
        if request.return_token_log_probs:
            args["parameters"]["return_details"] = True
        num_prompt_tokens = count_tokens(
I'm okay with us doing tokenization for less-used frameworks, but for the more important models, can we move tokenization into the framework itself?
What about moving this fallback tokenization into the forwarder? That would mean less overhead in the gateway, which also serves high-QPS routes like the get/post task endpoints.
Yeah, IMO the forwarder feels like the more "natural" place to put token counting: we'd only have to download one tokenizer per forwarder, and we offload the computation to something that scales up more in proportion with load.
This does mean the forwarder has to know to carry out this token-counting logic exactly when it's forwarding to an LLM, though, which means there are different "modes" for the forwarder (e.g. not-LLM, where it just passes requests through, and LLM, where it maybe does some specific processing and then passes requests through).
Upstreaming all changes to the framework will be the goal, but this is a temporary stopgap.
On the forwarder needing different "modes" depending on whether it's fronting an LLM: I thought about this for emitting token metrics in the gateway vs. the forwarder as well, and I think it's worth a larger discussion after this PR.
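For context, the gateway-side stopgap boils down to something like the sketch below; count_tokens matches the name in the diff, but this body (and the cached get_tokenizer helper it reuses from the earlier sketch) is illustrative rather than the PR's exact implementation:

```python
def count_tokens(text: str, model_name: str) -> int:
    """Count prompt tokens with the model's tokenizer (cached per model name)."""
    tokenizer = get_tokenizer(model_name)  # e.g. the lru_cache-backed loader sketched above
    return len(tokenizer.encode(text))


# Used when building the completion response, e.g.:
# num_prompt_tokens = count_tokens(request.prompt, model_name)
```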
What's the current ephemeral disk size for model engine pods? Should we add more?
We've set it to 128Mi; this should be sufficient for now.
Unclear to me whether this is enough, since I think model engine previously ran out of disk at around 100 MB of usage?
Looks good for a V0 of token counting. If needed later on, we can consider in-framework tokenization and/or doing tokenization in the forwarder.
We should carefully monitor error rates and token latency/throughput during this rollout.
"model_engine_server.infra.gateways.s3_llm_artifact_gateway.os.makedirs", | ||
lambda *args, **kwargs: None, # noqa | ||
) | ||
def test_s3_llm_artifact_gateway_download_folder(llm_artifact_gateway, fake_files): |
Thank you for bumping up our test coverage! 🤜🏻 🤛🏻
Pull Request Summary
For all completions, return the number of tokens in the prompt as num_prompt_tokens. We get the prompt token count via a waterfall of fallbacks:
Test Plan and Usage Guide
Ran the skipped integration test locally; still need to figure out how to enable it for CircleCI.
Compared token counts to https://huggingface.co/spaces/Xenova/the-tokenizer-playground
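A quick way to reproduce that comparison locally, assuming a transformers tokenizer for the endpoint's model (the model name below is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # substitute the endpoint's model
prompt = "Hello, how are you today?"
print(len(tokenizer.encode(prompt)))  # compare against num_prompt_tokens in the response
```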