
Conversation

noooop
Contributor

@noooop noooop commented Sep 22, 2025

Purpose

After #21227 landed, we hope pooling models can always use all pooling, without users having to enable it manually.

The current encode API (the /pooling API) mainly targets the classify-each-token scenario (e.g. TokenClassification #24872 & reward models) and overlooks the embed-each-token scenario.

Let's support the embed-each-token scenario (multi-vector retrieval).

Partially fixes #25165

We are stepping closer to supporting ColBERT & ColPali.
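For context, multi-vector retrieval in the ColBERT style keeps one embedding per token and scores a query against a document via MaxSim (late interaction). A minimal pure-Python sketch of that scoring, purely illustrative and not vLLM code:

```python
def maxsim_score(query_vecs, doc_vecs):
    # ColBERT-style late interaction: for every query token embedding,
    # take the maximum dot product over all document token embeddings,
    # then sum these maxima into a single relevance score.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]  # 2 query tokens, dim 2
doc = [[0.8, 0.2], [0.1, 0.9]]    # 2 document tokens, dim 2
print(maxsim_score(query, doc))   # 0.8 + 0.9
```

This is exactly the scenario that needs per-token embeddings rather than a single pooled vector per prompt.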

cc @DarkLight1337 @maxdebayser

(Slight) Breaking change

  • Split the encode task into two tasks: token_embed and token_classify
    • token_embed is the same as embed, using normalize as the activation
    • token_classify is the same as classify, using softmax as the default activation (classify and token_classify can actually use any activation function via act_fn)
    • Use the following code for compatibility:
def encode2pooling_task(supported_tasks):
    # Currently no model supports both token_embed and token_classify.
    if "token_embed" in supported_tasks:
        return "token_embed"
    elif "token_classify" in supported_tasks:
        return "token_classify"
    else:
        raise ValueError(f"pooling_task must be one of {supported_tasks}.")
  • Completely remove softmax from PoolingParams and prefer using activation, since classify and token_classify can actually use any activation function.
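As a quick sanity check, the compatibility mapping above resolves the legacy encode task like this (the supported_tasks sets below are made up for illustration):

```python
def encode2pooling_task(supported_tasks):
    # Currently no model supports both token_embed and token_classify.
    if "token_embed" in supported_tasks:
        return "token_embed"
    elif "token_classify" in supported_tasks:
        return "token_classify"
    else:
        raise ValueError(f"pooling_task must be one of {supported_tasks}.")

# An embedding model would advertise token_embed, a reward or
# token-classification model token_classify (hypothetical task sets):
print(encode2pooling_task({"embed", "token_embed"}))        # token_embed
print(encode2pooling_task({"classify", "token_classify"}))  # token_classify
```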

By the way, in #20538:

Consider only the activation parameter for classification (score) models
Consider only the softmax parameter for PoolingType.ALL reward models
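For reference, softmax is just one possible per-token activation over the class logits; a minimal sketch of what it computes (illustrative, not vLLM's implementation):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's class logits:
    # subtract the max before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print(probs)  # probabilities summing to 1, largest for the largest logit
```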

Test Plan

tests/models/language/pooling/test_multi_vector_retrieval.py
tests/test_pooling_params.py

Test Result

pass

Known Issues

  • Maybe we should find a way to support chunked prefill + all pooling (and mean pooling)
  • Support ColBERT & ColPali

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added documentation Improvements or additions to documentation qwen Related to Qwen models labels Sep 22, 2025
Signed-off-by: wang.yuqi <[email protected]>
@noooop
Contributor Author

noooop commented Sep 22, 2025

@jupyterjazz

Try this:

from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v4-vllm-text-matching",
    enforce_eager=True,
    max_model_len=1024,
    enable_chunked_prefill=False,  # <- In order to use the encode api
    runner="pooling")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    print(len(embeds))

outputs = llm.encode(prompts, pooling_task="encode")

for prompt, output in zip(prompts, outputs):
    multi_vector = output.outputs.data
    print(multi_vector.shape)

Are you OK with this API and its outputs?


There are still some broken features that need to be fixed, but the multi_vector feature is now testable.

@mergify mergify bot added the frontend label Sep 23, 2025
@mergify mergify bot added the v1 label Sep 23, 2025
@noooop
Contributor Author

noooop commented Sep 23, 2025

@DarkLight1337

Ready for review


@jupyterjazz

jupyterjazz commented Sep 23, 2025

Hi @noooop,

I just tested and it works fine. The only thing missing was override_pooler_config=PoolerConfig(normalize=False). I don't have a strong opinion on this, but setting normalization to False by default during encode could make sense because oftentimes you would want the actual last hidden layer at this step. Other than this, the API looks good to me. Thank you for such a quick fix!

Comment on lines 140 to 143

    def _set_default_parameters(self, model_config: Optional["ModelConfig"]):
-       if self.task == "embed":
+       if self.task in ["embed", "token_embed"]:
            if self.normalize is None:
                self.normalize = True
Contributor Author

@noooop noooop Sep 24, 2025


@DarkLight1337

but setting normalization to False by default during encode could make sense because oftentimes you would want the actual last hidden layer at this step.

Do you prefer to have both embed and token_embed use normalize = True by default, or to have token_embed use normalize = False by default?

Personally, I hope that all similar APIs behave consistently to reduce the learning burden on users.

Or we could add a new API for getting the last hidden states, with normalize = False by default. (Though wouldn't adding too many specialized APIs also increase the user burden and maintenance costs?)


@jupyterjazz

After #20538 landed, you can control whether to normalize via pooling_params = PoolingParams(normalize=False) offline, or via the normalize parameter per request online.

So you don't need override_pooler_config=PoolerConfig(normalize=False) at setup time; you can set it at any time.
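For reference, the normalize flag here means L2-normalizing each output vector to unit length; a minimal sketch of the effect (illustrative, not vLLM's implementation):

```python
import math

def l2_normalize(vec):
    # Scale the vector so its Euclidean (L2) norm is 1, which is what
    # normalize=True does to each embedding vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

With normalize=False you would instead get the raw hidden states back.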

Comment on lines 1726 to +1735

        engine_client,
        vllm_config,
        state.openai_serving_models,
+       pooling_task=encode2pooling_task(supported_tasks),
        request_logger=request_logger,
        chat_template=resolved_chat_template,
        chat_template_content_format=args.chat_template_content_format,
        log_error_stack=args.log_error_stack,
-   ) if "encode" in supported_tasks else None
+   ) if ("token_embed" in supported_tasks
+         or "token_classify" in supported_tasks) else None
Contributor Author

@noooop noooop Sep 24, 2025


@DarkLight1337

IMO we should treat each task separately. Having sub-tasks just makes things more confusing
The user should pass in the task explicitly

For offline scenarios, we can add llm.token_embed and llm.token_classify, and gradually discourage users from calling llm.encode directly.

For online scenarios, we need to adaptively select token_embed or token_classify somewhere, using something like the encode2pooling_task method, unless you want to split the /pooling API into /pooling_token_embed and /pooling_token_classify.

I personally feel that /pooling_token_embed and /pooling_token_classify look terrible; the online /pooling API is not ready for major changes yet. We can collect usage scenarios for a while.


This PR adds about 800 lines and changes 37 files.

If only the API needs to be modified, can we merge this PR first and discuss the changes to Frontend and documentation in #25524?

Otherwise, I would have to resolve conflicts and rerun tests every day.

Member

@DarkLight1337 DarkLight1337 left a comment


Please update this page https://docs.vllm.ai/en/latest/models/pooling_models.html#model-conversion so it no longer uses the encode task

@DarkLight1337
Member

I'll delay the merge of this PR until after the release so we don't have to worry about back-compatibility issues which further complicate future PRs
