
Conversation

noooop
Contributor

@noooop noooop commented Sep 22, 2025

Purpose

After #21227 landed, we hope pooling models can always use all pooling, without users having to enable it manually.

The current encode API (the /pooling API) mainly targets the classify-each-token scenario (e.g. TokenClassification #24872 & reward models) and overlooks the embed-each-token scenario.

Let's support the embed-each-token scenario (multi-vector retrieval).

Partially fixes #25165

We are stepping closer to supporting ColBERT & ColPali.
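For context, multi-vector retrieval in the ColBERT style keeps one embedding per token and scores a query against a document via MaxSim (late interaction). A minimal pure-Python sketch of that scoring, purely illustrative and not vLLM code:

```python
def maxsim_score(query_vecs, doc_vecs):
    # ColBERT-style late interaction: for every query token embedding,
    # take the maximum dot product over all document token embeddings,
    # then sum these maxima into a single relevance score.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]  # 2 query tokens, dim 2
doc = [[0.8, 0.2], [0.1, 0.9]]    # 2 document tokens, dim 2
print(maxsim_score(query, doc))   # 0.8 + 0.9
```

This is exactly the scenario that needs per-token embeddings rather than a single pooled vector per prompt.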

cc @DarkLight1337 @maxdebayser

(Slight) Breaking change

  • Split the encode task into two tasks: token_embed and token_classify
    • token_embed is the same as embed, using normalize as the activation
    • token_classify is the same as classify, using softmax as the default activation (classify and token_classify can actually use any activation function via act_fn)
    • Use the following code for compatibility:
def encode2pooling_task(supported_tasks):
    # Currently no model supports both token_embed and token_classify.
    if "token_embed" in supported_tasks:
        return "token_embed"
    elif "token_classify" in supported_tasks:
        return "token_classify"
    else:
        raise ValueError(f"pooling_task must be one of {supported_tasks}.")
  • Completely remove softmax from PoolingParams and prefer using activation, since classify and token_classify can actually use any activation function.
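As a quick sanity check, the compatibility mapping above resolves the legacy encode task like this (the supported_tasks sets below are made up for illustration):

```python
def encode2pooling_task(supported_tasks):
    # Currently no model supports both token_embed and token_classify.
    if "token_embed" in supported_tasks:
        return "token_embed"
    elif "token_classify" in supported_tasks:
        return "token_classify"
    else:
        raise ValueError(f"pooling_task must be one of {supported_tasks}.")

# An embedding model would advertise token_embed, a reward or
# token-classification model token_classify (hypothetical task sets):
print(encode2pooling_task({"embed", "token_embed"}))        # token_embed
print(encode2pooling_task({"classify", "token_classify"}))  # token_classify
```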

By the way, in #20538:

Consider only the activation parameter for classification (score) models
Consider only the softmax parameter for PoolingType.ALL reward models
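For reference, softmax is just one possible per-token activation over the class logits; a minimal sketch of what it computes (illustrative, not vLLM's implementation):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's class logits:
    # subtract the max before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print(probs)  # probabilities summing to 1, largest for the largest logit
```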

Test Plan

tests/models/language/pooling/test_multi_vector_retrieval.py
tests/test_pooling_params.py

Test Result

pass

Known Issues

  • Maybe we should find a way to support chunked prefill + all pooling (and mean pooling)
  • Support ColBERT & ColPali

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added documentation Improvements or additions to documentation qwen Related to Qwen models labels Sep 22, 2025
Signed-off-by: wang.yuqi <[email protected]>
@noooop
Contributor Author

noooop commented Sep 22, 2025

@jupyterjazz

Try this:

from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v4-vllm-text-matching",
    enforce_eager=True,
    max_model_len=1024,
    enable_chunked_prefill=False,  # <- In order to use the encode api
    runner="pooling")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    print(len(embeds))

outputs = llm.encode(prompts, pooling_task="encode")

for prompt, output in zip(prompts, outputs):
    multi_vector = output.outputs.data
    print(multi_vector.shape)

Are you OK with this API and its outputs?


There are still some broken features that need to be fixed, but the multi_vector feature is now testable.

@mergify mergify bot added the frontend label Sep 23, 2025
@mergify mergify bot added the v1 label Sep 23, 2025
@noooop
Contributor Author

noooop commented Sep 23, 2025

@DarkLight1337

Ready for review


@jupyterjazz

jupyterjazz commented Sep 23, 2025

Hi @noooop,

I just tested and it works fine. The only thing missing was override_pooler_config=PoolerConfig(normalize=False). I don't have a strong opinion on this, but setting normalization to False by default during encode could make sense because oftentimes you would want the actual last hidden layer at this step. Other than this, the API looks good to me. Thank you for such a quick fix!

Comment on lines 140 to 143

    def _set_default_parameters(self, model_config: Optional["ModelConfig"]):
-       if self.task == "embed":
+       if self.task in ["embed", "token_embed"]:
            if self.normalize is None:
                self.normalize = True
Contributor Author

@noooop noooop Sep 24, 2025


@DarkLight1337

but setting normalization to False by default during encode could make sense because oftentimes you would want the actual last hidden layer at this step.

Do you prefer to have both embed and token_embed use normalize = True by default, or to have token_embed use normalize = False by default?

Personally, I hope that all similar APIs behave consistently to reduce the learning burden on users.

Or we could add a new API for getting the last hidden states, with normalize = False by default. (Though wouldn't adding too many specialized APIs also increase the user burden and maintenance costs?)


@jupyterjazz

After #20538 landed, you can control whether to normalize via pooling_params = PoolingParams(normalize=False) offline, or via the normalize parameter per request online.

So you don't need override_pooler_config=PoolerConfig(normalize=False) at setup time; you can set it at any time.
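For reference, the normalize flag here means L2-normalizing each output vector to unit length; a minimal sketch of the effect (illustrative, not vLLM's implementation):

```python
import math

def l2_normalize(vec):
    # Scale the vector so its Euclidean (L2) norm is 1, which is what
    # normalize=True does to each embedding vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

With normalize=False you would instead get the raw hidden states back.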

Comment on lines 1726 to +1735

        engine_client,
        vllm_config,
        state.openai_serving_models,
+       pooling_task=encode2pooling_task(supported_tasks),
        request_logger=request_logger,
        chat_template=resolved_chat_template,
        chat_template_content_format=args.chat_template_content_format,
        log_error_stack=args.log_error_stack,
-   ) if "encode" in supported_tasks else None
+   ) if ("token_embed" in supported_tasks
+         or "token_classify" in supported_tasks) else None
Contributor Author

@noooop noooop Sep 24, 2025


@DarkLight1337

IMO we should treat each task separately. Having sub-tasks just makes things more confusing
The user should pass in the task explicitly

For offline scenarios, we can add llm.token_embed and llm.token_classify, and gradually discourage users from calling llm.encode directly.

For online scenarios, we need to adaptively select token_embed or token_classify somewhere, using something like the encode2pooling_task method, unless you want to split the /pooling API into /pooling_token_embed and /pooling_token_classify.

I personally feel that /pooling_token_embed and /pooling_token_classify look terrible; the online /pooling API is not ready for major changes yet. We can collect usage scenarios for a while.


This PR adds about 800 lines and changes 37 files.

If only the API needs to be modified, can we merge this PR first and discuss the changes to Frontend and documentation in #25524?

Otherwise, I would have to resolve conflicts and rerun tests every day.

Member

@DarkLight1337 DarkLight1337 left a comment


Please update this page https://docs.vllm.ai/en/latest/models/pooling_models.html#model-conversion so it no longer uses the encode task

@DarkLight1337
Member

I'll delay the merge of this PR until after the release so we don't have to worry about back-compatibility issues which further complicate future PRs
