
Conversation


@sjrl sjrl commented Jun 27, 2025

Related Issues

Proposed Changes:

  • Add LLMDocumentContentExtractor to enable Vision-based LLMs to describe/convert an image into text.
  • This enables retrieval with text-only methods (e.g. BM25 + text embedders), since images now have a textual representation.
  • At query time, the attached example uses both the text and the image of the image-based docs, and only the text of the text-based docs. This highlights that at query time we remain flexible and can choose whether to use the image of the image-based docs to answer a user's question.
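The extractor's role can be sketched in a few lines of plain Python. This is a conceptual sketch only, with hypothetical names (`extract_content`, `caption_fn`), not the component's real API: each image document is sent to a vision LLM, and the returned description becomes the document's text content, so plain text retrieval can find it.

```python
# Conceptual sketch only -- hypothetical names, not the component's real API.
def extract_content(documents, caption_fn):
    for doc in documents:
        # caption_fn stands in for a vision LLM call (e.g. gpt-4.1-mini)
        doc["content"] = caption_fn(doc["file_path"])
    return documents

docs = extract_content(
    [{"file_path": "images/apple.jpg", "content": None}],
    caption_fn=lambda path: f"[img-caption]Description of {path}[/img-caption]",
)
# docs[0]["content"] now holds a textual description of the image
```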
Indexing Example
import os
from pathlib import Path
from typing import List

# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from haystack import Pipeline
from haystack.components.converters.docx import DOCXToDocument
from haystack.components.converters.pypdf import PyPDFToDocument
from haystack.components.embedders.sentence_transformers_document_embedder import SentenceTransformersDocumentEmbedder
from haystack.components.embedders.sentence_transformers_text_embedder import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner, ListJoiner
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.writers.document_writer import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from haystack_experimental.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack_experimental.components.converters.image.document_to_image import DocumentToImageContent
from haystack_experimental.components.converters.image.image_to_document import ImageFileToDocument
from haystack_experimental.components.extractors.llm_document_content_extractor import LLMDocumentContentExtractor
from haystack_experimental.components.generators.chat.openai import OpenAIChatGenerator
from haystack_experimental.components.routers.document_length_router import DocumentLengthRouter
from haystack_experimental.components.routers.document_type_router import DocumentTypeRouter

#
# Indexing documents
#

document_store = InMemoryDocumentStore()

file_type_router = FileTypeRouter(
    mime_types=[
        "application/pdf",
        "image/jpeg",
        "image/png",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ]
)
pdf_converter = PyPDFToDocument(store_full_path=True)
pdf_splitter = DocumentSplitter(split_by="page", split_length=1)
doc_length_router = DocumentLengthRouter(threshold=10)
docx_converter = DOCXToDocument(store_full_path=True)
image_source_joiner = ListJoiner(list_type_=List[Path])
image_converter = ImageFileToDocument(store_full_path=True)
image_doc_joiner = DocumentJoiner(sort_by_score=False)
content_extractor = LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator(model="gpt-4.1-mini"))
text_doc_embedder = SentenceTransformersDocumentEmbedder(
    model="mixedbread-ai/mxbai-embed-large-v1",
    progress_bar=False,
)
final_doc_joiner = DocumentJoiner(sort_by_score=False)
document_writer = DocumentWriter(document_store=document_store)

# Create the Indexing pipeline
indexing_pipe = Pipeline()
indexing_pipe.add_component("file_type_router", file_type_router)
indexing_pipe.add_component("pdf_converter", pdf_converter)
indexing_pipe.add_component("pdf_splitter", pdf_splitter)
indexing_pipe.add_component("doc_length_router", doc_length_router)
indexing_pipe.add_component("docx_converter", docx_converter)
indexing_pipe.add_component("image_source_joiner", image_source_joiner)
indexing_pipe.add_component("image_converter", image_converter)
indexing_pipe.add_component("image_doc_joiner", image_doc_joiner)
indexing_pipe.add_component("content_extractor", content_extractor)
indexing_pipe.add_component("text_doc_embedder", text_doc_embedder)
indexing_pipe.add_component("final_doc_joiner", final_doc_joiner)
indexing_pipe.add_component("document_writer", document_writer)

indexing_pipe.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "doc_length_router.documents")
# The short PDF pages will be enriched/captioned along with the other images
indexing_pipe.connect("doc_length_router.short_documents", "image_doc_joiner.documents")
indexing_pipe.connect("doc_length_router.long_documents", "final_doc_joiner.documents")
indexing_pipe.connect(
    "file_type_router.application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "docx_converter.sources"
)
indexing_pipe.connect("docx_converter.documents", "final_doc_joiner.documents")
indexing_pipe.connect("file_type_router.image/jpeg", "image_source_joiner.values")
indexing_pipe.connect("file_type_router.image/png", "image_source_joiner.values")
indexing_pipe.connect("image_source_joiner.values", "image_converter.sources")
indexing_pipe.connect("image_converter.documents", "image_doc_joiner.documents")
indexing_pipe.connect("image_doc_joiner.documents", "content_extractor.documents")
indexing_pipe.connect("content_extractor.documents", "final_doc_joiner.documents")
indexing_pipe.connect("final_doc_joiner.documents", "text_doc_embedder.documents")
indexing_pipe.connect("text_doc_embedder.documents", "document_writer.documents")

# Run the indexing pipeline with sources
# NOTE: the pipeline ends with a DocumentWriter, so the result only reports
# the number of documents written.
indexing_result = indexing_pipe.run(
    data={
        "file_type_router": {
            # These can be strings, Paths, or ByteStreams.
            "sources": [
                "test/test_files/pdf/sample_pdf_1.pdf",
                "test/test_files/images/apple.jpg",
                "test/test_files/docx/sample_docx.docx",
                "test/test_files/images/haystack-logo.png",
            ]
        }
    },
)

# Inspect the documents
indexed_documents = document_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents:")
print(indexed_documents)


#
# Querying the documents
#

# Create the Retrieval + Query pipeline
text_embedder = SentenceTransformersTextEmbedder(
    model="mixedbread-ai/mxbai-embed-large-v1",
    progress_bar=False
)
retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=2)
doc_type_router = DocumentTypeRouter(
    file_path_meta_field="file_path",
    mime_types=["image/jpeg", "application/pdf"]
)
doc_joiner = DocumentJoiner(sort_by_score=False)
doc_to_image = DocumentToImageContent(detail="auto")
chat_prompt_builder = ChatPromptBuilder(
    required_variables=["question"],
    template="""{% message role="system" %}
You are a friendly assistant that answers questions based on provided documents.
{% endmessage %}

{%- message role="user" -%}
Only provide an answer to the question using the images and text passages provided.

These are the text-only documents that have no image counterpart:
{%- if text_documents|length > 0 %}
{%- for doc in text_documents %}
Text Document [{{ loop.index }}] :
{{ doc.content }}
{% endfor -%}
{%- else %}
No relevant text documents were found.
{% endif %}
End of text documents.


These are the text version of the documents that also have an image counterpart:
{%- if documents|length > 0 %}
{%- for doc in documents %}
Image Document [{{ loop.index }}] :
Relates to image: [{{ loop.index }}]
{{ doc.content }}
{% endfor -%}
{%- else %}
No relevant image documents were found.
{% endif %}
End of text version of documents that also have an image counterpart.


Question: {{ question }}
Answer:

{%- if image_contents|length > 0 %}
{%- for img in image_contents -%}
  {{ img | templatize_part }}
{%- endfor -%}
{% endif %}
{%- endmessage -%}
""",
)
llm = OpenAIChatGenerator()

# Create the pipeline
pipe = Pipeline()
pipe.add_component("text_embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("doc_type_router", doc_type_router)
pipe.add_component("doc_joiner", doc_joiner)
pipe.add_component("doc_to_image", doc_to_image)
pipe.add_component("chat_prompt_builder", chat_prompt_builder)
pipe.add_component("llm", llm)

pipe.connect("text_embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "doc_type_router.documents")
pipe.connect("doc_type_router.image/jpeg", "doc_joiner.documents")
pipe.connect("doc_type_router.application/pdf", "doc_joiner.documents")
pipe.connect("doc_joiner.documents", "doc_to_image.documents")
pipe.connect("doc_to_image.image_contents", "chat_prompt_builder.image_contents")
pipe.connect("doc_joiner.documents", "chat_prompt_builder.documents")
pipe.connect("doc_type_router.unclassified", "chat_prompt_builder.text_documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")

# Run the pipeline with a query about the apple image
query = "What is the color of the background of the image with an apple in it?"
result = pipe.run(
    data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}},
    include_outputs_from={"chat_prompt_builder"},
)
print(result["llm"]["replies"][0].text)

# Run the pipeline with a query about the docx document
query = "How many confirmed corona cases are there in the US?"
result = pipe.run(
    data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}},
    include_outputs_from={"chat_prompt_builder"},
)
print(result["llm"]["replies"][0].text)
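The `doc_type_router` step in the query pipeline above splits retrieved documents by the MIME type of their source file. The idea can be sketched with the standard-library `mimetypes` module; this is hypothetical code illustrating the routing concept, not the Haystack `DocumentTypeRouter` implementation:

```python
# Sketch of mime-type routing (hypothetical, not the Haystack source):
# group documents by the MIME type guessed from their "file_path" metadata;
# anything outside the configured types goes to an "unclassified" route.
import mimetypes
from collections import defaultdict

def route_by_mime_type(documents, mime_types):
    routes = defaultdict(list)
    for doc in documents:
        mime, _ = mimetypes.guess_type(doc["file_path"])
        key = mime if mime in mime_types else "unclassified"
        routes[key].append(doc)
    return dict(routes)

docs = [
    {"file_path": "images/apple.jpg"},
    {"file_path": "notes/readme.txt"},
    {"file_path": "pdf/sample_pdf_1.pdf"},
]
routes = route_by_mime_type(docs, ["image/jpeg", "application/pdf"])
# apple.jpg and sample_pdf_1.pdf are routed by type; readme.txt is unclassified
```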
Sample output from Indexing Example
# Indexed 7 documents:
# [Document(id=cb824059353053a4ed7796e49c5511e13fecf14a3908b569d7fba1441c394696, content: 'Sample Docx File
# 
# The US has "passed the peak" on new coronavirus cases, President Donald Trump said...', meta: {'file_path': 'test/test_files/docx/sample_docx.docx', 'docx': {'author': 'Saha, Anirban', 'category': '', 'comments': '', 'content_status': '', 'created': '2020-07-14T08:14:00+00:00', 'identifier': '', 'keywords': '', 'language': '', 'last_modified_by': 'Saha, Anirban', 'last_printed': None, 'modified': '2020-07-14T08:16:00+00:00', 'revision': 1, 'subject': '', 'title': '', 'version': ''}}, embedding: vector of size 1024), Document(id=7613e6af838096c1623ed9505f2211bcd2458879826d6d5b152c610bf17f6099, content: 'A sample PDF file 
# History and standardization
# Format (PDF) Adobe Systems made the PDF specification a...', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}, embedding: vector of size 1024), Document(id=e5def1cf1a24027b36d41165c146c2f6fc13903139bf8ad3579aaf3d2623cfe9, content: 'Page 2 of Sample PDF
 
#                                                                   ', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 2, 'split_id': 1, 'split_idx_start': 1587}, embedding: vector of size 1024), Document(id=6764ebdef00a84ec84156167e378af3f9413029aff69df3250780e232e1bff7f, content: 'Page 4 of Sample PDF 
# … the page 3 is empty.', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 4, 'split_id': 3, 'split_idx_start': 1611}, embedding: vector of size 1024), Document(id=398ad15cc64aa559c908750e338b99006cea1e09b3f16158b981204ce6fbf408, content: '[img-caption]Close-up photo of an apple with a reddish-pink and green skin, lying on a bed of straw....', meta: {'file_path': 'test/test_files/images/apple.jpg'}, embedding: vector of size 1024), Document(id=0fe1205479c2a925f9ca8e7a788db1c64db50f9aae2f254bdb988ddfefa0b530, content: '```
# haystack
# by deepset
# ```
# 
# [img-caption]A logo consisting of a teal rectangular icon with a styliz...', meta: {'file_path': 'test/test_files/images/haystack-logo.png'}, embedding: vector of size 1024), Document(id=529edd58acb4592cdb86cdccb53a6ce4aad5802d1638d5de9c3bce40ab90802d, content: '[img-caption]Blank white page with no visible text or visual content.[/img-caption]', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 3, 'split_id': 2, 'split_idx_start': 1610}, embedding: vector of size 1024)]
# The background color of the image with the apple in it is beige or light brown, resembling straw.
# There are over 637,000 confirmed COVID-19 cases in the US.
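The sample output shows why the `doc_length_router` step matters: PDF page 3 had no extractable text, so it was routed to the vision LLM and came back with an `[img-caption]` description. A minimal sketch of that length-based routing, with hypothetical names (the exact semantics of `DocumentLengthRouter` may differ):

```python
# Hypothetical sketch of length-based routing (threshold=10, as in the
# indexing pipeline): pages with little or no extracted text (typically
# scanned or image-only pages) are sent for captioning, while text-rich
# pages keep their extracted text.
def route_by_length(documents, threshold=10):
    short, long_ = [], []
    for doc in documents:
        content = (doc.get("content") or "").strip()
        (short if len(content) <= threshold else long_).append(doc)
    return {"short_documents": short, "long_documents": long_}

pages = [
    {"content": "A sample PDF file with plenty of extracted text."},
    {"content": "   "},  # scanned page: no extractable text
]
routes = route_by_length(pages)
# the text-rich page goes to long_documents, the empty one to short_documents
```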
Indexing Pipeline Graph

(indexing pipeline graph image)

Query Pipeline Graph

(query pipeline graph image)

How did you test it?

Notes for the reviewer

Checklist

@coveralls

coveralls commented Jun 27, 2025

Pull Request Test Coverage Report for Build 16006161725

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.5%) to 88.671%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/breakpoint.py      16                  58.24%
core/pipeline/pipeline.py        19                  66.23%

Totals
Change from base Build 15960921868: +0.5%
Covered Lines: 1448
Relevant Lines: 1633

💛 - Coveralls

@sjrl sjrl changed the title feat: Add LLMContentExtractor to enable Vision-based LLMs to describe/convert an image into text feat: Add LLMDocumentContentExtractor to enable Vision-based LLMs to describe/convert an image into text Jun 27, 2025
@sjrl sjrl requested a review from anakin87 June 30, 2025 11:45

@anakin87 anakin87 left a comment


The overall design looks good.

I left two initial minor comments.

@sjrl sjrl marked this pull request as ready for review July 1, 2025 09:08
@sjrl sjrl requested a review from a team as a code owner July 1, 2025 09:08
@sjrl sjrl requested review from Amnah199 and anakin87 and removed request for a team and Amnah199 July 1, 2025 09:08

@anakin87 anakin87 left a comment


I left a few minor comments


@anakin87 anakin87 left a comment


Looks good!

@sjrl sjrl merged commit 09f68a9 into main Jul 2, 2025
10 checks passed
@sjrl sjrl deleted the document-captioner branch July 2, 2025 07:24
Development

Successfully merging this pull request may close these issues.

Add DocumentCaptioner: takes in Image Documents and returns same Documents with an image enhanced with text description
3 participants