
Conversation


@sjrl sjrl commented Jun 27, 2025

Related Issues

Proposed Changes:

  • Add LLMDocumentContentExtractor to enable Vision-based LLMs to describe/convert an image into text.
  • This enables retrieval with text-only methods (e.g. BM25 + text embedders), since images now have a textual representation.
  • At query time, the attached example uses both the text and the image of the image-based docs, and only the text of the text-based docs. This highlights that at query time we remain flexible and can choose whether to use the image of the image-based docs to answer a user's question.
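The extractor's role can be sketched in a few lines of plain Python. This is a conceptual sketch only, with hypothetical names (`extract_content`, `caption_fn`), not the component's real API: each image document is sent to a vision LLM, and the returned description becomes the document's text content, so plain text retrieval can find it.

```python
# Conceptual sketch only -- hypothetical names, not the component's real API.
def extract_content(documents, caption_fn):
    for doc in documents:
        # caption_fn stands in for a vision LLM call (e.g. gpt-4.1-mini)
        doc["content"] = caption_fn(doc["file_path"])
    return documents

docs = extract_content(
    [{"file_path": "images/apple.jpg", "content": None}],
    caption_fn=lambda path: f"[img-caption]Description of {path}[/img-caption]",
)
# docs[0]["content"] now holds a textual description of the image
```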
Indexing Example
import os
from pathlib import Path
from typing import List

# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from haystack import Pipeline
from haystack.components.converters.docx import DOCXToDocument
from haystack.components.converters.pypdf import PyPDFToDocument
from haystack.components.embedders.sentence_transformers_document_embedder import SentenceTransformersDocumentEmbedder
from haystack.components.embedders.sentence_transformers_text_embedder import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner, ListJoiner
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.writers.document_writer import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

from haystack_experimental.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack_experimental.components.converters.image.document_to_image import DocumentToImageContent
from haystack_experimental.components.converters.image.image_to_document import ImageFileToDocument
from haystack_experimental.components.extractors.llm_document_content_extractor import LLMDocumentContentExtractor
from haystack_experimental.components.generators.chat.openai import OpenAIChatGenerator
from haystack_experimental.components.routers.document_length_router import DocumentLengthRouter
from haystack_experimental.components.routers.document_type_router import DocumentTypeRouter

#
# Indexing documents
#

document_store = InMemoryDocumentStore()

file_type_router = FileTypeRouter(
    mime_types=[
        "application/pdf",
        "image/jpeg",
        "image/png",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ]
)
pdf_converter = PyPDFToDocument(store_full_path=True)
pdf_splitter = DocumentSplitter(split_by="page", split_length=1)
doc_length_router = DocumentLengthRouter(threshold=10)
docx_converter = DOCXToDocument(store_full_path=True)
image_source_joiner = ListJoiner(list_type_=List[Path])
image_converter = ImageFileToDocument(store_full_path=True)
image_doc_joiner = DocumentJoiner(sort_by_score=False)
content_extractor = LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator(model="gpt-4.1-mini"))
text_doc_embedder = SentenceTransformersDocumentEmbedder(
    model="mixedbread-ai/mxbai-embed-large-v1",
    progress_bar=False,
)
final_doc_joiner = DocumentJoiner(sort_by_score=False)
document_writer = DocumentWriter(document_store=document_store)

# Create the Indexing pipeline
indexing_pipe = Pipeline()
indexing_pipe.add_component("file_type_router", file_type_router)
indexing_pipe.add_component("pdf_converter", pdf_converter)
indexing_pipe.add_component("pdf_splitter", pdf_splitter)
indexing_pipe.add_component("doc_length_router", doc_length_router)
indexing_pipe.add_component("docx_converter", docx_converter)
indexing_pipe.add_component("image_source_joiner", image_source_joiner)
indexing_pipe.add_component("image_converter", image_converter)
indexing_pipe.add_component("image_doc_joiner", image_doc_joiner)
indexing_pipe.add_component("content_extractor", content_extractor)
indexing_pipe.add_component("text_doc_embedder", text_doc_embedder)
indexing_pipe.add_component("final_doc_joiner", final_doc_joiner)
indexing_pipe.add_component("document_writer", document_writer)

indexing_pipe.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "doc_length_router.documents")
# The short PDF pages will be enriched/captioned along with the other images
indexing_pipe.connect("doc_length_router.short_documents", "image_doc_joiner.documents")
indexing_pipe.connect("doc_length_router.long_documents", "final_doc_joiner.documents")
indexing_pipe.connect(
    "file_type_router.application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "docx_converter.sources"
)
indexing_pipe.connect("docx_converter.documents", "final_doc_joiner.documents")
indexing_pipe.connect("file_type_router.image/jpeg", "image_source_joiner.values")
indexing_pipe.connect("file_type_router.image/png", "image_source_joiner.values")
indexing_pipe.connect("image_source_joiner.values", "image_converter.sources")
indexing_pipe.connect("image_converter.documents", "image_doc_joiner.documents")
indexing_pipe.connect("image_doc_joiner.documents", "content_extractor.documents")
indexing_pipe.connect("content_extractor.documents", "final_doc_joiner.documents")
indexing_pipe.connect("final_doc_joiner.documents", "text_doc_embedder.documents")
indexing_pipe.connect("text_doc_embedder.documents", "document_writer.documents")

# Run the indexing pipeline with sources
# NOTE: the pipeline ends with a DocumentWriter, so the result only reports
# the number of documents written.
indexing_result = indexing_pipe.run(
    data={
        "file_type_router": {
            # These can be strings, Paths, or ByteStreams.
            "sources": [
                "test/test_files/pdf/sample_pdf_1.pdf",
                "test/test_files/images/apple.jpg",
                "test/test_files/docx/sample_docx.docx",
                "test/test_files/images/haystack-logo.png",
            ]
        }
    },
)

# Inspect the documents
indexed_documents = document_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents:")
print(indexed_documents)


#
# Querying the documents
#

# Create the Retrieval + Query pipeline
text_embedder = SentenceTransformersTextEmbedder(
    model="mixedbread-ai/mxbai-embed-large-v1",
    progress_bar=False
)
retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=2)
doc_type_router = DocumentTypeRouter(
    file_path_meta_field="file_path",
    mime_types=["image/jpeg", "application/pdf"]
)
doc_joiner = DocumentJoiner(sort_by_score=False)
doc_to_image = DocumentToImageContent(detail="auto")
chat_prompt_builder = ChatPromptBuilder(
    required_variables=["question"],
    template="""{% message role="system" %}
You are a friendly assistant that answers questions based on provided documents.
{% endmessage %}

{%- message role="user" -%}
Only provide an answer to the question using the images and text passages provided.

These are the text-only documents that have no image counterpart:
{%- if text_documents|length > 0 %}
{%- for doc in text_documents %}
Text Document [{{ loop.index }}] :
{{ doc.content }}
{% endfor -%}
{%- else %}
No relevant text documents were found.
{% endif %}
End of text documents.


These are the text version of the documents that also have an image counterpart:
{%- if documents|length > 0 %}
{%- for doc in documents %}
Image Document [{{ loop.index }}] :
Relates to image: [{{ loop.index }}]
{{ doc.content }}
{% endfor -%}
{%- else %}
No relevant image documents were found.
{% endif %}
End of text version of documents that also have an image counterpart.


Question: {{ question }}
Answer:

{%- if image_contents|length > 0 %}
{%- for img in image_contents -%}
  {{ img | templatize_part }}
{%- endfor -%}
{% endif %}
{%- endmessage -%}
""",
)
llm = OpenAIChatGenerator()

# Create the pipeline
pipe = Pipeline()
pipe.add_component("text_embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("doc_type_router", doc_type_router)
pipe.add_component("doc_joiner", doc_joiner)
pipe.add_component("doc_to_image", doc_to_image)
pipe.add_component("chat_prompt_builder", chat_prompt_builder)
pipe.add_component("llm", llm)

pipe.connect("text_embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "doc_type_router.documents")
pipe.connect("doc_type_router.image/jpeg", "doc_joiner.documents")
pipe.connect("doc_type_router.application/pdf", "doc_joiner.documents")
pipe.connect("doc_joiner.documents", "doc_to_image.documents")
pipe.connect("doc_to_image.image_contents", "chat_prompt_builder.image_contents")
pipe.connect("doc_joiner.documents", "chat_prompt_builder.documents")
pipe.connect("doc_type_router.unclassified", "chat_prompt_builder.text_documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")

# Run the pipeline with a query about the apple image
query = "What is the color of the background of the image with an apple in it?"
result = pipe.run(
    data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}},
    include_outputs_from={"chat_prompt_builder"},
)
print(result["llm"]["replies"][0].text)

# Run the pipeline with a query about the docx document
query = "How many confirmed corona cases are there in the US?"
result = pipe.run(
    data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}},
    include_outputs_from={"chat_prompt_builder"},
)
print(result["llm"]["replies"][0].text)
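The `doc_type_router` step in the query pipeline above splits retrieved documents by the MIME type of their source file. The idea can be sketched with the standard-library `mimetypes` module; this is hypothetical code illustrating the routing concept, not the Haystack `DocumentTypeRouter` implementation:

```python
# Sketch of mime-type routing (hypothetical, not the Haystack source):
# group documents by the MIME type guessed from their "file_path" metadata;
# anything outside the configured types goes to an "unclassified" route.
import mimetypes
from collections import defaultdict

def route_by_mime_type(documents, mime_types):
    routes = defaultdict(list)
    for doc in documents:
        mime, _ = mimetypes.guess_type(doc["file_path"])
        key = mime if mime in mime_types else "unclassified"
        routes[key].append(doc)
    return dict(routes)

docs = [
    {"file_path": "images/apple.jpg"},
    {"file_path": "notes/readme.txt"},
    {"file_path": "pdf/sample_pdf_1.pdf"},
]
routes = route_by_mime_type(docs, ["image/jpeg", "application/pdf"])
# apple.jpg and sample_pdf_1.pdf are routed by type; readme.txt is unclassified
```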
Sample output from Indexing Example
# Indexed 7 documents:
# [Document(id=cb824059353053a4ed7796e49c5511e13fecf14a3908b569d7fba1441c394696, content: 'Sample Docx File
# 
# The US has "passed the peak" on new coronavirus cases, President Donald Trump said...', meta: {'file_path': 'test/test_files/docx/sample_docx.docx', 'docx': {'author': 'Saha, Anirban', 'category': '', 'comments': '', 'content_status': '', 'created': '2020-07-14T08:14:00+00:00', 'identifier': '', 'keywords': '', 'language': '', 'last_modified_by': 'Saha, Anirban', 'last_printed': None, 'modified': '2020-07-14T08:16:00+00:00', 'revision': 1, 'subject': '', 'title': '', 'version': ''}}, embedding: vector of size 1024), Document(id=7613e6af838096c1623ed9505f2211bcd2458879826d6d5b152c610bf17f6099, content: 'A sample PDF file 
# History and standardization
# Format (PDF) Adobe Systems made the PDF specification a...', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}, embedding: vector of size 1024), Document(id=e5def1cf1a24027b36d41165c146c2f6fc13903139bf8ad3579aaf3d2623cfe9, content: 'Page 2 of Sample PDF
 
#                                                                   ', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 2, 'split_id': 1, 'split_idx_start': 1587}, embedding: vector of size 1024), Document(id=6764ebdef00a84ec84156167e378af3f9413029aff69df3250780e232e1bff7f, content: 'Page 4 of Sample PDF 
# … the page 3 is empty.', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 4, 'split_id': 3, 'split_idx_start': 1611}, embedding: vector of size 1024), Document(id=398ad15cc64aa559c908750e338b99006cea1e09b3f16158b981204ce6fbf408, content: '[img-caption]Close-up photo of an apple with a reddish-pink and green skin, lying on a bed of straw....', meta: {'file_path': 'test/test_files/images/apple.jpg'}, embedding: vector of size 1024), Document(id=0fe1205479c2a925f9ca8e7a788db1c64db50f9aae2f254bdb988ddfefa0b530, content: '```
# haystack
# by deepset
# ```
# 
# [img-caption]A logo consisting of a teal rectangular icon with a styliz...', meta: {'file_path': 'test/test_files/images/haystack-logo.png'}, embedding: vector of size 1024), Document(id=529edd58acb4592cdb86cdccb53a6ce4aad5802d1638d5de9c3bce40ab90802d, content: '[img-caption]Blank white page with no visible text or visual content.[/img-caption]', meta: {'file_path': 'test/test_files/pdf/sample_pdf_1.pdf', 'source_id': 'a9beeb8448bbf5e11081296d68f9cba13b407a913a14f4ce5fe033599200c936', 'page_number': 3, 'split_id': 2, 'split_idx_start': 1610}, embedding: vector of size 1024)]
# The background color of the image with the apple in it is beige or light brown, resembling straw.
# There are over 637,000 confirmed COVID-19 cases in the US.
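The sample output shows why the `doc_length_router` step matters: PDF page 3 had no extractable text, so it was routed to the vision LLM and came back with an `[img-caption]` description. A minimal sketch of that length-based routing, with hypothetical names (the exact semantics of `DocumentLengthRouter` may differ):

```python
# Hypothetical sketch of length-based routing (threshold=10, as in the
# indexing pipeline): pages with little or no extracted text (typically
# scanned or image-only pages) are sent for captioning, while text-rich
# pages keep their extracted text.
def route_by_length(documents, threshold=10):
    short, long_ = [], []
    for doc in documents:
        content = (doc.get("content") or "").strip()
        (short if len(content) <= threshold else long_).append(doc)
    return {"short_documents": short, "long_documents": long_}

pages = [
    {"content": "A sample PDF file with plenty of extracted text."},
    {"content": "   "},  # scanned page: no extractable text
]
routes = route_by_length(pages)
# the text-rich page goes to long_documents, the empty one to short_documents
```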
Indexing Pipeline Graph

(indexing pipeline graph image)

Query Pipeline Graph

(query pipeline graph image)

How did you test it?

Notes for the reviewer

Checklist

@coveralls

coveralls commented Jun 27, 2025

Pull Request Test Coverage Report for Build 16006161725

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 35 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.5%) to 88.671%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/breakpoint.py      16                  58.24%
core/pipeline/pipeline.py        19                  66.23%

Totals
Change from base Build 15960921868: +0.5%
Covered Lines: 1448
Relevant Lines: 1633

💛 - Coveralls

@sjrl sjrl changed the title feat: Add LLMContentExtractor to enable Vision-based LLMs to describe/convert an image into text feat: Add LLMDocumentContentExtractor to enable Vision-based LLMs to describe/convert an image into text Jun 27, 2025
@sjrl sjrl requested a review from anakin87 June 30, 2025 11:45

@anakin87 anakin87 left a comment


The overall design looks good.

I left two initial minor comments.

@sjrl sjrl marked this pull request as ready for review July 1, 2025 09:08
@sjrl sjrl requested a review from a team as a code owner July 1, 2025 09:08
@sjrl sjrl requested review from Amnah199 and anakin87 and removed request for a team and Amnah199 July 1, 2025 09:08

@anakin87 anakin87 left a comment


I left a few minor comments


@anakin87 anakin87 left a comment


Looks good!

@sjrl sjrl merged commit 09f68a9 into main Jul 2, 2025
10 checks passed
@sjrl sjrl deleted the document-captioner branch July 2, 2025 07:24
Development

Successfully merging this pull request may close these issues.

Add DocumentCaptioner: takes in Image Documents and returns same Documents with an image enhanced with text description
3 participants