Commit 9ccc135

Merge remote-tracking branch 'origin/main' into feat/indexname
2 parents 2e8db70 + 32732c3 commit 9ccc135

45 files changed: 2511 additions and 771 deletions

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ repos:
             "types-requests",
             "sqlmodel",
             "types-Markdown",
+            types-tzlocal,
           ]
         args: ["--check-untyped-defs", "--ignore-missing-imports"]
         exclude: "^templates/"

Dockerfile

Lines changed: 11 additions & 1 deletion
@@ -46,7 +46,7 @@ RUN --mount=type=ssh \
 
 RUN --mount=type=ssh \
     --mount=type=cache,target=/root/.cache/pip \
-    if [ "$TARGETARCH" = "amd64" ]; then pip install graphrag future; fi
+    if [ "$TARGETARCH" = "amd64" ]; then pip install "graphrag<=0.3.6" future; fi
 
 # Clean up
 RUN apt-get autoremove \
@@ -81,6 +81,16 @@ RUN --mount=type=ssh \
     pip install -e "libs/kotaemon[adv]" \
     && pip install unstructured[all-docs]
 
+# Install lightRAG
+ENV USE_LIGHTRAG=true
+RUN --mount=type=ssh \
+    --mount=type=cache,target=/root/.cache/pip \
+    pip install aioboto3 nano-vectordb ollama xxhash "lightrag-hku<=0.0.8"
+
+RUN --mount=type=ssh \
+    --mount=type=cache,target=/root/.cache/pip \
+    pip install "docling<=2.5.2"
+
 # Clean up
 RUN apt-get autoremove \
     && apt-get clean \

README.md

Lines changed: 28 additions & 1 deletion
@@ -26,6 +26,8 @@ developers in mind.
 
 </div>
 
+<!-- start-intro -->
+
 ## Introduction
 
 This project serves as a functional RAG UI for both end users who want to do QA on their
@@ -187,12 +189,24 @@ documents and developers who want to build their own RAG pipeline.
 
 <details>
 
+<summary>Setup LIGHTRAG</summary>
+
+- Install LightRAG: `pip install git+https://github.com/HKUDS/LightRAG.git`
+- `LightRAG` install might introduce version conflicts, see [this issue](https://github.com/Cinnamon/kotaemon/issues/440)
+  - To quickly fix: `pip uninstall hnswlib chroma-hnswlib && pip install chroma-hnswlib`
+- Launch Kotaemon with `USE_LIGHTRAG=true` environment variable.
+- Set your default LLM & Embedding models in Resources setting and it will be recognized automatically from LightRAG.
+
+</details>
+
+<details>
+
 <summary>Setup MS GRAPHRAG</summary>
 
 - **Non-Docker Installation**: If you are not using Docker, install GraphRAG with the following command:
 
   ```shell
-  pip install graphrag future
+  pip install "graphrag<=0.3.6" future
   ```
 
 - **Setting Up API KEY**: To use the GraphRAG retriever feature, ensure you set the `GRAPHRAG_API_KEY` environment variable. You can do this directly in your environment or by adding it to a `.env` file.
@@ -204,6 +218,17 @@ documents and developers who want to build their own RAG pipeline.
 
 See [Local model setup](docs/local_model.md).
 
+### Setup multimodal document parsing (OCR, table parsing, figure extraction)
+
+These options are available:
+
+- [Azure Document Intelligence (API)](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence)
+- [Adobe PDF Extract (API)](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/)
+- [Docling (local, open-source)](https://github.com/DS4SD/docling)
+  - To use Docling, first install required dependencies: `pip install docling`
+
+Select corresponding loaders in `Settings -> Retrieval Settings -> File loader`
+
 ### Customize your application
 
 - By default, all application data is stored in the `./ktem_app_data` folder. You can back up or copy this folder to transfer your installation to a new machine.
@@ -332,6 +357,8 @@ This file provides another way to configure your models and credentials.
 
 > (more instruction WIP).
 
+<!-- end-intro -->
+
 ## Star History
 
 <a href="https://star-history.com/#Cinnamon/kotaemon&Date">

doc_env_reqs.txt

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ mkdocstrings[python]
 mkdocs-material
 mkdocs-gen-files
 mkdocs-literate-nav
-mkdocs-video
 mkdocs-git-revision-date-localized-plugin
 mkdocs-section-index
+mkdocs-include-markdown-plugin[cache]
 mdx_truly_sane_lists

docs/about.md

Lines changed: 0 additions & 3 deletions
@@ -9,6 +9,3 @@ developers in mind.
 [User Guide](https://cinnamon.github.io/kotaemon/) |
 [Developer Guide](https://cinnamon.github.io/kotaemon/development/) |
 [Feedback](https://github.com/Cinnamon/kotaemon/issues)
-
-[Dark Mode](?__theme=dark) |
-[Light Mode](?__theme=light)

docs/development/index.md

Lines changed: 5 additions & 1 deletion
@@ -1 +1,5 @@
---8<-- "README.md"
+{%
+   include-markdown "../../README.md"
+   start="<!-- start-intro -->"
+   end="<!-- end-intro -->"
+%}

flowsettings.py

Lines changed: 18 additions & 24 deletions
@@ -255,7 +255,7 @@
     "ktem.reasoning.react.ReactAgentPipeline",
     "ktem.reasoning.rewoo.RewooAgentPipeline",
 ]
-KH_REASONINGS_USE_MULTIMODAL = False
+KH_REASONINGS_USE_MULTIMODAL = config("USE_MULTIMODAL", default=False, cast=bool)
 KH_VLM_ENDPOINT = "{0}/openai/deployments/{1}/chat/completions?api-version={2}".format(
     config("AZURE_OPENAI_ENDPOINT", default=""),
     config("OPENAI_VISION_DEPLOYMENT_NAME", default="gpt-4o"),
@@ -287,41 +287,35 @@
 }
 
 USE_NANO_GRAPHRAG = config("USE_NANO_GRAPHRAG", default=False, cast=bool)
-GRAPHRAG_INDEX_TYPE = (
-    "ktem.index.file.graph.GraphRAGIndex"
-    if not USE_NANO_GRAPHRAG
-    else "ktem.index.file.graph.NanoGraphRAGIndex"
-)
+USE_LIGHTRAG = config("USE_LIGHTRAG", default=False, cast=bool)
+
+GRAPHRAG_INDEX_TYPES = ["ktem.index.file.graph.GraphRAGIndex"]
+
+if USE_NANO_GRAPHRAG:
+    GRAPHRAG_INDEX_TYPES.append("ktem.index.file.graph.NanoGraphRAGIndex")
+elif USE_LIGHTRAG:
+    GRAPHRAG_INDEX_TYPES.append("ktem.index.file.graph.LightRAGIndex")
+
 KH_INDEX_TYPES = [
     "ktem.index.file.FileIndex",
-    GRAPHRAG_INDEX_TYPE,
+    *GRAPHRAG_INDEX_TYPES,
 ]
 
-GRAPHRAG_INDEX = (
+GRAPHRAG_INDICES = [
     {
-        "name": "GraphRAG Collection",
-        "config": {
-            "supported_file_types": (
-                ".png, .jpeg, .jpg, .tiff, .tif, .pdf, .xls, .xlsx, .doc, .docx, "
-                ".pptx, .csv, .html, .mhtml, .txt, .md, .zip"
-            ),
-            "private": False,
-        },
-        "index_type": "ktem.index.file.graph.GraphRAGIndex",
-    }
-    if not USE_NANO_GRAPHRAG
-    else {
-        "name": "NanoGraphRAG Collection",
+        "name": graph_type.split(".")[-1].replace("Index", "")
+        + " Collection",  # get last name
         "config": {
             "supported_file_types": (
                 ".png, .jpeg, .jpg, .tiff, .tif, .pdf, .xls, .xlsx, .doc, .docx, "
                 ".pptx, .csv, .html, .mhtml, .txt, .md, .zip"
             ),
             "private": False,
         },
-        "index_type": "ktem.index.file.graph.NanoGraphRAGIndex",
+        "index_type": graph_type,
     }
-)
+    for graph_type in GRAPHRAG_INDEX_TYPES
+]
 
 KH_INDICES = [
     {
@@ -335,5 +329,5 @@
         },
         "index_type": "ktem.index.file.FileIndex",
     },
-    GRAPHRAG_INDEX,
+    *GRAPHRAG_INDICES,
 ]
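
Reviewer note on the `flowsettings.py` hunks: the new branch is `if ... elif`, so when both `USE_NANO_GRAPHRAG` and `USE_LIGHTRAG` are set, only the NanoGraphRAG index appears to be appended alongside the default GraphRAG one. The sketch below restates that selection and the collection-name derivation as plain functions so the behavior can be checked in isolation. The function names `graphrag_index_types` and `collection_name` are hypothetical and do not exist in the commit; only the string literals and the expression are taken from the diff.

```python
def graphrag_index_types(use_nano_graphrag: bool, use_lightrag: bool) -> list[str]:
    """Restate the flag handling from the diff (hypothetical helper)."""
    index_types = ["ktem.index.file.graph.GraphRAGIndex"]
    if use_nano_graphrag:
        index_types.append("ktem.index.file.graph.NanoGraphRAGIndex")
    elif use_lightrag:
        # Only reached when use_nano_graphrag is False.
        index_types.append("ktem.index.file.graph.LightRAGIndex")
    return index_types


def collection_name(index_type: str) -> str:
    """Same expression as the "name" key in GRAPHRAG_INDICES (hypothetical helper)."""
    return index_type.split(".")[-1].replace("Index", "") + " Collection"


if __name__ == "__main__":
    for index_type in graphrag_index_types(use_nano_graphrag=False, use_lightrag=True):
        print(collection_name(index_type))
```

If exposing both optional backends at once is the intent, two independent `if` statements would be needed instead of `elif`; whether that is desired is a question for the author.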

libs/kotaemon/kotaemon/indices/ingests/files.py

Lines changed: 5 additions & 3 deletions
@@ -13,6 +13,7 @@
     AdobeReader,
     AzureAIDocumentIntelligenceLoader,
     DirectoryReader,
+    DoclingReader,
     HtmlReader,
     MathpixPDFReader,
     MhtmlReader,
@@ -32,9 +33,10 @@
     credential=str(config("AZURE_DI_CREDENTIAL", default="")),
     cache_dir=getattr(flowsettings, "KH_MARKDOWN_OUTPUT_DIR", None),
 )
-adobe_reader.vlm_endpoint = azure_reader.vlm_endpoint = getattr(
-    flowsettings, "KH_VLM_ENDPOINT", ""
-)
+docling_reader = DoclingReader()
+adobe_reader.vlm_endpoint = (
+    azure_reader.vlm_endpoint
+) = docling_reader.vlm_endpoint = getattr(flowsettings, "KH_VLM_ENDPOINT", "")
 
 
 KH_DEFAULT_FILE_EXTRACTORS: dict[str, BaseReader] = {
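
Reviewer note on the chained assignment above: the parenthesized middle target looks unusual but is ordinary Python, and all three readers end up with the same `vlm_endpoint` value. A minimal sketch, using `SimpleNamespace` stand-ins rather than the real reader classes and a placeholder endpoint string (both are assumptions for illustration):

```python
from types import SimpleNamespace

# Stand-ins for the reader instances; the real classes are not used here.
adobe_reader = SimpleNamespace()
azure_reader = SimpleNamespace()
docling_reader = SimpleNamespace()

endpoint = "https://example.invalid/vlm"  # placeholder, not a real endpoint

# Same shape as the statement in the diff: the right-hand side is evaluated
# once, then assigned to each target from left to right.
adobe_reader.vlm_endpoint = (
    azure_reader.vlm_endpoint
) = docling_reader.vlm_endpoint = endpoint
```

Three separate assignment statements would arguably be easier to read than the wrapped chain.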
Lines changed: 0 additions & 2 deletions
@@ -1,7 +1,5 @@
 from .citation import CitationPipeline
-from .text_based import CitationQAPipeline
 
 __all__ = [
     "CitationPipeline",
-    "CitationQAPipeline",
 ]
