Text file encoding issue #1126

kaustubh-darekar · 2025-02-25T08:07:45Z

Resolved the following error, which was raised for some text files:

UnicodeDecodeError: 'gb2312' codec can't decode byte 0xe1 in position 214845: illegal multibyte sequence

backend/src/document_sources/local_file.py


+def detect_encoding(file_path):
+   """Detects the file encoding to avoid UnicodeDecodeError."""
+   with open(file_path, 'rb') as f:
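
The hunk above is cut off, so the body of detect_encoding is not visible here. As an illustration only, and not the PR's implementation, a stdlib-only sketch of the general idea could read the raw bytes and try a short list of candidate encodings in order. The name guess_encoding and the candidate list are assumptions made for this example.

```python
# Hypothetical candidate list; not taken from the PR.
CANDIDATE_ENCODINGS = ("utf-8", "gb2312", "latin-1")


def guess_encoding(file_path, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(file_path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes; try the next one
        return encoding
    return None
```

Because latin-1 maps every byte to a character, keeping it last makes it a catch-all: the result is always decodable but may not be the encoding the author intended. Opening the file with errors="replace", as the later hunk in this PR does, is another way to avoid the exception at the cost of replacement characters.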


To fix the problem, we need to validate the file_path to ensure it does not lead to unauthorized file access. We can achieve this by normalizing the path and ensuring it is contained within a predefined safe directory. This involves the following steps:

1. Define a safe root directory.
2. Normalize the file_path using os.path.normpath.
3. Check that the normalized path starts with the safe root directory.
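
The steps above can be sketched as follows. One caveat about step 3: a plain string prefix test also accepts sibling directories such as "/path/to/safe/root_evil", so this sketch compares whole path components with os.path.commonpath instead. The helper name is_within_root is hypothetical, and SAFE_ROOT_DIR is the placeholder value used in the suggestion, not a real project path.

```python
import os

SAFE_ROOT_DIR = "/path/to/safe/root"  # placeholder, as in the suggestion


def is_within_root(file_path, root=SAFE_ROOT_DIR):
    """Return True if file_path resolves to a location inside root."""
    root = os.path.realpath(root)
    # Joining first means relative paths are interpreted against the root;
    # an absolute file_path replaces the root and is then checked as-is.
    candidate = os.path.realpath(os.path.join(root, file_path))
    try:
        return os.path.commonpath([root, candidate]) == root
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False
```

os.path.realpath also resolves symbolic links, which os.path.normpath does not, so a link inside the root that points elsewhere is rejected as well.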

backend/src/document_sources/local_file.py

+            loader = UnstructuredFileLoader(file_path, mode="elements",autodetect_encoding=True)
+            return loader,encoding_flag
+        else:
+            with open(file_path, encoding=encoding, errors="replace") as f:


To fix the problem, we need to validate and sanitize the file_name parameter before using it to construct file paths. We can use the os.path.normpath function to normalize the path and ensure it does not contain any path traversal characters. Additionally, we should check that the resulting path is within a designated safe directory.

1. Normalize the file_name using os.path.normpath.
2. Ensure the normalized path is within a safe root directory.
3. Update the get_documents_from_file_by_path function to include these checks.
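
A minimal sketch of these checks, assuming the file name is joined onto a fixed directory before use. Note that os.path.normpath leaves a leading ".." in a relative path untouched ("../x" stays "../x"), so the containment test here is applied after joining and resolving rather than on the normalized name alone. The function name resolve_upload_path and the choice of ValueError are assumptions for illustration and do not come from the PR.

```python
from pathlib import Path


def resolve_upload_path(root, file_name):
    """Join file_name onto root and reject results that escape root."""
    root = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks.
    candidate = (root / file_name).resolve()
    # The root itself is not a valid file target, so only descendants pass.
    if root not in candidate.parents:
        raise ValueError(f"Access to the file {file_name} is not allowed")
    return candidate
```

A caller such as get_documents_from_file_by_path could run this once at the top and use the returned Path for every later file operation, so that no code path touches the unvalidated name.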


kaustubh-darekar added 2 commits February 24, 2025 12:11

Resolved UnicodeDecodeError for some txt files

ee43349

changes done to process utf-8 file with unstructured only

ae05c56

kaustubh-darekar added the bug label Feb 25, 2025

kaustubh-darekar requested review from praveshkumar1988 and karanchellani February 25, 2025 08:07

kaustubh-darekar self-assigned this Feb 25, 2025

github-advanced-security bot found potential problems Feb 25, 2025

karanchellani approved these changes Feb 25, 2025

Merge branch 'dev' into text_file_encoding_issue

3259116

karanchellani merged commit 3a222cc into dev Feb 25, 2025
2 of 3 checks passed

kaustubh-darekar added a commit that referenced this pull request Feb 26, 2025

Revert "Text file encoding issue (#1126)"

bd143ee

This reverts commit 3a222cc.

@@ -7,4 +7,14 @@
+SAFE_ROOT_DIR = "/path/to/safe/root"  # Define the safe root directory
+def validate_file_path(file_path):
+    """Validates that the file path is within the safe root directory."""
+    normalized_path = os.path.normpath(file_path)
+    if not normalized_path.startswith(SAFE_ROOT_DIR):
+        raise Exception("Access to the specified file path is not allowed.")
+    return normalized_path
 def detect_encoding(file_path):
    """Detects the file encoding to avoid UnicodeDecodeError."""
+   file_path = validate_file_path(file_path)
    with open(file_path, 'rb') as f:
@@ -15,2 +25,3 @@
 def load_document_content(file_path):
+    file_path = validate_file_path(file_path)
     file_extension = Path(file_path).suffix.lower()
@@ -37,2 +48,3 @@
 def get_documents_from_file_by_path(file_path,file_name):
+    file_path = validate_file_path(file_path)
     file_path = Path(file_path)

@@ -2,2 +2,3 @@
 from pathlib import Path
+import os
 from langchain_community.document_loaders import PyMuPDFLoader
@@ -36,4 +37,7 @@
-def get_documents_from_file_by_path(file_path,file_name):
-    file_path = Path(file_path)
+def get_documents_from_file_by_path(file_path, file_name):
+    base_path = Path("/safe/root/directory")  # Define your safe root directory here
+    file_path = base_path / Path(os.path.normpath(file_path))
+    if not str(file_path).startswith(str(base_path)):
+        raise Exception(f'Access to the file {file_name} is not allowed')
     if file_path.exists():
@@ -42,3 +46,3 @@
         try:
-            loader,encoding_flag = load_document_content(file_path)
+            loader, encoding_flag = load_document_content(file_path)
             if file_extension == ".pdf":
@@ -50,6 +54,6 @@
                     unstructured_pages = loader.load()
-                    pages= get_pages_with_page_numbers(unstructured_pages)
+                    pages = get_pages_with_page_numbers(unstructured_pages)
             else:
                 unstructured_pages = loader.load()
-                pages= get_pages_with_page_numbers(unstructured_pages)
+                pages = get_pages_with_page_numbers(unstructured_pages)
         except Exception as e:
@@ -59,3 +63,3 @@
         raise Exception(f'File {file_name} does not exist')
-    return file_name, pages , file_extension
+    return file_name, pages, file_extension
