Skip to content

Text file encoding issue #1126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 25, 2025
Merged

Text file encoding issue #1126

merged 3 commits into from
Feb 25, 2025

Conversation

kaustubh-darekar
Copy link
Collaborator

Resolved
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xe1 in position 214845: illegal multibyte sequence

For some of the text files

@kaustubh-darekar kaustubh-darekar added the bug Something isn't working label Feb 25, 2025
@kaustubh-darekar kaustubh-darekar self-assigned this Feb 25, 2025

def detect_encoding(file_path):
"""Detects the file encoding to avoid UnicodeDecodeError."""
with open(file_path, 'rb') as f:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.

Copilot Autofix

AI 6 months ago

To fix the problem, we need to validate the file_path to ensure it does not lead to unauthorized file access. We can achieve this by normalizing the path and ensuring it is contained within a predefined safe directory. This involves the following steps:

  1. Define a safe root directory.
  2. Normalize the file_path using os.path.normpath.
  3. Check that the normalized path starts with the safe root directory.
Suggested changeset 1
backend/src/document_sources/local_file.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/backend/src/document_sources/local_file.py b/backend/src/document_sources/local_file.py
--- a/backend/src/document_sources/local_file.py
+++ b/backend/src/document_sources/local_file.py
@@ -7,4 +7,14 @@
 
+SAFE_ROOT_DIR = "/path/to/safe/root"  # Define the safe root directory
+
+def validate_file_path(file_path):
+    """Validates that the file path is within the safe root directory."""
+    normalized_path = os.path.normpath(file_path)
+    if not normalized_path.startswith(SAFE_ROOT_DIR):
+        raise Exception("Access to the specified file path is not allowed.")
+    return normalized_path
+
 def detect_encoding(file_path):
    """Detects the file encoding to avoid UnicodeDecodeError."""
+   file_path = validate_file_path(file_path)
    with open(file_path, 'rb') as f:
@@ -15,2 +25,3 @@
 def load_document_content(file_path):
+    file_path = validate_file_path(file_path)
     file_extension = Path(file_path).suffix.lower()
@@ -37,2 +48,3 @@
 def get_documents_from_file_by_path(file_path,file_name):
+    file_path = validate_file_path(file_path)
     file_path = Path(file_path)
EOF
@@ -7,4 +7,14 @@

SAFE_ROOT_DIR = "/path/to/safe/root" # Define the safe root directory

def validate_file_path(file_path):
"""Validates that the file path is within the safe root directory."""
normalized_path = os.path.normpath(file_path)
if not normalized_path.startswith(SAFE_ROOT_DIR):
raise Exception("Access to the specified file path is not allowed.")
return normalized_path

def detect_encoding(file_path):
"""Detects the file encoding to avoid UnicodeDecodeError."""
file_path = validate_file_path(file_path)
with open(file_path, 'rb') as f:
@@ -15,2 +25,3 @@
def load_document_content(file_path):
file_path = validate_file_path(file_path)
file_extension = Path(file_path).suffix.lower()
@@ -37,2 +48,3 @@
def get_documents_from_file_by_path(file_path,file_name):
file_path = validate_file_path(file_path)
file_path = Path(file_path)
Copilot is powered by AI and may make mistakes. Always verify output.
loader = UnstructuredFileLoader(file_path, mode="elements",autodetect_encoding=True)
return loader,encoding_flag
else:
with open(file_path, encoding=encoding, errors="replace") as f:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.

Copilot Autofix

AI 6 months ago

To fix the problem, we need to validate and sanitize the file_name parameter before using it to construct file paths. We can use the os.path.normpath function to normalize the path and ensure it does not contain any path traversal characters. Additionally, we should check that the resulting path is within a designated safe directory.

  1. Normalize the file_name using os.path.normpath.
  2. Ensure the normalized path is within a safe root directory.
  3. Update the get_documents_from_file_by_path function to include these checks.
Suggested changeset 1
backend/src/document_sources/local_file.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/backend/src/document_sources/local_file.py b/backend/src/document_sources/local_file.py
--- a/backend/src/document_sources/local_file.py
+++ b/backend/src/document_sources/local_file.py
@@ -2,2 +2,3 @@
 from pathlib import Path
+import os
 from langchain_community.document_loaders import PyMuPDFLoader
@@ -36,4 +37,7 @@
     
-def get_documents_from_file_by_path(file_path,file_name):
-    file_path = Path(file_path)
+def get_documents_from_file_by_path(file_path, file_name):
+    base_path = Path("/safe/root/directory")  # Define your safe root directory here
+    file_path = base_path / Path(os.path.normpath(file_path))
+    if not str(file_path).startswith(str(base_path)):
+        raise Exception(f'Access to the file {file_name} is not allowed')
     if file_path.exists():
@@ -42,3 +46,3 @@
         try:
-            loader,encoding_flag = load_document_content(file_path)
+            loader, encoding_flag = load_document_content(file_path)
             if file_extension == ".pdf":
@@ -50,6 +54,6 @@
                     unstructured_pages = loader.load()   
-                    pages= get_pages_with_page_numbers(unstructured_pages)   
+                    pages = get_pages_with_page_numbers(unstructured_pages)   
             else:
                 unstructured_pages = loader.load()   
-                pages= get_pages_with_page_numbers(unstructured_pages)      
+                pages = get_pages_with_page_numbers(unstructured_pages)      
         except Exception as e:
@@ -59,3 +63,3 @@
         raise Exception(f'File {file_name} does not exist')
-    return file_name, pages , file_extension
+    return file_name, pages, file_extension
 
EOF
@@ -2,2 +2,3 @@
from pathlib import Path
import os
from langchain_community.document_loaders import PyMuPDFLoader
@@ -36,4 +37,7 @@

def get_documents_from_file_by_path(file_path,file_name):
file_path = Path(file_path)
def get_documents_from_file_by_path(file_path, file_name):
base_path = Path("/safe/root/directory") # Define your safe root directory here
file_path = base_path / Path(os.path.normpath(file_path))
if not str(file_path).startswith(str(base_path)):
raise Exception(f'Access to the file {file_name} is not allowed')
if file_path.exists():
@@ -42,3 +46,3 @@
try:
loader,encoding_flag = load_document_content(file_path)
loader, encoding_flag = load_document_content(file_path)
if file_extension == ".pdf":
@@ -50,6 +54,6 @@
unstructured_pages = loader.load()
pages= get_pages_with_page_numbers(unstructured_pages)
pages = get_pages_with_page_numbers(unstructured_pages)
else:
unstructured_pages = loader.load()
pages= get_pages_with_page_numbers(unstructured_pages)
pages = get_pages_with_page_numbers(unstructured_pages)
except Exception as e:
@@ -59,3 +63,3 @@
raise Exception(f'File {file_name} does not exist')
return file_name, pages , file_extension
return file_name, pages, file_extension

Copilot is powered by AI and may make mistakes. Always verify output.
@karanchellani karanchellani merged commit 3a222cc into dev Feb 25, 2025
2 of 3 checks passed
kaustubh-darekar added a commit that referenced this pull request Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants