Text file encoding issue #1126
def detect_encoding(file_path):
    """Detects the file encoding to avoid UnicodeDecodeError."""
    with open(file_path, 'rb') as f:
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
Copilot Autofix
AI 6 months ago
To fix the problem, we need to validate the file_path to ensure it does not lead to unauthorized file access. We can achieve this by normalizing the path and ensuring it is contained within a predefined safe directory. This involves the following steps:
- Define a safe root directory.
- Normalize the file_path using os.path.normpath.
- Check that the normalized path starts with the safe root directory.
@@ -7,4 +7,14 @@
 
+SAFE_ROOT_DIR = "/path/to/safe/root"  # Define the safe root directory
+
+def validate_file_path(file_path):
+    """Validates that the file path is within the safe root directory."""
+    normalized_path = os.path.normpath(file_path)
+    if not normalized_path.startswith(SAFE_ROOT_DIR):
+        raise Exception("Access to the specified file path is not allowed.")
+    return normalized_path
+
 def detect_encoding(file_path):
     """Detects the file encoding to avoid UnicodeDecodeError."""
+    file_path = validate_file_path(file_path)
     with open(file_path, 'rb') as f:
@@ -15,2 +25,3 @@
 def load_document_content(file_path):
+    file_path = validate_file_path(file_path)
     file_extension = Path(file_path).suffix.lower()
@@ -37,2 +48,3 @@
 def get_documents_from_file_by_path(file_path,file_name):
+    file_path = validate_file_path(file_path)
     file_path = Path(file_path)
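A note on the check suggested above: os.path.normpath does not make a relative path absolute, and a plain startswith comparison also accepts sibling directories such as "/path/to/safe/root_evil". The following is a minimal sketch of a stricter variant, not the code that was merged. It keeps the suggestion's function name, but the safe_root parameter and the use of PermissionError are assumptions added for illustration, and the root path is the same placeholder as in the suggestion.

```python
import os

# Placeholder root, copied from the suggestion above; replace with a real directory.
SAFE_ROOT_DIR = "/path/to/safe/root"


def validate_file_path(file_path, safe_root=SAFE_ROOT_DIR):
    """Return the resolved path if it lies inside safe_root, else raise."""
    # realpath makes both paths absolute and resolves symlinks and ".." parts.
    root = os.path.realpath(safe_root)
    candidate = os.path.realpath(file_path)
    # commonpath compares whole path components, so a sibling directory
    # whose name merely starts with the root's name is rejected.
    if os.path.commonpath([root, candidate]) != root:
        raise PermissionError("Access to the specified file path is not allowed.")
    return candidate
```

On Windows, os.path.commonpath raises ValueError when the two paths are on different drives, so a caller there may want to treat that case as a rejection too.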
        loader = UnstructuredFileLoader(file_path, mode="elements",autodetect_encoding=True)
        return loader,encoding_flag
    else:
        with open(file_path, encoding=encoding, errors="replace") as f:
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
Copilot Autofix
AI 6 months ago
To fix the problem, we need to validate and sanitize the file_name parameter before using it to construct file paths. We can use the os.path.normpath function to normalize the path and ensure it does not contain any path traversal characters. Additionally, we should check that the resulting path is within a designated safe directory.
- Normalize the file_name using os.path.normpath.
- Ensure the normalized path is within a safe root directory.
- Update the get_documents_from_file_by_path function to include these checks.
@@ -2,2 +2,3 @@
 from pathlib import Path
+import os
 from langchain_community.document_loaders import PyMuPDFLoader
@@ -36,4 +37,7 @@
 
-def get_documents_from_file_by_path(file_path,file_name):
-    file_path = Path(file_path)
+def get_documents_from_file_by_path(file_path, file_name):
+    base_path = Path("/safe/root/directory")  # Define your safe root directory here
+    file_path = base_path / Path(os.path.normpath(file_path))
+    if not str(file_path).startswith(str(base_path)):
+        raise Exception(f'Access to the file {file_name} is not allowed')
     if file_path.exists():
@@ -42,3 +46,3 @@
         try:
-            loader,encoding_flag = load_document_content(file_path)
+            loader, encoding_flag = load_document_content(file_path)
             if file_extension == ".pdf":
@@ -50,6 +54,6 @@
                 unstructured_pages = loader.load()
-                pages= get_pages_with_page_numbers(unstructured_pages)
+                pages = get_pages_with_page_numbers(unstructured_pages)
             else:
                 unstructured_pages = loader.load()
-                pages= get_pages_with_page_numbers(unstructured_pages)
+                pages = get_pages_with_page_numbers(unstructured_pages)
         except Exception as e:
@@ -59,3 +63,3 @@
         raise Exception(f'File {file_name} does not exist')
-    return file_name, pages , file_extension
+    return file_name, pages, file_extension
 
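One caveat with the suggestion above: pathlib does not collapse ".." segments when joining, so a value such as "../x" yields "/safe/root/directory/../x", whose string form still starts with the base path and passes the startswith check. The sketch below joins first, resolves second, and compares path components afterwards. The helper name resolve_under_base and its parameters are hypothetical and do not appear in the pull request.

```python
from pathlib import Path


def resolve_under_base(base_path, user_path):
    """Join user_path onto base_path and reject results that escape it."""
    base = Path(base_path).resolve()
    # Resolving after the join collapses ".." segments. It also covers
    # absolute user paths, which replace the base entirely when joined.
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base.
        candidate.relative_to(base)
    except ValueError:
        raise PermissionError(f"Access to {user_path!r} is not allowed") from None
    return candidate
```

A caller such as get_documents_from_file_by_path could then use the returned Path directly for its exists() check.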
This reverts commit 3a222cc.
Resolved the following error, which occurred for some of the text files:
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xe1 in position 214845: illegal multibyte sequence
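The body of detect_encoding is not shown in this conversation, so the following is only a rough sketch of one way to avoid this kind of error with the standard library: try a short list of candidate encodings on the raw bytes, then fall back to errors="replace" as the diff context above does. The function names guess_encoding and read_text_lenient and the candidate list are assumptions made for illustration.

```python
def guess_encoding(file_path, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(file_path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def read_text_lenient(file_path):
    """Read text, substituting U+FFFD for any bytes that cannot be decoded."""
    # Fall back to UTF-8 when no candidate decodes the file cleanly.
    encoding = guess_encoding(file_path) or "utf-8"
    with open(file_path, encoding=encoding, errors="replace") as f:
        return f.read()
```

Trying UTF-8 first matters because strict UTF-8 decoding fails quickly on most legacy multibyte text, whereas permissive legacy codecs can decode UTF-8 bytes into wrong characters without raising.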