Description
Hi there, great code!
However, I recently ran into an error when parsing some new PDFs (research papers): hexdigest in paperqa/utils.py raises a UnicodeEncodeError because the extracted text contains surrogate characters.
I think this could be handled by switching data.encode("utf-8") to data.encode("utf-8", errors="replace"), but I wanted to make sure I wasn't missing anything else here. Thanks!
/home/ray/anaconda3/lib/python3.12/asyncio/base_events.py:1984: RuntimeWarning: coroutine 'OpenAIChatCompletion.acompletion' was never awaited
handle = self._ready.popleft()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
+ Exception Group Traceback (most recent call last):
| File "/data/xxx/paper-qa/index_papers.py", line 51, in <module>
| main()
| File "/data/xxx/paper-qa/index_papers.py", line 48, in main
| index_with_default_setting(args.paper_dir)
| File "/data/xxx/paper-qa/index_papers.py", line 32, in index_with_default_setting
| build_index(settings=settings)
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/__init__.py", line 150, in build_index
| return run_or_ensure(coro=get_directory_index(settings=settings))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/utils.py", line 201, in run_or_ensure
| return loop.run_until_complete(coro)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/ray/anaconda3/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
| return future.result()
| ^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 715, in get_directory_index
| async with anyio.create_task_group() as tg:
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
| raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 552, in process_file
| await search_index.add_document(
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 321, in add_document
| await _add_document() # If this runs, we succeeded
| ^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
| return await copy(fn, *args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
| do = await self.iter(retry_state=retry_state)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
| result = await action(retry_state)
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
| return call(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/__init__.py", line 400, in <lambda>
| self._add_action_func(lambda rs: rs.outcome.result())
| ^^^^^^^^^^^^^^^^^^^
| File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
| return self.__get_result()
| ^^^^^^^^^^^^^^^^^^^
| File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
| raise self._exception
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
| result = await fn(*args, **kwargs)
| ^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 297, in _add_document
| if not await self.filecheck(index_doc["file_location"], index_doc["body"]):
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 263, in filecheck
| filehash: str | None = self.filehash(body) if body else None
| ^^^^^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 259, in filehash
| return hexdigest(body)
| ^^^^^^^^^^^^^^^
| File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/utils.py", line 100, in hexdigest
| return hashlib.md5(data.encode("utf-8")).hexdigest() # noqa: S324
| ^^^^^^^^^^^^^^^^^^^^
| UnicodeEncodeError: 'utf-8' codec can't encode characters in position 15500-15501: surrogates not allowed
+------------------------------------