Error Parsing PDF #1068

@dmcgrath19

Hi there, great code!

However, I have recently run into an error when parsing some new PDFs (research papers).

I think this can be handled with a small error-handling change: switching data.encode("utf-8") to data.encode("utf-8", errors="replace") so the unencodable characters are substituted instead of raising. I wanted to make sure I wasn't missing anything else here, though. Thanks!
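
For reference, here is a minimal sketch of the change I have in mind. The name hexdigest_lenient is hypothetical and only for illustration; the real function is hexdigest in paperqa/utils.py, and I am assuming it only ever receives a str, as the traceback suggests.

```python
import hashlib


def hexdigest_lenient(data: str) -> str:
    """Hash text with MD5, tolerating characters UTF-8 cannot encode.

    Hypothetical variant of paperqa.utils.hexdigest: the only change is
    errors="replace", which substitutes "?" for each unencodable character
    (such as a lone surrogate) instead of raising UnicodeEncodeError.
    """
    encoded = data.encode("utf-8", errors="replace")
    return hashlib.md5(encoded).hexdigest()  # noqa: S324


# A string containing a lone surrogate, similar to what the PDF text seems to contain.
digest = hexdigest_lenient("abc\ud83d")
```

Ordinary text hashes exactly as before; only strings that currently raise would be affected.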

/home/ray/anaconda3/lib/python3.12/asyncio/base_events.py:1984: RuntimeWarning: coroutine 'OpenAIChatCompletion.acompletion' was never awaited
  handle = self._ready.popleft()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
  + Exception Group Traceback (most recent call last):
  |   File "/data/xxx/paper-qa/index_papers.py", line 51, in <module>
  |     main()
  |   File "/data/xxx/paper-qa/index_papers.py", line 48, in main
  |     index_with_default_setting(args.paper_dir)
  |   File "/data/xxx/paper-qa/index_papers.py", line 32, in index_with_default_setting
  |     build_index(settings=settings)
  |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/__init__.py", line 150, in build_index
  |     return run_or_ensure(coro=get_directory_index(settings=settings))
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/utils.py", line 201, in run_or_ensure
  |     return loop.run_until_complete(coro)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/ray/anaconda3/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
  |     return future.result()
  |            ^^^^^^^^^^^^^^^
  |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 715, in get_directory_index
  |     async with anyio.create_task_group() as tg:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 772, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 552, in process_file
    |     await search_index.add_document(
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 321, in add_document
    |     await _add_document()  # If this runs, we succeeded
    |     ^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    |     return await copy(fn, *args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    |     do = await self.iter(retry_state=retry_state)
    |          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    |     result = await action(retry_state)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    |     return call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/__init__.py", line 400, in <lambda>
    |     self._add_action_func(lambda rs: rs.outcome.result())
    |                                      ^^^^^^^^^^^^^^^^^^^
    |   File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    |     return self.__get_result()
    |            ^^^^^^^^^^^^^^^^^^^
    |   File "/home/ray/anaconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    |     raise self._exception
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    |     result = await fn(*args, **kwargs)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 297, in _add_document
    |     if not await self.filecheck(index_doc["file_location"], index_doc["body"]):
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 263, in filecheck
    |     filehash: str | None = self.filehash(body) if body else None
    |                            ^^^^^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/agents/search.py", line 259, in filehash
    |     return hexdigest(body)
    |            ^^^^^^^^^^^^^^^
    |   File "/data/xxx/paper-qa/.venv/lib/python3.12/site-packages/paperqa/utils.py", line 100, in hexdigest
    |     return hashlib.md5(data.encode("utf-8")).hexdigest()  # noqa: S324
    |                        ^^^^^^^^^^^^^^^^^^^^
    | UnicodeEncodeError: 'utf-8' codec can't encode characters in position 15500-15501: surrogates not allowed
    +------------------------------------
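
The failure can be reproduced without a PDF. The snippet below is a sketch under the assumption that the extracted text contains lone surrogate code points (the sample strings are made up, not taken from the actual papers). It also shows one trade-off of errors="replace": different surrogates all become "?", so two different bodies can end up with the same hash, whereas errors="surrogatepass" keeps them distinct.

```python
import hashlib

# Hypothetical stand-ins for extracted PDF text containing lone surrogates.
body_a = "formula \ud835 end"
body_b = "formula \ud836 end"

# Strict encoding fails the same way as in the traceback.
try:
    body_a.encode("utf-8")
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason  # the codec's explanation of the failure


def md5_with(text: str, errors: str) -> str:
    """MD5 of the UTF-8 encoding of text, using the given codec error handler."""
    return hashlib.md5(text.encode("utf-8", errors=errors)).hexdigest()  # noqa: S324


# "replace" maps every unencodable character to "?", so the two bodies collide.
replace_collides = md5_with(body_a, "replace") == md5_with(body_b, "replace")

# "surrogatepass" encodes the surrogate code points themselves, so hashes differ.
surrogatepass_differs = (
    md5_with(body_a, "surrogatepass") != md5_with(body_b, "surrogatepass")
)
```

Whether a collision like this matters depends on how the file hash is used in the index, which I have not checked.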

Labels: bug