BUG: to_parquet (pyarrow engine) opens the local file path twice #65810

@mkviatkovskii

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import builtins
import os

import pandas as pd

# Spy on Python-level opens of the destination path.
target = "test.parquet"
opens = []
real_open = builtins.open

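def spy_open(file, *args, **kwargs):
    # open() also accepts integer file descriptors, which os.fspath() rejects,
    # so only compare path-like arguments against the target.
    if isinstance(file, (str, os.PathLike)):
        if os.path.abspath(os.fspath(file)) == os.path.abspath(target):
            opens.append(args)
    return real_open(file, *args, **kwargs)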

builtins.open = spy_open
try:
    pd.DataFrame({"a": [1, 2, 3]}).to_parquet(target, engine="pyarrow")
finally:
    builtins.open = real_open

print(f"pandas opened the destination {len(opens)} time(s) at the Python level")

# Observed: 1  -> pandas opens the file itself via get_handle...
# ...but pyarrow ALSO opens the same path through its C++ layer and does the
# actual writing, so the path is opened twice per to_parquet call.

# On Linux this is also visible at the syscall level:
# strace -f -e trace=openat python -c \
#  "import pandas as pd; pd.DataFrame({'a':[1]}).to_parquet('x.parquet')" 2>&1 \
#  | grep -c 'x.parquet'
# two openat() calls for the same path

Issue Description

For a local filesystem path, to_parquet with the pyarrow engine first resolves the path through pandas.io.common.get_handle (called from _get_path_or_handle), which opens the file. It then unwraps the handle's .name back to a string and hands that string to pyarrow. pyarrow opens the same path a second time through its own (memory-mapped, multithreaded) C++ I/O layer, which is where the read/write actually happens. As a result, every local-path call opens the destination (or source) twice.
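The pattern can be reproduced in miniature with only the standard library. This is a sketch, not pandas or pyarrow code: `resolve_to_name` and `engine_write` are hypothetical stand-ins for the pandas-side get_handle step and the engine-side writer, respectively.

```python
import builtins
import os
import tempfile


def resolve_to_name(path):
    """Hypothetical stand-in for the pandas side: open the path, then
    reduce the handle back to its name while the handle stays open."""
    handle = open(path, "wb")
    return handle.name, handle


def engine_write(path, payload):
    """Hypothetical stand-in for the engine: it opens the path itself."""
    with open(path, "wb") as f:
        f.write(payload)


def count_opens(path, payload):
    """Run both steps and count Python-level opens of `path`."""
    opened = []
    real_open = builtins.open

    def spy(file, *args, **kwargs):
        if file == path:
            opened.append(file)
        return real_open(file, *args, **kwargs)

    builtins.open = spy
    try:
        name, handle = resolve_to_name(path)  # first open, never written to
        engine_write(name, payload)           # second open, does the work
        handle.close()
    finally:
        builtins.open = real_open
    return len(opened)


with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, "test.parquet")
    n_opens = count_opens(dest, b"payload")
    with open(dest, "rb") as f:
        written = f.read()

print(n_opens, written)
```

On an ordinary local filesystem the data survives, because the unused first handle has nothing buffered to flush when it closes. The only visible cost is the extra open.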

Consequences:

  • On POSIX: a wasted open()/close() syscall pair on every call (minor, but pointless).
  • On filesystems that finalize a file's contents when the descriptor is closed (e.g. certain write-once / object-store backed filesystems): the empty pandas-side descriptor is closed after pyarrow has written and closed its own, so the empty one wins and the file is silently truncated to 0 bytes. This is data loss with no error raised.
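The second consequence can be illustrated with a toy in-memory model. `CommitOnCloseStore` is an assumption made for illustration, not a real filesystem API: each handle buffers its own writes and publishes its whole buffer when closed, so whichever handle closes last determines the final contents.

```python
class CommitOnCloseStore:
    """Toy model (hypothetical): contents are finalized on close()."""

    def __init__(self):
        self.objects = {}

    def open(self, path):
        return _Handle(self, path)


class _Handle:
    def __init__(self, store, path):
        self._store = store
        self._path = path
        self._buf = bytearray()

    def write(self, data):
        self._buf += data

    def close(self):
        # Publish this handle's buffer, replacing whatever was there.
        self._store.objects[self._path] = bytes(self._buf)


store = CommitOnCloseStore()
outer = store.open("x.parquet")  # pandas-side handle, never written to
inner = store.open("x.parquet")  # engine-side handle, does the real write

inner.write(b"parquet bytes")
inner.close()
size_after_engine = len(store.objects["x.parquet"])

outer.close()  # the empty buffer is published last and wins
size_after_outer = len(store.objects["x.parquet"])

print(size_after_engine, size_after_outer)
```

In this model the object is non-empty after the engine closes and empty after the outer handle closes, matching the ordering described in the bullet above.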

Expected Behavior

A local path should be opened exactly once. pandas should hand the string path directly to pyarrow (which opens it itself), without first opening it via get_handle. The repro above should print 0 Python-level opens of the destination, and strace should show a single openat() for the path.

(Non-fsspec URLs such as http(s):// still need get_handle, since pyarrow can't fetch those - only genuine local paths should skip it.)
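One possible shape for that decision, as a sketch only: `is_plain_local_path` is a hypothetical helper, not an existing pandas function, and the scheme check is a deliberately simple assumption (anything that looks like `scheme://` is treated as non-local, including `file://`).

```python
import io
import os
import pathlib
import re

# Assumption: a leading "scheme://" marks a URL rather than a local path.
_URL_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def is_plain_local_path(path_or_buf):
    """Hypothetical check: True if the argument is a str or os.PathLike
    with no URL scheme, i.e. something the engine can open by itself."""
    if not isinstance(path_or_buf, (str, os.PathLike)):
        return False  # file-like objects and buffers are not paths
    path = os.fspath(path_or_buf)
    if not isinstance(path, str):
        return False  # bytes paths are left to the existing code path
    return _URL_SCHEME.match(path) is None


checks = {
    "str path": is_plain_local_path("test.parquet"),
    "Path object": is_plain_local_path(pathlib.Path("data") / "test.parquet"),
    "https URL": is_plain_local_path("https://example.com/test.parquet"),
    "buffer": is_plain_local_path(io.BytesIO()),
}
print(checks)
```

Under this sketch, only the first two inputs would bypass get_handle and be passed to the engine as strings. URLs and buffers would keep going through the existing handle logic.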

Installed Versions

INSTALLED VERSIONS
------------------
commit : 72f2fea
python : 3.11.2
python-bits : 64
OS : Linux
OS-release : 6.1.0-48-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.1.172-1 (2026-05-15)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 3.0.3
numpy : 2.4.6
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyiceberg : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
qtpy : None
pyqt5 : None

Metadata

    Labels

    Bug, Needs Triage (issue that has not been reviewed by a pandas team member)
