Open
Description
I noticed that TPC-H query 1 spends only about half its time in Parquet IO when using the dataset that was produced by pyarrow.
However, the run is heavily GIL-congested, and another GIL+native profile reveals that very few things (in Python) are actually holding on to the GIL. (Native-only threads, e.g. those of pyarrow, are not tracked by py-spy, so we don't see how Arrow holds the GIL to produce the dataframe.)
The profile points to the pandas take function. There has already been a recent fix in this area of the code (see pandas-dev/pandas#54483) for axis 0.
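To check whether take is a plausible source of the contention outside of the full query, a rough micro-benchmark along the following lines might help. This is only a sketch under assumptions: the helper names (make_frame, take_rows, time_takes), the frame shape, and the thread counts are all made up for illustration and are not taken from the TPC-H run. The idea is that if take holds the GIL for most of its runtime, running the same number of calls on several threads should take about as long as running them on one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def make_frame(n_rows=200_000, n_cols=8, seed=0):
    # Hypothetical stand-in data; NOT the TPC-H lineitem table.
    rng = np.random.default_rng(seed)
    data = rng.random((n_rows, n_cols))
    return pd.DataFrame(data, columns=[f"c{i}" for i in range(n_cols)])


def take_rows(df, indexer):
    # Positional row selection along axis 0, the code path discussed above.
    return df.take(indexer, axis=0)


def time_takes(df, indexer, n_tasks, n_threads):
    # Run n_tasks identical take calls on a pool of n_threads threads
    # and return (elapsed seconds, list of results).
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda _: take_rows(df, indexer), range(n_tasks)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    df = make_frame()
    indexer = np.random.default_rng(1).permutation(len(df))
    for n_threads in (1, 4):
        elapsed, _ = time_takes(df, indexer, n_tasks=16, n_threads=n_threads)
        print(f"{n_threads} thread(s): {elapsed:.3f}s")
```

If the 4-thread timing is close to the 1-thread timing, that would be consistent with take serialising on the GIL, but it would not prove it: memory bandwidth and allocator behaviour can produce similar numbers, so a py-spy run with GIL and native frames enabled is still the more direct evidence.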