Labels
enhancement: Any new improvement worthy of an entry in the changelog
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently we provide ParquetRecordBatchReader and ParquetRecordBatchStream as interfaces to read parquet data.
These interfaces are relatively straightforward to use, but have limitations:
- Cannot easily support concurrent decode (Parquet: How to do concurrent decoding over columns? #5120)
- Limited flexibility w.r.t how data is fetched or decoded (Support customizing row group reading process in async reader #5141)
- Cannot avoid decoding columns multiple times in the presence of predicates (improve: reuse Arc<dyn Array> in parquet record batch reader. #4864)
There is the experimental ArrayReader interface; however, it is very hard to use correctly and exposes a lot of what should probably remain implementation details.
Describe the solution you'd like
I would like an interface, perhaps similar in spirit to that added to the write side by #4871, that achieves the following:
- Makes it easy to parallelise both:
  - The decoding of the parquet leaf columns
  - The re-assembly of the arrow data from the dremel encodings
- Facilitates overriding the data source, e.g. by exposing the RowGroups trait
- Avoids exposing too many internal implementation details
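To make the shape of the request a bit more concrete, here is a rough, std-only sketch of two of the goals above: decoding leaf columns in parallel, and letting the caller swap the data source. Everything in it is hypothetical and made up for illustration (ColumnSource, EncodedColumn, DecodedColumn, InMemorySource, decode_leaf, read_parallel); none of it is an existing parquet crate API, and the "decoding" is a toy byte-widening step rather than real parquet decoding or dremel re-assembly.

```rust
use std::thread;

/// Hypothetical stand-in for one encoded parquet leaf column chunk.
struct EncodedColumn {
    name: &'static str,
    bytes: Vec<u8>,
}

/// Hypothetical stand-in for one decoded leaf column.
#[derive(Debug, PartialEq)]
struct DecodedColumn {
    name: &'static str,
    values: Vec<i64>,
}

/// Hypothetical pluggable data source (an assumption for this sketch,
/// loosely in the spirit of overriding where column data comes from).
trait ColumnSource: Sync {
    fn num_columns(&self) -> usize;
    fn fetch(&self, idx: usize) -> EncodedColumn;
}

/// A trivial in-memory source used only for the example.
struct InMemorySource {
    columns: Vec<(&'static str, Vec<u8>)>,
}

impl ColumnSource for InMemorySource {
    fn num_columns(&self) -> usize {
        self.columns.len()
    }

    fn fetch(&self, idx: usize) -> EncodedColumn {
        let (name, bytes) = &self.columns[idx];
        EncodedColumn { name: *name, bytes: bytes.clone() }
    }
}

/// Toy "decode": widen every byte to an i64. Real decoding is far more involved.
fn decode_leaf(col: EncodedColumn) -> DecodedColumn {
    DecodedColumn {
        name: col.name,
        values: col.bytes.iter().map(|b| i64::from(*b)).collect(),
    }
}

/// Fetch and decode every leaf column on its own scoped thread,
/// then collect the results in the original column order.
fn read_parallel<S: ColumnSource>(source: &S) -> Vec<DecodedColumn> {
    thread::scope(|scope| {
        let handles: Vec<_> = (0..source.num_columns())
            .map(|idx| scope.spawn(move || decode_leaf(source.fetch(idx))))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("decode thread panicked"))
            .collect()
    })
}

fn main() {
    let source = InMemorySource {
        columns: vec![("a", vec![1, 2, 3]), ("b", vec![4, 5])],
    };
    for col in read_parallel(&source) {
        println!("{}: {:?}", col.name, col.values);
    }
}
```

The point of the sketch is only that the caller, not the reader, decides how work is scheduled and where bytes come from; a real design would also need to cover the re-assembly step and async sources.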
Describe alternatives you've considered
Additional context