Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current API for the parquet crate is rather large, and exposes quite a lot of implementation detail.
This has a couple of implications:
- It complicates iterating on the crate without making breaking changes to public APIs
- It adds to users' cognitive load, as they have to work out which APIs to use
Some examples of this:
- The `util` module contains all sorts of random stuff: a hash implementation, maths functions, memory tracking, etc.
- The `compression` module
- `data_type::AsBytes`, `data_type::SliceAsBytes`, `data_type::SliceAsBytesDataType`
- `data_type::DataType`, `ColumnReaderImpl`, `RecordReader`
- `schema::types::to_thrift`
Describe the solution you'd like
I'm not familiar enough with the design or objectives of the crate to authoritatively weigh in on what should or shouldn't be public, however, it is my observation that a number of the APIs don't appear to be optimised for external consumption.
My personal preference would be to make everything lower than the file-level, i.e. `SerializedFileReader`, `ParquetFileArrowReader`, `RowIter`, crate-local, as this would have the benefit of being pretty unambiguous and easy to communicate and maintain.
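To make the idea concrete, here is a minimal sketch of what "crate-local below the file level" could look like using `pub(crate)`. All names here (`sketch`, `column`, `decode_u32_sum`, `FileReader`, `sum_values`) are hypothetical and chosen for illustration only; this is not the parquet crate's actual module layout or API.

```rust
// Hypothetical layout, not the real parquet crate: a column-level helper is
// kept crate-local while a file-level reader is the only public entry point.
mod sketch {
    // Implementation detail: visible anywhere inside this crate, but not to
    // downstream crates, so it can change without a breaking release.
    pub(crate) mod column {
        /// Stand-in for column decoding: sums the little-endian u32 values
        /// found in `bytes`, ignoring any trailing partial value.
        pub(crate) fn decode_u32_sum(bytes: &[u8]) -> u64 {
            bytes
                .chunks_exact(4)
                .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]) as u64)
                .sum()
        }
    }

    // File-level API: the surface that would stay public.
    pub mod file {
        use super::column;

        /// Hypothetical file-level reader wrapping the crate-local helper.
        pub struct FileReader {
            data: Vec<u8>,
        }

        impl FileReader {
            pub fn new(data: Vec<u8>) -> Self {
                Self { data }
            }

            /// Public method that delegates to the crate-local column code.
            pub fn sum_values(&self) -> u64 {
                column::decode_u32_sum(&self.data)
            }
        }
    }
}
```

Under this arrangement, code inside the crate can still call `sketch::column::decode_u32_sum` directly, while external users would only be able to go through `FileReader`.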
I don't know if there are people making use of the lower-level APIs operating on columns, row groups, column chunks, pages, etc... However, any APIs made private could be made public again in a point-release based on user feedback.
I think this touches on the objectives for the crate: is the intent to provide APIs for manipulating parquet files, or APIs for implementing parquet readers and writers for your own custom in-memory format? If the latter, this change would be at odds with it, but I'm not sure this is the case.
Any changes would obviously need to be made in a major arrow release, the next of which I believe is in January 2022 (@alamb could maybe confirm), so there is no immediate urgency, but I wanted to start the discussion.
Relates to #171