-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat: Allow subfield rename and deletion for ORC format #15848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Yuhta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in column reader looks good. For the configuration on Hive config and reader options, can we reuse the existing one?
velox/connectors/hive/HiveConfig.h
Outdated
| static constexpr const char* kParquetUseColumnNamesSession = | ||
| "parquet_use_column_names"; | ||
|
|
||
| static constexpr const char* kOrcAllowEnhancedSchemaEvolution = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just name it match_fields_by_name. Also do we really want to differentiate this config by file format? The usual case is the whole data warehouse should have one setting, no matter which file format is used in a particular partition. Otherwise the management of metadata would be very messy, and data migration would be almost impossible.
Also how does it interplay with orc_use_column_names and parquet_use_column_names? Do we want to deprecate these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt review. I’ve updated this PR to reuse the useColumnNamesForColumnMapping configuration. Previously, subfield rename and deletion would raise errors. They are now supported when use_column_names mode is enabled.
Regarding file format differentiation, it has been observed that Spark behaves differently for Parquet and ORC formats (see #5962 (comment)). To ensure compatibility with both Presto and Spark when using the Parquet format, I propose adding an extra Parquet-specific configuration to handle this special NULL logic in the follow-up PR. For example, when reading a column of type row({"a", "c"}) with a schema of row({"b"}), the default result is row(null). With this configuration enabled, the result would instead be null. Does this approach make sense? Thanks.
| TypePtr fieldType = nullptr; | ||
| if (baseReaderOpts_.allowEnhancedSchemaEvolution()) { | ||
| auto outputTypeIdx = | ||
| readerOutputType_->getChildIdxIfExists(fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need this, readerOutputType_ is just a subset of tableSchema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tableSchema may be nullptr in some cases, so I updated the logic to use readerOutputType_ for child access when tableSchema is unavailable. Please kindly let me know if this looks reasonable, thanks.
velox/dwio/common/Options.h
Outdated
| return useColumnNamesForColumnMapping_; | ||
| } | ||
|
|
||
| bool allowEnhancedSchemaEvolution() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about reuse useColumnNamesForColumnMapping()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
The default behavior of the schema evolution for row type is matching by index.
This PR supports subfield rename and deletion for ORC file format, controlled by
configuration
useColumnNamesForColumnMapping.Missing subfields are identified by matching the file type and requested type on
the names of subfileds, and NULL occupies the position of the missing subfields.
Below table summarizes the results for difference cases.
This is a separation PR of #5962.