Collaborator

@rui-mo rui-mo commented Dec 23, 2025

By default, schema evolution for row types matches subfields by index.
This PR adds support for subfield rename and deletion in the ORC file format, controlled by
the useColumnNamesForColumnMapping configuration.
Missing subfields are identified by matching the subfield names of the file type against
those of the requested type, and NULL fills the position of each missing subfield.
The table below summarizes the results for different cases.

| Column schema | Requested output schema | ORC result |
| --- | --- | --- |
| `row({"a", "c"})` | `row({"a", "b", "c"})` | `row(a_val, NULL, c_val)` |
| `row({"a", "c"})` | `row({"b"})` | `row(NULL)` |
| `row({"a", "c"})` | `row({"b", "d"})` | `row(NULL, NULL)` |
| `row({"a", "c"})` | `row({})` | `row()` |
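
The name-based matching summarized in the table can be sketched roughly as follows. This is a standalone illustration, not the reader code in this PR: `matchSubfieldsByName` and its plain-vector inputs are hypothetical stand-ins for the file type and requested type. Each requested subfield maps to the index of the same-named file subfield, or to `std::nullopt` where the reader would emit NULL.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of Velox): for each requested subfield,
// returns the index of the same-named subfield in the file, or std::nullopt
// when the file has no such subfield, i.e. the position to fill with NULL.
std::vector<std::optional<std::size_t>> matchSubfieldsByName(
    const std::vector<std::string>& fileNames,
    const std::vector<std::string>& requestedNames) {
  std::unordered_map<std::string, std::size_t> fileIndex;
  for (std::size_t i = 0; i < fileNames.size(); ++i) {
    // emplace keeps the first occurrence if a name is duplicated.
    fileIndex.emplace(fileNames[i], i);
  }

  std::vector<std::optional<std::size_t>> mapping;
  mapping.reserve(requestedNames.size());
  for (const auto& name : requestedNames) {
    auto it = fileIndex.find(name);
    if (it == fileIndex.end()) {
      mapping.push_back(std::nullopt); // missing in file -> NULL
    } else {
      mapping.push_back(it->second);
    }
  }
  // One entry per requested subfield, in requested order.
  assert(mapping.size() == requestedNames.size());
  return mapping;
}
```

For the first table row, `matchSubfieldsByName({"a", "c"}, {"a", "b", "c"})` yields index 0, no match, index 1, which corresponds to `row(a_val, NULL, c_val)`. An empty request yields an empty mapping, matching `row()`.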

This PR is split out from #5962.

@rui-mo rui-mo requested a review from majetideepak as a code owner December 23, 2025 09:51

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 23, 2025
Contributor

@Yuhta Yuhta left a comment

The changes in the column reader look good. For the configuration in the Hive config and reader options, can we reuse the existing one?

static constexpr const char* kParquetUseColumnNamesSession =
"parquet_use_column_names";

static constexpr const char* kOrcAllowEnhancedSchemaEvolution =
Contributor

Just name it match_fields_by_name. Also, do we really want to differentiate this config by file format? Usually the whole data warehouse should have one setting, no matter which file format a particular partition uses. Otherwise metadata management would be very messy, and data migration would be almost impossible.

Also, how does it interact with orc_use_column_names and parquet_use_column_names? Do we want to deprecate these?

Collaborator Author

Thanks for the prompt review. I’ve updated this PR to reuse the useColumnNamesForColumnMapping configuration. Previously, subfield rename and deletion would raise errors. They are now supported when use_column_names mode is enabled.

Regarding file format differentiation, it has been observed that Spark behaves differently for Parquet and ORC formats (see #5962 (comment)). To ensure compatibility with both Presto and Spark when using the Parquet format, I propose adding an extra Parquet-specific configuration to handle this special NULL logic in the follow-up PR. For example, when reading a column of type row({"a", "c"}) with a schema of row({"b"}), the default result is row(null). With this configuration enabled, the result would instead be null. Does this approach make sense? Thanks.
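
To make the proposed difference concrete, here is a minimal sketch of the decision described above. It is only an illustration under assumptions: `readsAsNullStruct` and the flag `nullStructWhenNoFieldMatches` are hypothetical names, not an existing Velox option, and treating an empty requested row as non-null is also an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch: decides whether the whole struct is read as null
// (instead of a row of nulls) because no requested subfield exists in the file.
// 'nullStructWhenNoFieldMatches' stands in for the proposed Parquet-specific
// configuration; the name is illustrative only.
bool readsAsNullStruct(
    const std::vector<std::string>& fileNames,
    const std::vector<std::string>& requestedNames,
    bool nullStructWhenNoFieldMatches) {
  // Assumption: with the option off, or with an empty requested row, the
  // struct itself stays non-null (row(null, ...) or row()).
  if (!nullStructWhenNoFieldMatches || requestedNames.empty()) {
    return false;
  }
  const bool anyMatch = std::any_of(
      requestedNames.begin(),
      requestedNames.end(),
      [&](const std::string& name) {
        return std::find(fileNames.begin(), fileNames.end(), name) !=
            fileNames.end();
      });
  return !anyMatch;
}
```

With file subfields `{"a", "c"}` and requested `{"b"}`, the function returns false when the flag is off (default result `row(null)`) and true when it is on (result `null`).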

TypePtr fieldType = nullptr;
if (baseReaderOpts_.allowEnhancedSchemaEvolution()) {
auto outputTypeIdx =
readerOutputType_->getChildIdxIfExists(fieldName);
Contributor

We probably don't need this; readerOutputType_ is just a subset of tableSchema.

Collaborator Author

tableSchema may be nullptr in some cases, so I updated the logic to use readerOutputType_ for child access when tableSchema is unavailable. Please kindly let me know if this looks reasonable, thanks.
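
The fallback described here might look roughly like the sketch below. `RowSketch` and `resolveFieldType` are hypothetical stand-ins written for this illustration, with type names as plain strings; they are not the Velox `RowType` API or the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a row type: parallel lists of child names and
// child type names.
struct RowSketch {
  std::vector<std::string> names;
  std::vector<std::string> types;

  // Returns the type of the named child, or std::nullopt if it is absent.
  std::optional<std::string> childType(const std::string& name) const {
    assert(names.size() == types.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
      if (names[i] == name) {
        return types[i];
      }
    }
    return std::nullopt;
  }
};

// Looks the field up in tableSchema when it is available and falls back to
// readerOutputType when tableSchema is null.
std::optional<std::string> resolveFieldType(
    const std::shared_ptr<const RowSketch>& tableSchema,
    const std::shared_ptr<const RowSketch>& readerOutputType,
    const std::string& fieldName) {
  const auto& source = tableSchema ? tableSchema : readerOutputType;
  if (!source) {
    return std::nullopt; // neither schema is available
  }
  return source->childType(fieldName);
}
```

In this sketch a non-null tableSchema always takes precedence, so the fallback only changes behavior in the cases where tableSchema is unavailable.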

return useColumnNamesForColumnMapping_;
}

bool allowEnhancedSchemaEvolution() const {
Contributor

How about reusing useColumnNamesForColumnMapping()?

Collaborator Author

Updated, thanks.
