feat: Allow subfield rename and deletion for ORC format #15848

rui-mo · 2025-12-23T09:51:30Z

By default, schema evolution for row types matches subfields by index.
This PR adds support for subfield rename and deletion in the ORC file format, controlled by
the useColumnNamesForColumnMapping configuration.
Missing subfields are identified by matching the file type against the requested type on
subfield names, and NULL fills the position of each missing subfield.
The table below summarizes the results for different cases.

Column schema	Requested output schema	ORC result
row({"a", "c"})	row({"a", "b", "c"})	row(a_val, NULL, c_val)
row({"a", "c"})	row({"b"})	row(NULL)
row({"a", "c"})	row({"b", "d"})	row(NULL, NULL)
row({"a", "c"})	row({})	row()
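
The name-based matching summarized in the table can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Velox implementation: `matchSubfieldsByName` is a hypothetical helper, and `std::nullopt` stands in for a requested subfield that is absent from the file and would be filled with NULL.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of Velox. For each requested subfield name,
// returns the index of the same-named subfield in the file type, or
// std::nullopt when the file has no such subfield (read as NULL).
std::vector<std::optional<std::size_t>> matchSubfieldsByName(
    const std::vector<std::string>& fileNames,
    const std::vector<std::string>& requestedNames) {
  // Build a name -> position lookup for the file type.
  std::unordered_map<std::string, std::size_t> fileIndex;
  for (std::size_t i = 0; i < fileNames.size(); ++i) {
    fileIndex.emplace(fileNames[i], i); // emplace keeps the first occurrence
  }

  std::vector<std::optional<std::size_t>> mapping;
  mapping.reserve(requestedNames.size());
  for (const auto& name : requestedNames) {
    auto it = fileIndex.find(name);
    if (it == fileIndex.end()) {
      mapping.push_back(std::nullopt); // missing subfield
    } else {
      mapping.push_back(it->second); // read from this file position
    }
  }
  return mapping;
}
```

With file names `{"a", "c"}` and requested names `{"a", "b", "c"}`, the sketch maps `a` and `c` to their file positions and leaves `b` unmapped, which corresponds to the first row of the table.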

This PR is split out from #5962.

netlify · 2025-12-23T09:51:37Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`a3638d6`
🔍 Latest deploy log	https://app.netlify.com/projects/meta-velox/deploys/694cbf33da2e5a00089c0c4d

Yuhta

The changes in the column reader look good. For the configuration in the Hive config and reader options, can we reuse the existing one?

Yuhta · 2025-12-23T17:10:52Z

velox/connectors/hive/HiveConfig.h

  static constexpr const char* kParquetUseColumnNamesSession =
      "parquet_use_column_names";

+  static constexpr const char* kOrcAllowEnhancedSchemaEvolution =


Just name it match_fields_by_name. Also, do we really want to differentiate this config by file format? Usually the whole data warehouse should have one setting, no matter which file format is used in a particular partition. Otherwise metadata management would be very messy, and data migration would be almost impossible.

Also, how does it interact with orc_use_column_names and parquet_use_column_names? Do we want to deprecate these?

rui-mo · 2025-12-25

Thanks for the prompt review. I’ve updated this PR to reuse the useColumnNamesForColumnMapping configuration. Previously, subfield rename and deletion would raise errors. They are now supported when use_column_names mode is enabled.

Regarding file format differentiation, Spark has been observed to behave differently for the Parquet and ORC formats (see #5962 (comment)). To ensure compatibility with both Presto and Spark when using the Parquet format, I propose adding an extra Parquet-specific configuration in a follow-up PR to handle this special NULL logic. For example, when reading a column of type row({"a", "c"}) with a schema of row({"b"}), the default result is row(null). With this configuration enabled, the result would instead be null. Does this approach make sense? Thanks.
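
The proposed distinction between row(null) and null could be sketched as below. This is only an illustration under stated assumptions: `readRowAsNull` and its flag `nullRowWhenNoSubfieldMatches` are hypothetical names, no such configuration exists in this PR, and treating an empty requested row as "not null" is an assumption of the sketch rather than something stated in the discussion.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch, not Velox code. Returns true when the whole row should
// be read as NULL instead of as a row whose subfields are all NULL.
bool readRowAsNull(
    const std::vector<std::string>& fileNames,
    const std::vector<std::string>& requestedNames,
    bool nullRowWhenNoSubfieldMatches) {
  // Default behavior: always materialize the row, e.g. row(NULL).
  // Assumption: an empty requested row is also materialized as row().
  if (!nullRowWhenNoSubfieldMatches || requestedNames.empty()) {
    return false;
  }
  // With the flag on, the row is NULL only if no requested name is in the file.
  return std::none_of(
      requestedNames.begin(),
      requestedNames.end(),
      [&](const std::string& name) {
        return std::find(fileNames.begin(), fileNames.end(), name) !=
            fileNames.end();
      });
}
```

For file names `{"a", "c"}` and requested names `{"b"}`, the sketch returns false with the flag off (result row(null)) and true with the flag on (result null).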

Yuhta · 2025-12-23T17:18:43Z

velox/connectors/hive/SplitReader.cpp

+        TypePtr fieldType = nullptr;
+        if (baseReaderOpts_.allowEnhancedSchemaEvolution()) {
+          auto outputTypeIdx =
+              readerOutputType_->getChildIdxIfExists(fieldName);


We probably don't need this; readerOutputType_ is just a subset of tableSchema.

rui-mo · 2025-12-25

tableSchema may be nullptr in some cases, so I updated the logic to use readerOutputType_ for child access when tableSchema is unavailable. Please let me know if this looks reasonable, thanks.
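
The fallback described in this reply might look roughly like the sketch below. It does not use Velox types: `RowSchema`, `findType`, and `resolveFieldType` are hypothetical stand-ins, and types are represented as plain strings purely for illustration.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row type: ordered (name, type) pairs.
struct RowSchema {
  std::vector<std::pair<std::string, std::string>> fields;

  // Returns the type of the named field, or std::nullopt if it is absent.
  std::optional<std::string> findType(const std::string& name) const {
    for (const auto& [fieldName, fieldType] : fields) {
      if (fieldName == name) {
        return fieldType;
      }
    }
    return std::nullopt;
  }
};

// Looks the field up in tableSchema when it is available and falls back to
// readerOutputType when tableSchema is nullptr.
std::optional<std::string> resolveFieldType(
    const RowSchema* tableSchema,
    const RowSchema& readerOutputType,
    const std::string& name) {
  const RowSchema& schema =
      tableSchema != nullptr ? *tableSchema : readerOutputType;
  return schema.findType(name);
}
```

The sketch consults only one of the two schemas per lookup, so a field present in the table schema but not in the reader output type is found only when the table schema is available.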

Yuhta · 2025-12-23T17:19:27Z

velox/dwio/common/Options.h

    return useColumnNamesForColumnMapping_;
  }

+  bool allowEnhancedSchemaEvolution() const {


How about reusing useColumnNamesForColumnMapping()?

rui-mo · 2025-12-25

Updated, thanks.

rui-mo requested a review from majetideepak as a code owner December 23, 2025 09:51

meta-cla bot added the CLA Signed label Dec 23, 2025

Yuhta reviewed Dec 23, 2025

Allow rename and deletion of subfields

a3638d6

rui-mo force-pushed the wip_dwrf branch from d2b44a4 to a3638d6 December 25, 2025 04:36
