feat: Support struct schema evolution matching by name #5962
velox/dwio/common/Options.h (outdated):

    /**
     * Get the output type of row reader.
     */
    const RowTypePtr& getOutputType() const {
Requested type is available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep 2 copies of the same thing.
    }
    auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
    const auto& fieldName = childSpecs[i]->fieldName();
    if (outputType && !fileType_->containsChild(fieldName)) {
We need to decide what schema evolution strategy we want here. In our data warehouse, columns are matched by position, not by name, so any extra fields must be appended at the end of the children list. This allows column renaming. If we match by name here, we lose the renaming functionality, which seems quite important in most data warehouses.
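The trade-off described above can be sketched as follows. This is an illustration only, with hypothetical types (the real Velox row type is `velox::RowType`): position-based matching keeps resolving after a rename, while name-based matching does not.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a row type's ordered children (illustration
// only, not the Velox API).
struct RowSchema {
  std::vector<std::string> names;
};

// Position-based matching: requested child i reads file child i regardless
// of its name, so renaming a column in the table schema still resolves.
// Requested children past the end of the file schema are treated as missing.
std::optional<size_t> matchByPosition(const RowSchema& file, size_t requestedIdx) {
  return requestedIdx < file.names.size() ? std::optional<size_t>(requestedIdx)
                                          : std::nullopt;
}

// Name-based matching: a renamed column no longer resolves against old files.
std::optional<size_t> matchByName(const RowSchema& file, const std::string& name) {
  for (size_t i = 0; i < file.names.size(); ++i) {
    if (file.names[i] == name) {
      return i;
    }
  }
  return std::nullopt;
}
```

With a file written as row(a, b) and the table column `a` later renamed to `a2`, position-based matching still reads child 0, whereas name-based matching finds nothing for `a2`.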
Thanks for your comment. Does that mean that for a row(a, c) struct schema in Parquet, the expected output can only be of the form row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.
Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
How is field renaming conducted in the data warehouse you mentioned? In Spark, for a query like select a as b, it adds a projection node with an Alias expression after the scan.
And what do you suggest for the design? Should I add some notes to this PR, or something else?
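The Spark behavior mentioned above, where a rename surfaces as a projection over the scan rather than as a change to what the scan reads, can be sketched like this (hypothetical plan structures, not Spark's or Velox's actual plan nodes):

```cpp
#include <map>
#include <string>

// Hypothetical plan sketch: the scan reads columns by their stored names;
// "SELECT a AS b" only adds a relabeling Project on top of the scan, so the
// file schema is never consulted with the alias "b".
struct Project {
  std::map<std::string, std::string> aliasToInput;  // output alias -> scanned column
};

// Resolve which scanned column an output alias ultimately reads.
std::string resolve(const Project& p, const std::string& alias) {
  return p.aliasToInput.at(alias);
}
```

For `select a as b`, the projection maps output `b` back to scanned column `a`, so the reader's matching strategy is never exercised by the alias.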
With matching by name, you need to know all the old field names (a in your query) across all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to allow matching columns in different ways.
Thanks. That sounds good to me. I'll convert this PR to a draft for now.
@majetideepak Understood. I will take a look, thanks.
@majetideepak @Yuhta The prior refactor has been merged. Would you please take another look? Thanks!
Hi @Yuhta @majetideepak, could you please spare some time to continue reviewing this change? Thanks!
@Yuhta gentle ping |
@Yuhta gentle ping |
May I ask what is blocking the merge of this? @Yuhta This patch would also remove a blocker against building a Velox Parquet reader for Spark + Delta for another workload.
@Yuhta can you please take another look at this PR? |
This is a very big and risky change, and I don't think we can support it without an explicit flag to indicate that we want this mode. Let's start by putting all the changes behind a flag, and also break the work down into multiple PRs, along with proper tests for both formats.
Thanks @Yuhta. Let me follow up on that.
The default behavior of schema evolution for row types is matching by index.
This PR supports matching by name for the Parquet and ORC file formats. Missing
subfields are identified by matching the file type and requested type on the
names of the subfields, and null occupies the position of each missing subfield.
The table below summarizes the results for different cases and file formats.
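The null-filling behavior described above can be sketched as follows. Names and types here are illustrative (plain `std::optional` columns rather than Velox vectors), not the actual reader API:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

using Column = std::vector<std::optional<int64_t>>;  // nullopt models a null

// For each requested subfield, take the file's column when the name exists
// in the file schema; otherwise synthesize an all-null column of the same
// row count, mirroring the missing-subfield behavior described above.
std::vector<Column> evolveByName(
    const std::vector<std::string>& requestedNames,
    const std::map<std::string, Column>& fileColumns,
    size_t numRows) {
  std::vector<Column> out;
  for (const auto& name : requestedNames) {
    auto it = fileColumns.find(name);
    if (it != fileColumns.end()) {
      out.push_back(it->second);  // subfield present in the file: pass through
    } else {
      out.emplace_back(numRows, std::nullopt);  // missing subfield -> nulls
    }
  }
  return out;
}
```

For example, reading a file written as row(a) with a requested type of row(a, c) passes column `a` through and produces an all-null column in the position of `c`.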