Collaborator

@rui-mo rui-mo commented Aug 2, 2023

By default, schema evolution for row types matches subfields by index.
This PR adds matching by name for the Parquet and ORC file formats. Missing
subfields are identified by comparing the subfield names of the file type and
the requested type, and 'null' fills the position of each missing subfield.
The table below summarizes the results for different cases and file formats.

| File column schema | Requested output schema | Parquet result | ORC result |
| --- | --- | --- | --- |
| row({"a", "c"}) | row({"a", "b", "c"}) | row(a_val, null, c_val) | row(a_val, null, c_val) |
| row({"a", "c"}) | row({"b"}) | null | row(null) |
| row({"a", "c"}) | row({"b", "d"}) | null | row(null, null) |
| row({"a", "c"}) | row({}) | null | row() |
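
The name-matching behavior summarized in the table can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: `Value`, `Row`, `projectByName`, and the `nullWhenNoMatch` switch are hypothetical names, and the switch only models the Parquet/ORC difference seen in the last three rows of the table.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Velox types.
using Value = std::optional<std::string>; // std::nullopt models SQL null
using Row = std::vector<Value>;

// Projects one struct value read from a file onto the requested subfield
// names. Assumes fileNames and fileValues have the same length.
// Subfields absent from the file are filled with null.
// If no requested name exists in the file and nullWhenNoMatch is true, the
// whole struct is null (the Parquet column of the table); otherwise a row
// holding only nulls is returned (the ORC column).
std::optional<Row> projectByName(
    const std::vector<std::string>& fileNames,
    const Row& fileValues,
    const std::vector<std::string>& requestedNames,
    bool nullWhenNoMatch) {
  std::unordered_map<std::string, std::size_t> indexByName;
  for (std::size_t i = 0; i < fileNames.size(); ++i) {
    indexByName.emplace(fileNames[i], i);
  }

  Row out;
  out.reserve(requestedNames.size());
  bool anyMatch = false;
  for (const auto& name : requestedNames) {
    auto it = indexByName.find(name);
    if (it == indexByName.end()) {
      out.push_back(std::nullopt); // missing subfield: null placeholder
    } else {
      out.push_back(fileValues[it->second]);
      anyMatch = true;
    }
  }

  if (!anyMatch && nullWhenNoMatch) {
    return std::nullopt;
  }
  return out;
}
```

Under these assumptions, requesting `{"a", "b", "c"}` from a file struct with subfields `{"a", "c"}` yields the file's `a` and `c` values with a null between them, for either setting of the switch.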

netlify bot commented Aug 2, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 1ccda8f
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/692fd2fd9160ce000821310b

@facebook-github-bot facebook-github-bot added the CLA Signed label Aug 2, 2023
@Yuhta Yuhta self-requested a review August 2, 2023 17:44
/**
* Get the output type of row reader.
*/
const RowTypePtr& getOutputType() const {
Contributor

Requested type is available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep 2 copies of the same thing.

}
auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
const auto& fieldName = childSpecs[i]->fieldName();
if (outputType && !fileType_->containsChild(fieldName)) {
Contributor

We need to decide which schema evolution strategy we want here. In our data warehouse, columns are matched by position rather than by name, so any extra fields must be added at the end of the children list. This allows column renaming. If we match by name here, we lose the renaming functionality, which seems quite important in most data warehouses.
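
For illustration, positional matching as described in this comment might look like the sketch below. It is a standalone example under assumed names (`Cell`, `Cells`, `projectByPosition`), not Velox code. Because names are never consulted, a subfield renamed in the table schema still reads the old file's value at the same position, and subfields appended later come back as null.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Velox types.
using Cell = std::optional<std::string>; // std::nullopt models SQL null
using Cells = std::vector<Cell>;

// Matches children by position: requested child i reads file child i,
// regardless of names. Requested children past the end of the file's
// children (fields appended after the file was written) stay null.
Cells projectByPosition(const Cells& fileValues, std::size_t requestedCount) {
  Cells out(requestedCount); // every element starts as null
  for (std::size_t i = 0; i < requestedCount && i < fileValues.size(); ++i) {
    out[i] = fileValues[i];
  }
  return out;
}
```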

Collaborator Author

Thanks for your comment. Does that mean for a row(a, c) struct schema in parquet, the expected output can only be like row(a, c, xxx, ...)? In Spark, there is no such limitation to extra child fields.

Contributor

Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way, matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
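
One possible shape for a design covering the three cases is a single resolver parameterized by the matching mode. The sketch below is only an assumption about how that could look: `ColumnMatchMode`, `FieldInfo`, and `resolveChild` are hypothetical names and do not exist in Velox.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical names for illustration; none of these exist in Velox.
enum class ColumnMatchMode { kByIndex, kByName, kByFieldId };

struct FieldInfo {
  std::string name;
  int fieldId; // e.g. an Iceberg-style field ID kept in file metadata
};

// Returns the index of the file child that feeds requested child
// requestedIndex, or std::nullopt when the child is missing from the file
// and has to be filled with nulls.
std::optional<std::size_t> resolveChild(
    ColumnMatchMode mode,
    const std::vector<FieldInfo>& fileFields,
    const std::vector<FieldInfo>& requestedFields,
    std::size_t requestedIndex) {
  const FieldInfo& wanted = requestedFields[requestedIndex];
  if (mode == ColumnMatchMode::kByIndex) {
    if (requestedIndex < fileFields.size()) {
      return requestedIndex;
    }
    return std::nullopt;
  }
  for (std::size_t i = 0; i < fileFields.size(); ++i) {
    const bool match = mode == ColumnMatchMode::kByName
        ? fileFields[i].name == wanted.name
        : fileFields[i].fieldId == wanted.fieldId;
    if (match) {
      return i;
    }
  }
  return std::nullopt;
}
```

With a file written as `(a, c)` and a table later evolved to `(x, b, c)`, where `x` is a rename of `a`, only the index and field-ID modes still find the renamed column, while only the name and field-ID modes tolerate `b` being inserted in the middle.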

Collaborator Author

@rui-mo rui-mo Aug 7, 2023

How is field renaming handled in the data warehouse you mentioned? In Spark, a query like select a as b adds a projection node with an Alias expression after the scan.
And what do you suggest for the design? Should I add some notes to this PR, or do something else?

Contributor

With matching by name you need to know all the old field names (a in your query) in all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and design the right way to allow matching columns in different ways first.

Collaborator Author

Thanks. That sounds good to me. I will convert this PR to a draft for now.

@rui-mo rui-mo changed the title from "Support struct column reading with different schemas" to "[GLUTEN] Support struct column reading with different schemas" Aug 4, 2023
@rui-mo rui-mo marked this pull request as draft August 9, 2023 02:52
@rui-mo rui-mo changed the title from "[GLUTEN] Support struct column reading with different schemas" to "Support struct column reading with different schemas" Aug 28, 2023
@rui-mo rui-mo force-pushed the wip_struct branch 3 times, most recently from c03152b to c8c5132 Compare September 5, 2023 05:28
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 2168dc9 to fda6ff8 Compare October 13, 2023 02:33
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from a8174d3 to 7abb820 Compare November 7, 2023 01:44
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from d307831 to 0364f89 Compare January 26, 2024 02:54
@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from e7eab9e to 1021b22 Compare April 2, 2024 05:13
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 2, 2024
marin-ma pushed a commit to oap-project/velox that referenced this pull request Apr 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 4, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Apr 5, 2024
Collaborator Author

rui-mo commented Mar 25, 2025

@majetideepak Understood. I will take a look, thanks.

@rui-mo rui-mo force-pushed the wip_struct branch 3 times, most recently from 3ce8a94 to 7350871 Compare May 31, 2025 01:51
Collaborator Author

rui-mo commented Jun 4, 2025

@majetideepak @Yuhta The prior refactor has been merged. Would you please take another look? Thanks!

Collaborator Author

rui-mo commented Jul 3, 2025

Hi @Yuhta @majetideepak, could you please spare some time to continue the review process of this change? Thanks!

@zhouyuan
Collaborator

@Yuhta gentle ping

@rui-mo rui-mo force-pushed the wip_struct branch 2 times, most recently from 2e5fa85 to f1889e5 Compare July 30, 2025 09:52
@zhouyuan
Collaborator

@Yuhta gentle ping

@zhztheplayer
Collaborator

May I ask what is blocking this from being merged? @Yuhta

The patch could also remove a blocker for building a Velox Parquet reader for Spark + Delta for another workload.

Collaborator Author

rui-mo commented Sep 3, 2025

cc: @pedroerp @FelixYBW This is essential for Gluten. I’d be glad to follow up on any additional comments. Thanks!

@majetideepak
Collaborator

@Yuhta can you please take another look at this PR?

Contributor

@Yuhta Yuhta left a comment

This is a very big and risky change, and I don't think we can support it without an explicit flag to indicate we want this mode. Let's start by putting all the changes behind a flag, and also breaking the change down into multiple PRs, along with proper tests for both formats.

Collaborator Author

rui-mo commented Dec 5, 2025

Thanks @Yuhta. Let me follow up on that.
