feat: Support struct schema evolution matching by name #5962
velox/dwio/common/Options.h (outdated):

    /**
     * Get the output type of row reader.
     */
    const RowTypePtr& getOutputType() const {
Requested type is available as getSelector()->getSchemaWithId()->type. We may want to convert it to a type directly in the future, but for now let's not keep 2 copies of the same thing.
    }
    auto childDataType = fileType_->childByName(childSpecs[i]->fieldName());
    const auto& fieldName = childSpecs[i]->fieldName();
    if (outputType && !fileType_->containsChild(fieldName)) {
We need to decide what schema evolution strategy we want here. In our data warehouse, columns are matched by position, not by name, so any extra fields must be appended at the end of the children list. This allows column renaming. If we match by name here, we lose the renaming functionality, which seems quite important in most data warehouses.
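The trade-off described above can be sketched as follows. This is an illustration only, with hypothetical types (the real Velox row type is `velox::RowType`): position-based matching keeps resolving after a rename, while name-based matching does not.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a row type's ordered children (illustration
// only, not the Velox API).
struct RowSchema {
  std::vector<std::string> names;
};

// Position-based matching: requested child i reads file child i regardless
// of its name, so renaming a column in the table schema still resolves.
// Requested children past the end of the file schema are treated as missing.
std::optional<size_t> matchByPosition(const RowSchema& file, size_t requestedIdx) {
  return requestedIdx < file.names.size() ? std::optional<size_t>(requestedIdx)
                                          : std::nullopt;
}

// Name-based matching: a renamed column no longer resolves against old files.
std::optional<size_t> matchByName(const RowSchema& file, const std::string& name) {
  for (size_t i = 0; i < file.names.size(); ++i) {
    if (file.names[i] == name) {
      return i;
    }
  }
  return std::nullopt;
}
```

With a file written as row(a, b) and the table column `a` later renamed to `a2`, position-based matching still reads child 0, whereas name-based matching finds nothing for `a2`.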
Thanks for your comment. Does that mean that for a row(a, c) struct schema in Parquet, the expected output can only be of the form row(a, c, xxx, ...)? In Spark, there is no such limitation on extra child fields.
Yes, new subfields can only be appended. So in plain vanilla Spark, field renaming is not supported? There is also a third way: matching by field ID (e.g. Iceberg). We need to start drafting a design that covers all three cases.
How is field renaming conducted in the data warehouse you mentioned? In Spark, for a query like select a as b, it adds a projection node with an Alias expression after the scan.
And what do you suggest for the design? Should I add some notes to this PR, or something else?
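The Spark behavior mentioned above, where a rename surfaces as a projection over the scan rather than as a change to what the scan reads, can be sketched like this (hypothetical plan structures, not Spark's or Velox's actual plan nodes):

```cpp
#include <map>
#include <string>

// Hypothetical plan sketch: the scan reads columns by their stored names;
// "SELECT a AS b" only adds a relabeling Project on top of the scan, so the
// file schema is never consulted with the alias "b".
struct Project {
  std::map<std::string, std::string> aliasToInput;  // output alias -> scanned column
};

// Resolve which scanned column an output alias ultimately reads.
std::string resolve(const Project& p, const std::string& alias) {
  return p.aliasToInput.at(alias);
}
```

For `select a as b`, the projection maps output `b` back to scanned column `a`, so the reader's matching strategy is never exercised by the alias.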
With matching by name, you need to know all the old field names (a in your query) across all old files, which is not practical in a normal data warehouse. I would suggest we pause this PR for a bit and first design the right way to allow matching columns in different ways.
Thanks. That sounds good to me. I'll convert this PR to a draft for now.
@majetideepak Understood. I will take a look, thanks.
@majetideepak @Yuhta The prior refactor has been merged. Would you please take another look? Thanks!
Hi @Yuhta @majetideepak, could you please spare some time to continue reviewing this change? Thanks!
@Yuhta gentle ping |
@Yuhta gentle ping |
May I ask what is blocking the merge of this? @Yuhta This patch would also remove a blocker against building a Velox Parquet reader for Spark + Delta for another workload.
@Yuhta can you please take another look at this PR? |
This is a very big and risky change, and I don't think we can support it without an explicit flag to indicate that we want this mode. Let's start by putting all the changes behind a flag, and also break the work down into multiple PRs, along with proper tests for both formats.
Thanks @Yuhta. Let me follow up on that.
The default behavior of schema evolution for row types is matching by index.
This PR supports matching by name for the Parquet and ORC file formats. Missing
subfields are identified by matching the file type and requested type on the
names of the subfields, and null occupies the position of each missing subfield.
The table below summarizes the results for different cases and file formats.
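The null-filling behavior described above can be sketched as follows. Names and types here are illustrative (plain `std::optional` columns rather than Velox vectors), not the actual reader API:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

using Column = std::vector<std::optional<int64_t>>;  // nullopt models a null

// For each requested subfield, take the file's column when the name exists
// in the file schema; otherwise synthesize an all-null column of the same
// row count, mirroring the missing-subfield behavior described above.
std::vector<Column> evolveByName(
    const std::vector<std::string>& requestedNames,
    const std::map<std::string, Column>& fileColumns,
    size_t numRows) {
  std::vector<Column> out;
  for (const auto& name : requestedNames) {
    auto it = fileColumns.find(name);
    if (it != fileColumns.end()) {
      out.push_back(it->second);  // subfield present in the file: pass through
    } else {
      out.emplace_back(numRows, std::nullopt);  // missing subfield -> nulls
    }
  }
  return out;
}
```

For example, reading a file written as row(a) with a requested type of row(a, c) passes column `a` through and produces an all-null column in the position of `c`.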