feat: Add basic partition pruning support #713
Conversation
Codecov Report

@@ Coverage Diff @@
##             main     #713      +/-   ##
==========================================
+ Coverage   84.25%   84.29%   +0.04%
==========================================
  Files          77       77
  Lines       19051    19099      +48
==========================================
+ Hits        16051    16100      +49
+ Misses       2202     2201       -1
  Partials      798      798
nicklan
left a comment
Sorry for the loooong delay on reviewing this!
Looks great. Nice how simple it was actually to fit it in once you found the right shape.
Looks great, thanks! Just a couple quick things.
And for the TODOs in the description: it sounds like the first is sufficiently covered by #712. Should we make an issue or two for the `stats_parsed` and `partitionValues_parsed` additions?
After offline discussion, we will deal with the schema-change issue as a follow-up, because it already existed with stats parsing and nobody seems to have hit the corner case yet.
What changes are proposed in this pull request?
Add basic support for partition pruning by combining two pieces of existing infrastructure.
Result: partition pruning gets applied during log replay, just before deduplication so we don't have to remember pruned files.
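As a minimal sketch of what this pruning step amounts to (the type and function names below are hypothetical, not the actual delta-kernel-rs API): filter each add action's `partitionValues` map against the partition predicate before the file reaches deduplication, keeping files conservatively when no value is recorded.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Delta `add` action; not the real kernel type.
struct AddAction {
    path: String,
    partition_values: HashMap<String, String>,
}

/// Keep only files whose value for partition column `col` satisfies `pred`.
/// A file with no recorded value for `col` is conservatively kept.
fn prune(adds: Vec<AddAction>, col: &str, pred: impl Fn(&str) -> bool) -> Vec<AddAction> {
    adds.into_iter()
        .filter(|a| a.partition_values.get(col).map_or(true, |v| pred(v)))
        .collect()
}

fn main() {
    let adds = vec![
        AddAction {
            path: "part-0.parquet".into(),
            partition_values: HashMap::from([("date".to_string(), "2024-01-01".to_string())]),
        },
        AddAction {
            path: "part-1.parquet".into(),
            partition_values: HashMap::from([("date".to_string(), "2023-12-31".to_string())]),
        },
    ];
    // Predicate: date >= '2024-01-01'; lexicographic compare works for ISO dates.
    let kept = prune(adds, "date", |v| v >= "2024-01-01");
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].path, "part-0.parquet");
    println!("kept {} file(s)", kept.len());
}
```

Because the filter runs before deduplication, a pruned file simply never enters the seen-set, which is why no extra bookkeeping is needed.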
WARNING: The implementation currently has a flaw when the table history contains a table-replace that affected partition columns: for example, changing a value column into a non-nullable partition column, or an incompatible type change to a partition column. In such cases, the remove actions generated by the table-replace operation (for old files) would have the wrong type, or even lack the value entirely. While the code can handle an absent partition value, an incompatibly typed value would cause a parsing error that fails the whole query. Note that stats-based data skipping already has the same flaw, so we are not making the problem worse; we will fix both as a follow-up, tracked by #712.
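The absent-vs-incompatible distinction above can be sketched as follows (again hypothetical names, assuming a simplified scalar type; the real kernel's is richer): an absent value degrades gracefully to null, while a value whose string no longer parses as the column's type surfaces as an error that would currently fail the query.

```rust
/// Hypothetical scalar type; illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i64),
    String(String),
    Null,
}

/// Try to convert a raw partition-value string to the column's expected type.
/// An absent value maps to null, but an incompatibly typed value (e.g. a
/// non-numeric string for an integer column) is an error.
fn parse_partition_value(raw: Option<&str>, want_integer: bool) -> Result<Scalar, String> {
    match raw {
        None => Ok(Scalar::Null),
        Some(s) if want_integer => s
            .parse::<i64>()
            .map(Scalar::Integer)
            .map_err(|e| format!("cannot cast {s:?} to integer: {e}")),
        Some(s) => Ok(Scalar::String(s.to_owned())),
    }
}

fn main() {
    assert_eq!(parse_partition_value(Some("42"), true), Ok(Scalar::Integer(42)));
    assert_eq!(parse_partition_value(None, true), Ok(Scalar::Null));
    // The corner case from the warning: a value whose type no longer matches.
    assert!(parse_partition_value(Some("not-a-number"), true).is_err());
}
```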
NOTE: While this is a convenient way to achieve partition pruning in the immediate term, Delta checkpoints can provide strongly-typed `stats_parsed` and `partitionValues_parsed` columns, which would require a completely different access path. For `stats` vs. `stats_parsed`, the likely solution is simple enough, because we already JSON-parse `stats` into a strongly-typed nested struct in order to evaluate the data skipping predicate over its record batch; we would just avoid the parsing overhead when `stats_parsed` is already available. The `partitionValues` field poses a bigger challenge, because it's a string-string map, not a JSON literal. In order to turn it into a strongly-typed nested struct, we would need a SQL expression that can extract the string values and try-cast them to the desired types. That's ugly enough that we might prefer to keep completely different code paths for parsed vs. string partition values, but then there's a risk that partition pruning behavior changes depending on which path got invoked.
How was this change tested?
New unit tests, and adjusted one unit test that assumed no partition pruning.