feat: Add basic partition pruning support #713
Conversation
Codecov Report

@@ Coverage Diff @@
##             main     #713      +/-   ##
==========================================
+ Coverage   84.25%   84.29%   +0.04%
==========================================
  Files          77       77
  Lines       19051    19099      +48
==========================================
+ Hits        16051    16100      +49
+ Misses       2202     2201       -1
  Partials      798      798
nicklan
left a comment
Sorry for the loooong delay on reviewing this!
Looks great. Nice how simple it was actually to fit it in once you found the right shape.
Looks great, thanks! Just a couple quick things.
And for the TODOs in the description: it sounds like the first is sufficiently covered by #712. Should we make an issue or two for the `stats_parsed` and `partitionValues_parsed` additions?
After offline discussion, we will deal with the schema-change issue as a follow-up, because it already existed with stats parsing and nobody seems to have hit the corner case yet.
What changes are proposed in this pull request?
Add basic support for partition pruning by combining two pieces of existing infrastructure.
Result: partition pruning gets applied during log replay, just before deduplication so we don't have to remember pruned files.
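As a minimal sketch of what this pruning step amounts to (the type and function names below are hypothetical, not the actual delta-kernel-rs API): filter each add action's `partitionValues` map against the partition predicate before the file reaches deduplication, keeping files conservatively when no value is recorded.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a Delta `add` action; not the real kernel type.
struct AddAction {
    path: String,
    partition_values: HashMap<String, String>,
}

/// Keep only files whose value for partition column `col` satisfies `pred`.
/// A file with no recorded value for `col` is conservatively kept.
fn prune(adds: Vec<AddAction>, col: &str, pred: impl Fn(&str) -> bool) -> Vec<AddAction> {
    adds.into_iter()
        .filter(|a| a.partition_values.get(col).map_or(true, |v| pred(v)))
        .collect()
}

fn main() {
    let adds = vec![
        AddAction {
            path: "part-0.parquet".into(),
            partition_values: HashMap::from([("date".to_string(), "2024-01-01".to_string())]),
        },
        AddAction {
            path: "part-1.parquet".into(),
            partition_values: HashMap::from([("date".to_string(), "2023-12-31".to_string())]),
        },
    ];
    // Predicate: date >= '2024-01-01'; lexicographic compare works for ISO dates.
    let kept = prune(adds, "date", |v| v >= "2024-01-01");
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].path, "part-0.parquet");
    println!("kept {} file(s)", kept.len());
}
```

Because the filter runs before deduplication, a pruned file simply never enters the seen-set, which is why no extra bookkeeping is needed.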
WARNING: The implementation currently has a flaw when the table history contains a table-replace that affected partition columns: for example, changing a value column into a non-nullable partition column, or an incompatible type change to a partition column. In such cases, the remove actions generated by the table-replace operation (for old files) would have the wrong type, or even lack the value entirely. While the code can handle an absent partition value, an incompatibly typed value would cause a parsing error that fails the whole query. Note that stats-based data skipping already has the same flaw, so we are not making the problem worse; we will fix both as a follow-up, tracked by #712.
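The absent-vs-incompatible distinction above can be sketched as follows (again hypothetical names, assuming a simplified scalar type; the real kernel's is richer): an absent value degrades gracefully to null, while a value whose string no longer parses as the column's type surfaces as an error that would currently fail the query.

```rust
/// Hypothetical scalar type; illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i64),
    String(String),
    Null,
}

/// Try to convert a raw partition-value string to the column's expected type.
/// An absent value maps to null, but an incompatibly typed value (e.g. a
/// non-numeric string for an integer column) is an error.
fn parse_partition_value(raw: Option<&str>, want_integer: bool) -> Result<Scalar, String> {
    match raw {
        None => Ok(Scalar::Null),
        Some(s) if want_integer => s
            .parse::<i64>()
            .map(Scalar::Integer)
            .map_err(|e| format!("cannot cast {s:?} to integer: {e}")),
        Some(s) => Ok(Scalar::String(s.to_owned())),
    }
}

fn main() {
    assert_eq!(parse_partition_value(Some("42"), true), Ok(Scalar::Integer(42)));
    assert_eq!(parse_partition_value(None, true), Ok(Scalar::Null));
    // The corner case from the warning: a value whose type no longer matches.
    assert!(parse_partition_value(Some("not-a-number"), true).is_err());
}
```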
NOTE: While this is a convenient way to achieve partition pruning in the immediate term, Delta checkpoints can provide strongly-typed `stats_parsed` and `partitionValues_parsed` columns, which would require a completely different access path. For `stats` vs. `stats_parsed`, the likely solution is simple enough, because we already JSON-parse `stats` into a strongly-typed nested struct in order to evaluate the data skipping predicate over its record batch; we would just avoid the parsing overhead when `stats_parsed` is already available. The `partitionValues` field poses a bigger challenge, because it's a string-string map, not a JSON literal. In order to turn it into a strongly-typed nested struct, we would need a SQL expression that can extract the string values and try-cast them to the desired types. That's ugly enough that we might prefer to keep completely different code paths for parsed vs. string partition values, but then there's a risk that partition pruning behavior changes depending on which path got invoked.
How was this change tested?
New unit tests, and adjusted one unit test that assumed no partition pruning.