CookiePieWw (Contributor) commented Jun 14, 2025

Which issue does this PR close?

Rationale for this change

Currently, DataFusion treats all max and min values in column statistics as exact, even though some of them may be inexact.

What changes are included in this PR?

For each row group, when the max or min value is calculated, its corresponding exactness flag is retrieved as well. The exactness associated with the final max or min value becomes the final exactness flag. The max and min stats are then wrapped in `Inexact` or `Exact` according to that final flag.
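
As a rough illustration of the idea (not the code in this PR), the sketch below folds per-row-group max values and their exactness flags into a single result. The `Precision` enum here is a simplified local stand-in for DataFusion's type of the same name, and `combine_max` and its `(value, is_exact)` input shape are hypothetical names chosen for this example.

```rust
/// Simplified local stand-in for a statistic tagged with its exactness.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
}

/// Hypothetical helper: combine per-row-group `(max, is_exact)` pairs
/// into one overall max that carries an exactness flag.
fn combine_max(row_groups: &[(i64, bool)]) -> Option<Precision<i64>> {
    // Overall max across all row groups; `None` if there are no row groups.
    let overall = row_groups.iter().map(|(value, _)| *value).max()?;

    // The result is exact if at least one row group that reports the
    // overall max also flags that value as exact.
    let exact = row_groups
        .iter()
        .any(|(value, is_exact)| *value == overall && *is_exact);

    Some(if exact {
        Precision::Exact(overall)
    } else {
        Precision::Inexact(overall)
    })
}
```

Under these assumptions, an inexact winner such as `(42, false)` yields `Inexact(42)`, while the same value reported exactly by another row group yields `Exact(42)`. The min case would be symmetric.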

Are these changes tested?

Are there any user-facing changes?

DataFusion now correctly reports the exactness of column max and min values.

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Jun 14, 2025
@CookiePieWw changed the title from "fix: respect inexact flags in row group metadata" to "[WIP] fix: respect inexact flags in row group metadata" Jun 14, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions bot added the Stale (PR has not had any activity for some time) label Aug 14, 2025
@CookiePieWw force-pushed the respect-row-group-exactness-flags branch 2 times, most recently from 27eeff3 to 27b4595, August 14, 2025 15:16
@github-actions bot added the functions (Changes to functions implementation) label Aug 14, 2025
@CookiePieWw changed the title from "[WIP] fix: respect inexact flags in row group metadata" to "fix: respect inexact flags in row group metadata" Aug 14, 2025
@CookiePieWw force-pushed the respect-row-group-exactness-flags branch 2 times, most recently from 1eaac41 to c43f1de, August 15, 2025 07:33
@CookiePieWw
Contributor Author

Hi @alamb, this PR tries to extract the exactness flags from the row group metadata. Could you please take a look? :)

@CookiePieWw force-pushed the respect-row-group-exactness-flags branch from c43f1de to bf10479, August 15, 2025 08:01
/// The value `0` appears at indices `[0, 2, 4]`. The corresponding exactness
/// values are `[true, false, false]`. Since at least one is `true`, the
/// function returns `Some(true)`.
fn has_any_exact_match(
Member

Do we have a test for this?

Contributor Author

I updated a unit test to cover the 4 possible scenarios, and also used a struct to keep clippy happy. PTAL :)
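
For readers without the diff at hand, the behaviour described in the doc comment quoted earlier in this thread can be sketched roughly as follows. This is only an illustration: the name `any_exact_match`, the plain-slice parameters, and the `None` result when the value is absent are assumptions made for this example, not the signature of `has_any_exact_match` in the PR.

```rust
/// Hypothetical stand-in for the helper under review.
/// Returns `None` if `target` never occurs in `values` (an assumption),
/// `Some(true)` if any occurrence is flagged exact, otherwise `Some(false)`.
fn any_exact_match(target: i64, values: &[i64], exactness: &[bool]) -> Option<bool> {
    let mut found = false;
    for (value, is_exact) in values.iter().zip(exactness.iter()) {
        if *value == target {
            if *is_exact {
                // One exact occurrence is enough to call the result exact.
                return Some(true);
            }
            found = true;
        }
    }
    // Either every occurrence was inexact, or the value never appeared.
    if found { Some(false) } else { None }
}
```

With `values = [0, 1, 0, 3, 0, 5]`, the value `0` sits at indices `[0, 2, 4]`. If the flags at those indices are `[true, false, false]`, the sketch returns `Some(true)`, matching the example in the doc comment.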

Member

@xudong963 xudong963 left a comment

Thank you, this is a good finding and nice fix!

@github-actions bot removed the Stale (PR has not had any activity for some time) label Aug 16, 2025
@alamb alamb merged commit afc90f7 into apache:main Aug 18, 2025
27 checks passed
@alamb
Contributor

alamb commented Aug 18, 2025

Thank you @xudong963 and @CookiePieWw

Labels
core Core DataFusion crate datasource Changes to the datasource crate functions Changes to functions implementation
Development

Successfully merging this pull request may close these issues.

Treat truncated parquet stats as inexact
3 participants