Skip to content

[Iceberg]Support bucket transform for TimeType#24829

Merged
hantangwangd merged 1 commit into
prestodb:masterfrom
hantangwangd:support_bucket_transform_for_time
Apr 2, 2025
Merged

[Iceberg]Support bucket transform for TimeType#24829
hantangwangd merged 1 commit into
prestodb:masterfrom
hantangwangd:support_bucket_transform_for_time

Conversation

@hantangwangd

@hantangwangd hantangwangd commented Mar 30, 2025

Copy link
Copy Markdown
Member

Description

In the Iceberg specification, columns of type TimeType support being specified bucket transforms. This PR support bucket transform for columns of type TimeType in presto.

Motivation and Context

Support as many types as possible for partition transforms according to Iceberg specification.

Impact

We can now use bucket transform on TimeType columns in Iceberg table.

Test Plan

  • Newly added test case for testing the bucket transform set and used on TimeType columns of the Iceberg table.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Support bucket transform for columns of type `TimeType` in Iceberg table 

@hantangwangd hantangwangd force-pushed the support_bucket_transform_for_time branch from a5bfea8 to a842dc2 Compare March 31, 2025 00:59
@hantangwangd hantangwangd force-pushed the support_bucket_transform_for_time branch from a842dc2 to 3565011 Compare April 1, 2025 14:07
@hantangwangd hantangwangd marked this pull request as ready for review April 1, 2025 15:51
@hantangwangd hantangwangd requested a review from jaystarshot April 1, 2025 15:51

@steveburnett steveburnett left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull branch, local doc build. Thanks!

@steveburnett

Copy link
Copy Markdown
Contributor

Thanks for the release note entry! Minor nits:

== RELEASE NOTES ==

Iceberg Connector Changes
* Add support for bucket transform for columns of type ``TimeType`` in Iceberg tables.

@ZacBlanco ZacBlanco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one question

Comment thread presto-docs/src/main/sphinx/connector/iceberg.rst

private static Block bucketTime(Block block, int count)
{
return bucketBlock(block, count, position -> bucketHash(MILLISECONDS.toMicros(TIME.getLong(block, position))));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm my understanding, we need to do this MILLISECONDS.toMicros here because the bucketing needs to be compatible with the bucketing of other systems? Is ther anything else stopping us from just doing bucketHash directly on the TIME.getLong(...)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be compatible with the bucketing of other systems is one of the reason, but the more important reason is to be compatible with Iceberg's file planning with filter. When we query the partitioned table with a filter on column of type TimeType as follows:

select * from test_bucket_transform_on_time where a = time '01:02:03.123';

We will actually first use the predicate a = time '01:02:03.123' to execute Iceberg's file planning. Iceberg will use this time value internally to calculate partition values, and then scan some specific partitions, if the calculated partition value is not compatible, we won't find the corresponding data. So we have to make sure that the calculation of partition value is exactly the same as Iceberg lib, that is, use the micro seconds value to calculate the partition value.

So if we doing bucketHash directly on the TIME.getLong(...), we will find that the example query above get an empty result.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was roughly my thinking. Thanks for confirming

@hantangwangd hantangwangd merged commit 43a1662 into prestodb:master Apr 2, 2025
@hantangwangd hantangwangd deleted the support_bucket_transform_for_time branch April 2, 2025 06:00
@ZacBlanco ZacBlanco mentioned this pull request May 29, 2025
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants