-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata
#17127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata
#17127
Conversation
DFParquetMetadata
8c2a99f
to
d993b04
Compare
Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()), | ||
) | ||
.await?; | ||
let file_metadata_cache = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shows the key API difference -- instead of calling a bunch of free functions, you now construct a DFParquetMetadata
and call methods on that struct instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks way cleaner now.
// Increases by 3 because cache has no entries yet | ||
fetch_parquet_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new struct makes it much clearer what is being tested vs what is test setup functionality and I find the updated tests to be much easier to read
@@ -306,30 +301,6 @@ fn clear_metadata( | |||
}) | |||
} | |||
|
|||
async fn fetch_schema_with_location( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this PR is moving code in this module into metadata.rs
@@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> { | |||
/// through [`ParquetFileReaderFactory`]. | |||
/// | |||
/// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory | |||
pub async fn fetch_parquet_metadata<F: MetadataFetch>( | |||
fetch: F, | |||
#[deprecated( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left all the existing public APIs and deprecated them, and updated them to call the new DFParquetMetadata structure
@@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized( | |||
Ok(file_metadata) | |||
} | |||
|
|||
/// Min/max aggregation can take Dictionary encode input but always produces unpacked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite please that most of the statistics handling is now consolidated into its own module
file_meta.object_meta.location, | ||
)) | ||
}) | ||
// TODO should there be metadata prefetch hint here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metadata prefetch hint isn't passed here (it isn't on main
either) but this refactor leads me to believe it might be helpful to do so 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a user's perspective, I think it makes sense that the metadata prefetch option should apply everywhere metadata is fetched. It can be quite confusing when you change an option and either see no change at all (positive, negative, system resource usage etc.), or perhaps even worse, inconsistent change based on a specific workflow (e.g. "Why do queries for table X use twice the network hops, but table Y uses 50% more bandwidth?")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, in theory it should be controlled by a config option: https://datafusion.apache.org/user-guide/configs.html
datafusion.execution.parquet.metadata_size_hint | NULL |
---|
I haven't traced down why that one is not used here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time I only set the hint in inner = inner.with_footer_size_hint(hint)
, and then in get_metadata
we would read it like so: reader.try_load(&mut self.inner, object_meta.size).await?;
. Yes its better if we pass it to DFParquetMetadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a change / regression in this PR, so can we open an issue to follow up and let it go here?
I do agree it should be passed down from the config at least in ListingTable and other "default" uses that have access to the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d993b04
to
ef90d05
Compare
LGTM, its a much cleaner API. |
…_metadata_handling
I merged up to fix some conflicts, largely caused by |
@adriangb I wonder if you might be able to review this PR? @jonathanc-n and @nuno-faria have already approved it but neither of them are committers so I can't merge this PR yet unfortunately |
Yes will put it on my queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! Some nits and a request for a followup issue.
|
||
let fetch = ObjectStoreFetch::new(*store, object_meta); | ||
|
||
// implementation to fetch parquet metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not reviewing the implementation, I assume it was largely copied over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was entirely copied over
} | ||
|
||
if cache_metadata && file_metadata_cache.is_some() { | ||
// Need to retrieve the entire metadata for the caching to be effective. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that even if I have page indexes disabled if I use a metadata cache it will still retrieve (and decode?) them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is exactly what it it means, which is confusing.
/// Read and parse the schema of the Parquet file | ||
pub async fn fetch_schema(&self) -> Result<Schema> { | ||
let metadata = self.fetch_metadata().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to question above: how much work is fetching the schema? Does it also fetch row group stats? Page indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I am not 100% sure, but I am to find out with @BlakeOrth
Ok((loc_path, schema)) | ||
} | ||
|
||
pub async fn fetch_statistics(&self, table_schema: &SchemaRef) -> Result<Statistics> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a docstring, details about what this operation involves
file_meta.object_meta.location, | ||
)) | ||
}) | ||
// TODO should there be metadata prefetch hint here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a change / regression in this PR, so can we open an issue to follow up and let it go here?
I do agree it should be passed down from the config at least in ListingTable and other "default" uses that have access to the config.
…_metadata_handling
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
…alamb/datafusion into alamb/extract_parquet_metadata_handling
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
…alamb/datafusion into alamb/extract_parquet_metadata_handling
Which issue does this PR close?
Rationale for this change
As suggested by @nuno-faria here: #17022 (comment)
The number of options and flags that are being passed around to the various metadata handling
function in the parquet code is getting somewhat out of hand
For example in #17022 from @shehabgamin a significant portion
of the PR is adding new options to existing functions to thread through the new options
and the tests. If we had this code organized better it would be easier to maintain and extend.
Also, as we use the caching more it is important to ensure it is used in all the right places.
What changes are included in this PR?
Proposal:
DFParquetMetadata
Are these changes tested?
yes, it is all covered by existing unit tests (changed in this PR)
Are there any user-facing changes?