Contributor

@alamb commented Aug 11, 2025

Which issue does this PR close?

Rationale for this change

As suggested by @nuno-faria here: #17022 (comment)

The number of options and flags that are being passed around to the various metadata handling
functions in the parquet code is getting somewhat out of hand.

For example in #17022 from @shehabgamin a significant portion
of the PR is adding new options to existing functions to thread through the new options
and the tests. If we had this code organized better it would be easier to maintain and extend.

Also, as we use the caching more it is important to ensure it is used in all the right places.

What changes are included in this PR?

Proposal:

  1. Extract the options into a struct DFParquetMetadata
  2. Deprecate the old functions
  3. Update the functions / tests to create the struct
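
A minimal sketch of the pattern proposed in the steps above, using only the standard library. Every name here (ToyMetadataLoader, ToyCache, with_size_hint, with_cache) is hypothetical and chosen for illustration; it is not the DataFusion API, and the real methods are async and read from an object store:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a shared file metadata cache, keyed by path.
pub type ToyCache = Arc<Mutex<HashMap<String, String>>>;

/// Hypothetical struct that gathers every metadata option in one place,
/// so adding an option means adding a field rather than a new argument
/// to each function.
pub struct ToyMetadataLoader {
    path: String,
    size_hint: Option<usize>,
    cache: Option<ToyCache>,
}

impl ToyMetadataLoader {
    pub fn new(path: impl Into<String>) -> Self {
        Self { path: path.into(), size_hint: None, cache: None }
    }

    /// Builder-style setter for the (assumed) size hint option.
    pub fn with_size_hint(mut self, size_hint: Option<usize>) -> Self {
        self.size_hint = size_hint;
        self
    }

    /// Builder-style setter for the (assumed) cache option.
    pub fn with_cache(mut self, cache: Option<ToyCache>) -> Self {
        self.cache = cache;
        self
    }

    /// Returns the cached entry if present, otherwise "loads" and caches it.
    pub fn fetch_metadata(&self) -> String {
        if let Some(cache) = &self.cache {
            let cached = cache.lock().unwrap().get(&self.path).cloned();
            if let Some(hit) = cached {
                return hit;
            }
        }
        let loaded = format!("metadata({}, hint={:?})", self.path, self.size_hint);
        if let Some(cache) = &self.cache {
            cache.lock().unwrap().insert(self.path.clone(), loaded.clone());
        }
        loaded
    }

    /// Every derived operation goes through `fetch_metadata`, so the cache
    /// is consulted in all of them.
    pub fn fetch_schema(&self) -> String {
        format!("schema from {}", self.fetch_metadata())
    }
}
```

Because each operation is a method on the same struct, the options are set once and the cache check lives in a single place.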

Are these changes tested?

Yes, it is all covered by existing unit tests (changed in this PR).

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels Aug 11, 2025
@alamb changed the title from "Consolidate Parquet Metadata handling" to "Consolidate Parquet Metadata handling into its own module and struct DFParquetMetadata" Aug 11, 2025
@alamb force-pushed the alamb/extract_parquet_metadata_handling branch from 8c2a99f to d993b04 on August 11, 2025 18:49
Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()),
)
.await?;
let file_metadata_cache =
Contributor Author

This shows the key API difference: instead of calling a bunch of free functions, you now construct a DFParquetMetadata and call methods on that struct.

Contributor

It looks way cleaner now.

// Increases by 3 because cache has no entries yet
fetch_parquet_metadata(
Contributor Author

I think the new struct makes it much clearer what is being tested versus what is test setup, and I find the updated tests much easier to read.

@@ -306,30 +301,6 @@ fn clear_metadata(
})
}

async fn fetch_schema_with_location(
Contributor Author

Most of this PR is moving code in this module into metadata.rs

@@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> {
/// through [`ParquetFileReaderFactory`].
///
/// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory
pub async fn fetch_parquet_metadata<F: MetadataFetch>(
fetch: F,
#[deprecated(
Contributor Author

I left all the existing public APIs in place, deprecated them, and updated them to call the new DFParquetMetadata struct.
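
A minimal sketch of that deprecation approach, assuming hypothetical names (ToyMetadata, fetch_toy_metadata) rather than the actual DataFusion signatures, and using synchronous code for brevity where the real functions are async:

```rust
/// Hypothetical consolidated struct that owns the options.
pub struct ToyMetadata {
    path: String,
    size_hint: Option<usize>,
}

impl ToyMetadata {
    pub fn new(path: &str) -> Self {
        Self { path: path.to_string(), size_hint: None }
    }

    pub fn with_size_hint(mut self, size_hint: Option<usize>) -> Self {
        self.size_hint = size_hint;
        self
    }

    /// Stand-in for the real work of reading and decoding metadata.
    pub fn fetch(&self) -> String {
        format!("{} (hint: {:?})", self.path, self.size_hint)
    }
}

/// The old free function keeps its signature so existing callers still
/// compile, but it now only forwards to the struct and emits a
/// deprecation warning at each call site.
#[deprecated(note = "use `ToyMetadata::new(..).with_size_hint(..).fetch()` instead")]
pub fn fetch_toy_metadata(path: &str, size_hint: Option<usize>) -> String {
    ToyMetadata::new(path).with_size_hint(size_hint).fetch()
}
```

Keeping the wrapper as a thin delegate means the old and new entry points cannot drift apart in behavior.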

@@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized(
Ok(file_metadata)
}

/// Min/max aggregation can take Dictionary encode input but always produces unpacked
Contributor Author

I am quite pleased that most of the statistics handling is now consolidated into its own module.

file_meta.object_meta.location,
))
})
// TODO should there be metadata prefetch hint here?
Contributor Author

The metadata prefetch hint isn't passed here (it isn't on main either) but this refactor leads me to believe it might be helpful to do so 🤔

Contributor

From a user's perspective, I think it makes sense that the metadata prefetch option should apply everywhere metadata is fetched. It can be quite confusing when you change an option and either see no change at all (positive, negative, system resource usage etc.), or perhaps even worse, inconsistent change based on a specific workflow (e.g. "Why do queries for table X use twice the network hops, but table Y uses 50% more bandwidth?")

Contributor Author

@alamb Aug 11, 2025

Yeah, in theory it should be controlled by a config option: https://datafusion.apache.org/user-guide/configs.html

datafusion.execution.parquet.metadata_size_hint (default: NULL)

I haven't traced down why that one is not used here.

Contributor

At the time I only set the hint in inner = inner.with_footer_size_hint(hint), and then in get_metadata we would read it like so: reader.try_load(&mut self.inner, object_meta.size).await?;. Yes, it's better if we pass it to DFParquetMetadata.

Contributor

I don't think this is a change / regression in this PR, so can we open an issue to follow up and let it go here?

I do agree it should be passed down from the config at least in ListingTable and other "default" uses that have access to the config.
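
To illustrate why the hint matters on every code path, here is a simplified model of how many object store requests a footer read needs. It assumes the standard Parquet layout (an 8-byte footer holding the metadata length and the magic bytes, preceded by the metadata itself); the function name metadata_requests is hypothetical, and real readers may differ in the details:

```rust
/// Size of the Parquet footer: 4-byte metadata length plus 4-byte "PAR1" magic.
const FOOTER_SIZE: usize = 8;

/// Simplified model (an assumption, not the reader's actual code) of the
/// number of range requests needed to obtain the file metadata.
///
/// * No hint: read the 8-byte footer first, then the metadata (2 requests).
/// * Hint large enough to cover metadata plus footer: 1 request.
/// * Hint too small: the first read is insufficient, so a second is needed.
pub fn metadata_requests(metadata_len: usize, size_hint: Option<usize>) -> usize {
    match size_hint {
        Some(hint) if hint >= metadata_len + FOOTER_SIZE => 1,
        _ => 2,
    }
}
```

Under this model, a code path that ignores the configured hint always pays for two requests, which is the kind of inconsistency between workflows described above.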

@alamb force-pushed the alamb/extract_parquet_metadata_handling branch from d993b04 to ef90d05 on August 11, 2025 19:03
@github-actions bot added the common (Related to common crate) label Aug 11, 2025
@alamb marked this pull request as ready for review August 11, 2025 19:07
@nuno-faria
Contributor

LGTM, it's a much cleaner API.

@alamb
Contributor Author

alamb commented Aug 20, 2025

I merged up to fix some conflicts, largely caused by

@alamb
Contributor Author

alamb commented Aug 20, 2025

@adriangb I wonder if you might be able to review this PR? @jonathanc-n and @nuno-faria have already approved it, but neither of them is a committer, so unfortunately I can't merge this PR yet.

@adriangb
Contributor

Yes, will put it on my queue.

@adriangb self-requested a review August 20, 2025 20:26
Contributor

@adriangb left a comment

Overall looks good! Some nits and a request for a followup issue.


let fetch = ObjectStoreFetch::new(*store, object_meta);

// implementation to fetch parquet metadata
Contributor

I'm not reviewing the implementation; I assume it was largely copied over.

Contributor Author

Yes, it was entirely copied over

}

if cache_metadata && file_metadata_cache.is_some() {
// Need to retrieve the entire metadata for the caching to be effective.
Contributor

Does this mean that even if I have page indexes disabled, using a metadata cache will still retrieve (and decode?) them?

Contributor Author

Yes, I think that is exactly what it means, which is confusing.
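
The interaction discussed here can be modeled with a small decision function. This is a sketch of the behavior as described in this thread, not the actual implementation; the names PageIndexLoad and page_index_load are hypothetical:

```rust
/// Whether the page index is fetched along with the rest of the metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexLoad {
    Skip,
    Load,
}

/// Model of the behavior discussed above: when metadata caching is enabled
/// and a cache is available, the entire metadata (page index included) is
/// retrieved so the cached entry is complete, regardless of whether the
/// caller asked for page indexes.
pub fn page_index_load(
    page_index_requested: bool,
    cache_metadata: bool,
    cache_available: bool,
) -> PageIndexLoad {
    if page_index_requested || (cache_metadata && cache_available) {
        PageIndexLoad::Load
    } else {
        PageIndexLoad::Skip
    }
}
```

Under this model, page_index_load(false, true, true) yields Load, which is the surprising case raised in the question.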

Comment on lines +173 to +175
/// Read and parse the schema of the Parquet file
pub async fn fetch_schema(&self) -> Result<Schema> {
let metadata = self.fetch_metadata().await?;
Contributor

Related to question above: how much work is fetching the schema? Does it also fetch row group stats? Page indexes?

Contributor Author

TL;DR: I am not 100% sure, but I aim to find out with @BlakeOrth.

Ok((loc_path, schema))
}

pub async fn fetch_statistics(&self, table_schema: &SchemaRef) -> Result<Statistics> {
Contributor

Needs a docstring with details about what this operation involves.

@alamb merged commit f363e38 into apache:main Aug 22, 2025
27 checks passed