[Parquet] Add tests for IO/CPU access in parquet reader #7971
Update: I am quite pleased with how the sync reader looks. Now I am working on sorting out how to test the async reader.
Pull Request Overview
This PR adds comprehensive test coverage for IO patterns in both sync and async Parquet readers. The purpose is to ensure that existing IO functionality doesn't regress when refactoring the Parquet reader internals as part of the PushDecoder work.
Key changes:
- Creates a new IO testing module with infrastructure to track and validate IO operations during Parquet reads
- Adds extensive test coverage for various reading scenarios including projections, row filters, and row selections
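The PR's actual tracking infrastructure lives in `parquet/tests/arrow_reader/io/mod.rs` and is not reproduced here, but the core idea can be sketched with a minimal, stdlib-only wrapper (the name `RecordingReader` and its fields are illustrative, not the PR's real types): wrap a reader, log every byte range read through it, and let tests assert on the observed IO pattern.

```rust
use std::io::{Read, Seek, SeekFrom};
use std::ops::Range;
use std::sync::{Arc, Mutex};

/// Hypothetical IO-tracking wrapper: records every byte range
/// that is read through the inner reader.
struct RecordingReader<R> {
    inner: R,
    pos: u64,
    log: Arc<Mutex<Vec<Range<u64>>>>,
}

impl<R: Read> Read for RecordingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.inner.read(buf)?;
        let start = self.pos;
        self.pos += n as u64;
        // Record the byte range this read actually covered
        self.log.lock().unwrap().push(start..self.pos);
        Ok(n)
    }
}

impl<R: Seek> Seek for RecordingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> std::io::Result<u64> {
        self.pos = self.inner.seek(pos)?;
        Ok(self.pos)
    }
}

fn main() {
    let data = std::io::Cursor::new(vec![0u8; 64]);
    let log = Arc::new(Mutex::new(Vec::new()));
    let mut reader = RecordingReader { inner: data, pos: 0, log: Arc::clone(&log) };

    // Seek to offset 32 and read 16 bytes; the log should record 32..48
    let mut buf = [0u8; 16];
    reader.seek(SeekFrom::Start(32)).unwrap();
    reader.read_exact(&mut buf).unwrap();
    assert_eq!(*log.lock().unwrap(), vec![32u64..48]);
    println!("recorded ranges: {:?}", log.lock().unwrap());
}
```

A test can then compare the recorded ranges (or human-readable descriptions derived from them) against an expected IO pattern.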
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| parquet/tests/arrow_reader/mod.rs | Adds new IO test module |
| parquet/tests/arrow_reader/io/mod.rs | Core testing infrastructure for tracking and analyzing IO patterns |
| parquet/tests/arrow_reader/io/sync_reader.rs | Tests for synchronous Parquet reader IO patterns |
| parquet/tests/arrow_reader/io/async_reader.rs | Tests for asynchronous Parquet reader IO patterns |
| parquet/src/file/reader.rs | Minor documentation improvements for ChunkReader trait |
Comments suppressed due to low confidence (1)
parquet/tests/arrow_reader/io/mod.rs:359
- The format string is malformed. There's an extra colon and quote after 'dictionary_page: true,' that will appear literally in the output string. This should be either removed or the format string should be restructured.
```rust
&test_file,
builder,
[
    "Get Provided Metadata",
```
The whole point of this PR is to get tests in this style: human readable descriptions of what IO the decoders are doing. I am quite pleased with how it came out, though it took a lot of work 😅
```rust
// Expect to see only IO for Row Group 1.
// Should see no IO for Row Group 0.
run_test(
```
I find it pretty cool to see the IO patterns visible and tested -- like here is the IO pattern showing that projection pushdown actually reduces IO!
```rust
//
// Note there is significant IO that happens during the construction of the
// reader (between "Builder Configured" and "Reader Built")
run_test(
```
This is also pretty cool -- it shows that the IO for evaluating the filter with the sync reader actually happens during construction of the reader.
```rust
// Expect to see I/O for column b in both row groups to evaluate the filter,
// then a single page for the "a" column in each row group
run_test(
```
Here you can see the async reader does IO during read, not during reader construction, which is different from the sync reader.

@zhuqi-lucas, @crepererum, @XiangpengHao, @tustvold, @Dandandan, @thinkharderdev and @etseidl -- you may be interested in this PR. It does not make any code changes, but adds tests that show the IO patterns of the existing parquet readers. I view this as a first step towards actually changing those IO patterns. This PR is ready to review, but it is failing clippy due to
```rust
.with_projection(ProjectionMask::columns(&schema_descr, ["a"]))
.with_limit(125);
```
```rust
run_test(
```
This looks a lot like a snapshot test. These hardcoded strings will be a pain to update. Hence, could we use something like insta -- which is widely adopted by the Rust ecosystem, incl. DataFusion?
that is an excellent idea -- I will do so
```rust
}

/// Create an appropriate LogEntry for the specified range
fn entry_for_range(&self, range: &Range<usize>) -> LogEntry {
```
this is the most complicated part of this PR -- basically translating a range into a human readable description of what that range represents
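The shape of that translation can be sketched with a self-contained toy (the `FileLayout` struct, `describe` function, and all offsets below are illustrative, not the PR's actual code): given the known byte offsets of the footer, metadata, and data pages, classify a raw read range into a human-readable label.

```rust
use std::ops::Range;

/// Hypothetical file layout: byte offsets of known Parquet structures.
struct FileLayout {
    footer: Range<u64>,                    // last 8 bytes: length + magic
    metadata: Range<u64>,                  // file metadata before the footer
    data_pages: Vec<(usize, Range<u64>)>,  // (row group index, byte range)
}

/// Translate a raw byte range into a human-readable description,
/// mirroring the idea behind `entry_for_range`.
fn describe(layout: &FileLayout, range: &Range<u64>) -> String {
    if *range == layout.footer {
        return "Footer (length + PAR1 magic)".to_string();
    }
    if layout.metadata.start <= range.start && range.end <= layout.metadata.end {
        return "File metadata".to_string();
    }
    for (rg, pages) in &layout.data_pages {
        if pages.start <= range.start && range.end <= pages.end {
            return format!("Data pages of row group {rg}");
        }
    }
    format!("Unknown range {range:?}")
}

fn main() {
    // Made-up offsets for a tiny two-row-group file
    let layout = FileLayout {
        footer: 992..1000,
        metadata: 900..992,
        data_pages: vec![(0, 0..450), (1, 450..900)],
    };
    assert_eq!(describe(&layout, &(992..1000)), "Footer (length + PAR1 magic)");
    assert_eq!(describe(&layout, &(500..600)), "Data pages of row group 1");
    println!("{}", describe(&layout, &(910..950)));
}
```

The real version has to handle page indexes, dictionary pages, and per-column offsets, which is where the complexity comes from.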
So we basically have a 2nd file parser here 🤔 I'm wondering if instead of creating yet-another-parser -- even though it's partial -- we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests. For example, we could extend `AsyncFileReader` with an implemented-by-default method:

```rust
trait AsyncFileReader {
    // all the current methods stay!
    fn get_bytes_with_trace(&mut self, range: Range<u64>, trace: Trace) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
        // ignore trace by default
        self.get_bytes(range)
    }
    // same for the other two methods...
}
```

(bikeshed whatever `Trace` is, maybe use `http::Extensions`?)
> we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests.

This is a neat idea. I definitely don't want to make `AsyncFileReader` any more complicated than it currently is 🤮

However, this could be fairly easily added to the Push decoder API here:
Maybe instead of

```rust
/// This is used to communicate between the decoder and the caller
/// to indicate what data is needed next, or what the result of decoding is.
#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between minimum needed to make progress and what could be used?
    NeedsData(Vec<Range<u64>>),
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}
```
We could add something like

```rust
pub enum DataRequest {
    /// last 8 bytes
    Footer,
    /// Metadata at the end
    Metadata,
    PageIndex,
    DataPage { row_group_index: usize, page_index: usize },
    Unknown,
    // ...
}

#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between minimum needed to make progress and what could be used?
    NeedsData {
        trace: DataRequest,
        ranges: Vec<Range<u64>>,
    },
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}
```
🤔
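To see how a caller might consume such a traced result, here is a self-contained toy (the `ToyDecoder` state machine and its offsets are made up; only the `DecodeResult`/`DataRequest` shapes mirror the sketch above): the driving loop learns *why* each byte range is being requested, which is exactly what the IO tests want to assert on.

```rust
use std::fmt::Debug;
use std::ops::Range;

// Shapes from the sketch above; not the crate's real API.
#[derive(Debug, PartialEq)]
enum DataRequest {
    Footer,
    Metadata,
}

#[derive(Debug)]
enum DecodeResult<T: Debug> {
    NeedsData { trace: DataRequest, ranges: Vec<Range<u64>> },
    Data(T),
    Finished,
}

/// Toy push decoder: requests the footer, then the metadata,
/// then yields one output item and finishes.
struct ToyDecoder {
    step: usize,
}

impl ToyDecoder {
    fn try_decode(&mut self) -> DecodeResult<&'static str> {
        self.step += 1;
        match self.step {
            1 => DecodeResult::NeedsData { trace: DataRequest::Footer, ranges: vec![992..1000] },
            2 => DecodeResult::NeedsData { trace: DataRequest::Metadata, ranges: vec![900..992] },
            3 => DecodeResult::Data("batch"),
            _ => DecodeResult::Finished,
        }
    }
}

fn main() {
    let mut decoder = ToyDecoder { step: 0 };
    let mut trace_log = Vec::new();
    loop {
        match decoder.try_decode() {
            DecodeResult::NeedsData { trace, ranges } => {
                // The caller now knows *why* each range is requested
                trace_log.push(format!("{trace:?}: {ranges:?}"));
            }
            DecodeResult::Data(batch) => println!("decoded {batch}"),
            DecodeResult::Finished => break,
        }
    }
    assert_eq!(trace_log.len(), 2);
    println!("{trace_log:?}");
}
```

A test harness could then record `trace_log` instead of reverse-engineering the meaning of each range from file offsets.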
Yes, that's better. You're right that instead of using the async interface -- which is just one way to drive the push decoder -- we should use the push decoder interface directly to convey the trace/intent 👍
I copied this idea into its own ticket so it doesn't get lost when we merge this PR
Thank you @alamb , the tests look good to me, I like that we have human readable trace of what's going on with the reader pattern!
Thank you for the review @XiangpengHao -- I plan to merge this PR once it passes, as I want to use it as a way to test out using the push decoder as the guts of the async reader.

🚀 hooray for testing
Which issue does this PR close?
Part of [Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000
Related to Speed up Parquet filter pushdown v4 (Predicate evaluation cache for async_reader) #7850
Rationale for this change
There is quite a bit of code in the current Parquet sync and async readers related to IO patterns that I do not think is covered by existing tests. As I refactor the guts of the readers into the PushDecoder, I would like to ensure we don't introduce regressions in existing functionality.
I would like to add tests that cover the IO patterns of the Parquet Reader so I don't break it
What changes are included in this PR?
Add tests which
This is done for both the sync and async readers.
I am sorry this is such a massive PR, but it is entirely tests and I think it is quite important. I could break the sync or async tests into their own PR, but this seems unnecessary.
Are these changes tested?
Yes, indeed the entire PR is only tests
Are there any user-facing changes?