
[Parquet] Add tests for IO/CPU access in parquet reader #7971


Merged
6 commits merged into apache:main on Aug 15, 2025

Conversation


@alamb alamb commented Jul 21, 2025

Which issue does this PR close?

Rationale for this change

There is quite a bit of code in the current Parquet sync and async readers related to IO patterns that I do not think is covered by existing tests. As I refactor the guts of the readers into the PushDecoder, I would like to ensure we don't introduce regressions in existing functionality.

I would like to add tests that cover the IO patterns of the Parquet Reader so I don't break it

What changes are included in this PR?

Add tests that:

  1. Create a temporary Parquet file with a known row group structure
  2. Read data from that file using the Arrow Parquet reader, recording the IO operations
  3. Assert the expected IO patterns in a human-understandable form

This is done for both the sync and async readers.
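The recording harness described above can be sketched with the standard library only: a wrapper that logs every byte range it serves and renders the log as human-readable lines that tests can assert against. All names here are illustrative stand-ins, not the PR's actual types:

```rust
use std::cell::RefCell;
use std::ops::Range;

/// Records every byte range read from an in-memory "file",
/// mimicking the style of IO-tracking wrapper this PR adds.
struct RecordingReader {
    data: Vec<u8>,
    log: RefCell<Vec<Range<usize>>>,
}

impl RecordingReader {
    fn new(data: Vec<u8>) -> Self {
        Self { data, log: RefCell::new(Vec::new()) }
    }

    /// Serve bytes for `range`, remembering that the range was requested
    fn get_bytes(&self, range: Range<usize>) -> &[u8] {
        self.log.borrow_mut().push(range.clone());
        &self.data[range]
    }

    /// Render the IO log as human-readable lines, like the PR's assertions
    fn describe(&self) -> Vec<String> {
        self.log
            .borrow()
            .iter()
            .map(|r| format!("read {} bytes at offset {}", r.len(), r.start))
            .collect()
    }
}

fn main() {
    let reader = RecordingReader::new(vec![0u8; 1024]);
    // Simulate a footer read (last 8 bytes) followed by a metadata read
    reader.get_bytes(1016..1024);
    reader.get_bytes(900..1016);
    assert_eq!(
        reader.describe(),
        vec![
            "read 8 bytes at offset 1016".to_string(),
            "read 116 bytes at offset 900".to_string(),
        ]
    );
}
```

A real harness would sit behind the reader's IO trait rather than be called directly, but the shape — record, then describe in plain language — is the same.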

I am sorry this is such a massive PR, but it is entirely tests and I think it is quite important. I could break the sync or async tests into their own PRs, but that seems unnecessary

Are these changes tested?

Yes, indeed the entire PR is only tests

Are there any user-facing changes?

alamb commented Jul 22, 2025

Update: I am quite pleased with how the sync reader tests look. Now I am working out how to test the async reader

@alamb alamb force-pushed the alamb/parquet_io_test branch 3 times, most recently from 1315070 to 8d25562 Compare August 6, 2025 20:12
@alamb alamb changed the title WIP: [Parquet] Add tests for IO/CPU access in parquet reader [Parquet] Add tests for IO/CPU access in parquet reader Aug 6, 2025
@alamb alamb requested a review from Copilot August 6, 2025 20:38
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive test coverage for IO patterns in both sync and async Parquet readers. The purpose is to ensure that existing IO functionality doesn't regress when refactoring the Parquet reader internals as part of the PushDecoder work.

Key changes:

  • Creates a new IO testing module with infrastructure to track and validate IO operations during Parquet reads
  • Adds extensive test coverage for various reading scenarios including projections, row filters, and row selections

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File summary:

  • parquet/tests/arrow_reader/mod.rs: Adds new IO test module
  • parquet/tests/arrow_reader/io/mod.rs: Core testing infrastructure for tracking and analyzing IO patterns
  • parquet/tests/arrow_reader/io/sync_reader.rs: Tests for synchronous Parquet reader IO patterns
  • parquet/tests/arrow_reader/io/async_reader.rs: Tests for asynchronous Parquet reader IO patterns
  • parquet/src/file/reader.rs: Minor documentation improvements for the ChunkReader trait
Comments suppressed due to low confidence (1)

parquet/tests/arrow_reader/io/mod.rs:359

  • The format string is malformed. There's an extra colon and quote after 'dictionary_page: true,' that will appear literally in the output string. This should be either removed or the format string should be restructured.

@alamb alamb force-pushed the alamb/parquet_io_test branch from 4024030 to 3b5ec20 Compare August 6, 2025 21:02
@alamb alamb force-pushed the alamb/parquet_io_test branch from 3b5ec20 to cb89102 Compare August 7, 2025 11:46
&test_file,
builder,
[
"Get Provided Metadata",
alamb (PR author):

The whole point of this PR is to get tests in this style: human-readable descriptions of what IO the decoders are doing. I am quite pleased with how it came out, though it took a lot of work 😅


// Expect to see only IO for Row Group 1.
// Should see no IO for Row Group 0.
run_test(
alamb (PR author):

I find it pretty cool to see the IO patterns visible and tested -- like here is the IO pattern showing that projection pushdown actually reduces IO!

//
// Note there is significant IO that happens during the construction of the
// reader (between "Builder Configured" and "Reader Built")
run_test(
alamb (PR author):

This is also pretty cool -- it shows that the IO for evaluating the filter with the sync reader actually happens during the construction


// Expect to see I/O for column b in both row groups to evaluate the filter,
// then a single page for the "a" column in each row group
run_test(
alamb (PR author):

Here you can see the async reader does IO during read, not during reader construction, which is different from the sync reader

@alamb alamb marked this pull request as ready for review August 7, 2025 12:10

alamb commented Aug 7, 2025

@zhuqi-lucas , @crepererum @XiangpengHao, @tustvold, @Dandandan, @thinkharderdev and @etseidl -- you may be interested in this PR. It does not make any code changes, but adds tests that show the IO patterns of the existing parquet readers

I view this as a first step towards actually changing those IO patterns

This PR is ready to review, but it is failing clippy due to

.with_projection(ProjectionMask::columns(&schema_descr, ["a"]))
.with_limit(125);

run_test(
A contributor commented:

This looks a lot like a snapshot. These hardcoded strings will be a pain to update. Hence, could we use something like insta, which is widely adopted by the Rust ecosystem, including DataFusion?

alamb (PR author):

that is an excellent idea -- I will do so

alamb (PR author):

In 851495c and e239704

}

/// Create an appropriate LogEntry for the specified range
fn entry_for_range(&self, range: &Range<usize>) -> LogEntry {
alamb (PR author):

this is the most complicated part of this PR -- basically translating a range into a human-readable description of what that range represents
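That translation step can be sketched with the standard library only: given a (hypothetical) file layout, classify each raw byte range as the footer, the metadata, or a data page. All names and offsets below are illustrative, not the PR's actual code:

```rust
use std::ops::Range;

/// Known layout of a hypothetical Parquet file, used to label reads
struct FileLayout {
    footer: Range<usize>,
    metadata: Range<usize>,
    /// (row_group, column, byte range) for each data page; illustrative only
    pages: Vec<(usize, usize, Range<usize>)>,
}

impl FileLayout {
    /// Translate a raw byte range into a human-readable description,
    /// in the spirit of the PR's `entry_for_range`
    fn describe(&self, range: &Range<usize>) -> String {
        if *range == self.footer {
            return "footer".to_string();
        }
        if *range == self.metadata {
            return "metadata".to_string();
        }
        for (rg, col, page_range) in &self.pages {
            // A read that falls entirely inside a page is attributed to it
            if page_range.start <= range.start && range.end <= page_range.end {
                return format!("data page (row group {rg}, column {col})");
            }
        }
        format!("unknown range {}..{}", range.start, range.end)
    }
}

fn main() {
    let layout = FileLayout {
        footer: 92..100,
        metadata: 80..92,
        pages: vec![(0, 0, 0..40), (1, 0, 40..80)],
    };
    assert_eq!(layout.describe(&(92..100)), "footer");
    assert_eq!(layout.describe(&(80..92)), "metadata");
    assert_eq!(layout.describe(&(45..60)), "data page (row group 1, column 0)");
    assert_eq!(layout.describe(&(200..210)), "unknown range 200..210");
}
```

The real code must derive the layout from the file's parsed metadata rather than hardcode it, which is exactly why this part of the PR amounts to a partial second parser, as the next comment observes.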

A contributor commented:

So we basically have a 2nd file parser here 🤔 I'm wondering if instead of creating yet-another-parser -- even though it's partial -- we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests. For example, we could extend AsyncFileReader with an implemented-by-default method:

trait AsyncFileReader {
   // all the current methods stay!

   fn get_bytes_with_trace(&mut self, range: Range<u64>, trace: Trace) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
        // ignore trace by default
        self.get_bytes(range)
   }

   // same for the other two methods...
}

// bikeshed whatever `Trace` is, maybe use http::Extensions?

alamb (PR author), Aug 8, 2025:

we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests.

This is a neat idea.

I definitely don't want to make AsyncFileReader any more complicated than it currently is 🤮

However, this could be fairly easily added to the Push decoder API here:

Maybe instead of

/// This is used to communicate between the decoder and the caller
/// to indicate what data is needed next, or what the result of decoding is.
#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between the minimum needed to make progress and what could be used?
    NeedsData(Vec<Range<u64>>),
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

We could add something like

pub enum DataRequest {
  /// The last 8 bytes of the file
  Footer,
  /// Metadata at the end of the file
  Metadata,
  PageIndex,
  DataPage { row_group_index: usize, page_index: usize },
  Unknown,
  // ...
}


#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between the minimum needed to make progress and what could be used?
    NeedsData {
      trace: DataRequest,
      ranges: Vec<Range<u64>>,
    },
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

🤔
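What this buys the caller can be shown with a self-contained toy: a decoder whose `NeedsData` carries a `trace` explaining *why* the range is needed. All types and names here are hypothetical stand-ins for the sketch above, not the actual crate API:

```rust
use std::fmt::Debug;
use std::ops::Range;

// Hypothetical stand-ins for the `DataRequest` / `DecodeResult` sketch above
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum DataRequest {
    Footer,
    Metadata,
}

#[derive(Debug)]
#[allow(dead_code)]
enum DecodeResult<T: Debug> {
    /// The decoder needs these ranges, and `trace` says why
    NeedsData { trace: DataRequest, ranges: Vec<Range<u64>> },
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

/// A toy decoder that requests the 8-byte footer once, then finishes
struct ToyDecoder {
    asked_footer: bool,
    file_len: u64,
}

impl ToyDecoder {
    fn next(&mut self) -> DecodeResult<Vec<u8>> {
        if !self.asked_footer {
            self.asked_footer = true;
            DecodeResult::NeedsData {
                trace: DataRequest::Footer,
                ranges: vec![self.file_len - 8..self.file_len],
            }
        } else {
            DecodeResult::Finished
        }
    }
}

fn main() {
    let mut decoder = ToyDecoder { asked_footer: false, file_len: 1024 };
    // The caller can now log *why* each read happens, not just its byte range
    match decoder.next() {
        DecodeResult::NeedsData { trace, ranges } => {
            assert_eq!(trace, DataRequest::Footer);
            assert_eq!(ranges, vec![1016..1024]);
        }
        _ => panic!("expected a data request"),
    }
    assert!(matches!(decoder.next(), DecodeResult::Finished));
}
```

Because the trace rides on the push-decoder result rather than on AsyncFileReader, existing IO implementations need no changes to benefit from it.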

A contributor commented:

Yes, that's better. You're right that instead of using the async interface -- which is just one way to drive the push decoder -- we should use the push decoder interface directly to convey the trace/intent 👍

alamb (PR author), Aug 15, 2025:

I copied this idea into its own ticket so it doesn't get lost when we merge this PR

@XiangpengHao XiangpengHao left a comment

Thank you @alamb, the tests look good to me. I like that we have a human-readable trace of what's going on with the reader's IO patterns!


alamb commented Aug 15, 2025

Thank you for the review @XiangpengHao -- I plan to merge this PR once it passes CI, as I want to use it as a way to test out using the push decoder as the guts of the async reader


alamb commented Aug 15, 2025

🚀 hooray for testing

@alamb alamb merged commit d7d847a into apache:main Aug 15, 2025
17 checks passed
Labels
parquet Changes to the parquet crate
3 participants