
[Parquet] Add tests for IO/CPU access in parquet reader #7971


Merged
6 commits merged into apache:main on Aug 15, 2025

Conversation


@alamb alamb commented Jul 21, 2025

Which issue does this PR close?

Rationale for this change

There is quite a bit of code in the current Parquet sync and async readers related to IO patterns that I do not think is covered by existing tests. As I refactor the guts of the readers into the PushDecoder, I would like to ensure we don't introduce regressions in existing functionality.

I would like to add tests that cover the IO patterns of the Parquet Reader so I don't break it

What changes are included in this PR?

Add tests that:

  1. Create a temporary Parquet file with a known row group structure
  2. Read data from that file using the Arrow Parquet reader, recording the IO operations
  3. Assert the expected IO patterns in a human-understandable form

This is done for both the sync and async readers.
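The recording harness described above can be sketched with the standard library only: a wrapper that logs every byte range it serves and renders the log as human-readable lines that tests can assert against. All names here are illustrative stand-ins, not the PR's actual types:

```rust
use std::cell::RefCell;
use std::ops::Range;

/// Records every byte range read from an in-memory "file",
/// mimicking the style of IO-tracking wrapper this PR adds.
struct RecordingReader {
    data: Vec<u8>,
    log: RefCell<Vec<Range<usize>>>,
}

impl RecordingReader {
    fn new(data: Vec<u8>) -> Self {
        Self { data, log: RefCell::new(Vec::new()) }
    }

    /// Serve bytes for `range`, remembering that the range was requested
    fn get_bytes(&self, range: Range<usize>) -> &[u8] {
        self.log.borrow_mut().push(range.clone());
        &self.data[range]
    }

    /// Render the IO log as human-readable lines, like the PR's assertions
    fn describe(&self) -> Vec<String> {
        self.log
            .borrow()
            .iter()
            .map(|r| format!("read {} bytes at offset {}", r.len(), r.start))
            .collect()
    }
}

fn main() {
    let reader = RecordingReader::new(vec![0u8; 1024]);
    // Simulate a footer read (last 8 bytes) followed by a metadata read
    reader.get_bytes(1016..1024);
    reader.get_bytes(900..1016);
    assert_eq!(
        reader.describe(),
        vec![
            "read 8 bytes at offset 1016".to_string(),
            "read 116 bytes at offset 900".to_string(),
        ]
    );
}
```

A real harness would sit behind the reader's IO trait rather than be called directly, but the shape — record, then describe in plain language — is the same.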

I am sorry this is such a massive PR, but it is entirely tests and I think it is quite important. I could break the sync or async tests into their own PRs, but that seems unnecessary

Are these changes tested?

Yes, indeed the entire PR is only tests

Are there any user-facing changes?

alamb commented Jul 22, 2025

Update: I am quite pleased with how the sync reader tests look. Now I am working out how to test the async reader

@alamb alamb force-pushed the alamb/parquet_io_test branch 3 times, most recently from 1315070 to 8d25562 Compare August 6, 2025 20:12
@alamb alamb changed the title WIP: [Parquet] Add tests for IO/CPU access in parquet reader [Parquet] Add tests for IO/CPU access in parquet reader Aug 6, 2025
@alamb alamb requested a review from Copilot August 6, 2025 20:38
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive test coverage for IO patterns in both sync and async Parquet readers. The purpose is to ensure that existing IO functionality doesn't regress when refactoring the Parquet reader internals as part of the PushDecoder work.

Key changes:

  • Creates a new IO testing module with infrastructure to track and validate IO operations during Parquet reads
  • Adds extensive test coverage for various reading scenarios including projections, row filters, and row selections

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File summary:

  • parquet/tests/arrow_reader/mod.rs: Adds new IO test module
  • parquet/tests/arrow_reader/io/mod.rs: Core testing infrastructure for tracking and analyzing IO patterns
  • parquet/tests/arrow_reader/io/sync_reader.rs: Tests for synchronous Parquet reader IO patterns
  • parquet/tests/arrow_reader/io/async_reader.rs: Tests for asynchronous Parquet reader IO patterns
  • parquet/src/file/reader.rs: Minor documentation improvements for the ChunkReader trait
Comments suppressed due to low confidence (1)

parquet/tests/arrow_reader/io/mod.rs:359

  • The format string is malformed. There's an extra colon and quote after 'dictionary_page: true,' that will appear literally in the output string. This should be either removed or the format string should be restructured.

@alamb alamb force-pushed the alamb/parquet_io_test branch from 4024030 to 3b5ec20 Compare August 6, 2025 21:02
@alamb alamb force-pushed the alamb/parquet_io_test branch from 3b5ec20 to cb89102 Compare August 7, 2025 11:46
&test_file,
builder,
[
"Get Provided Metadata",
alamb (PR author):

The whole point of this PR is to get tests in this style: human-readable descriptions of what IO the decoders are doing. I am quite pleased with how it came out, though it took a lot of work 😅


// Expect to see only IO for Row Group 1.
// Should see no IO for Row Group 0.
run_test(
alamb (PR author):

I find it pretty cool to see the IO patterns visible and tested -- like here is the IO pattern showing that projection pushdown actually reduces IO!

//
// Note there is significant IO that happens during the construction of the
// reader (between "Builder Configured" and "Reader Built")
run_test(
alamb (PR author):

This is also pretty cool -- it shows that the IO for evaluating the filter with the sync reader actually happens during the construction


// Expect to see I/O for column b in both row groups to evaluate the filter,
// then a single page for the "a" column in each row group
run_test(
alamb (PR author):

Here you can see the async reader does IO during read, not during reader construction, which is different from the sync reader

@alamb alamb marked this pull request as ready for review August 7, 2025 12:10

alamb commented Aug 7, 2025

@zhuqi-lucas , @crepererum @XiangpengHao, @tustvold, @Dandandan, @thinkharderdev and @etseidl -- you may be interested in this PR. It does not make any code changes, but adds tests that show the IO patterns of the existing parquet readers

I view this as a first step towards actually changing those IO patterns

This PR is ready to review, but it is failing clippy due to

.with_projection(ProjectionMask::columns(&schema_descr, ["a"]))
.with_limit(125);

run_test(
A contributor commented:

This looks a lot like a snapshot. These hardcoded strings will be a pain to update. Hence, could we use something like insta, which is widely adopted by the Rust ecosystem, including DataFusion?

alamb (PR author):

that is an excellent idea -- I will do so

alamb (PR author):

In 851495c and e239704

}

/// Create an appropriate LogEntry for the specified range
fn entry_for_range(&self, range: &Range<usize>) -> LogEntry {
alamb (PR author):

this is the most complicated part of this PR -- basically translating a range into a human-readable description of what that range represents
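That translation step can be sketched with the standard library only: given a (hypothetical) file layout, classify each raw byte range as the footer, the metadata, or a data page. All names and offsets below are illustrative, not the PR's actual code:

```rust
use std::ops::Range;

/// Known layout of a hypothetical Parquet file, used to label reads
struct FileLayout {
    footer: Range<usize>,
    metadata: Range<usize>,
    /// (row_group, column, byte range) for each data page; illustrative only
    pages: Vec<(usize, usize, Range<usize>)>,
}

impl FileLayout {
    /// Translate a raw byte range into a human-readable description,
    /// in the spirit of the PR's `entry_for_range`
    fn describe(&self, range: &Range<usize>) -> String {
        if *range == self.footer {
            return "footer".to_string();
        }
        if *range == self.metadata {
            return "metadata".to_string();
        }
        for (rg, col, page_range) in &self.pages {
            // A read that falls entirely inside a page is attributed to it
            if page_range.start <= range.start && range.end <= page_range.end {
                return format!("data page (row group {rg}, column {col})");
            }
        }
        format!("unknown range {}..{}", range.start, range.end)
    }
}

fn main() {
    let layout = FileLayout {
        footer: 92..100,
        metadata: 80..92,
        pages: vec![(0, 0, 0..40), (1, 0, 40..80)],
    };
    assert_eq!(layout.describe(&(92..100)), "footer");
    assert_eq!(layout.describe(&(80..92)), "metadata");
    assert_eq!(layout.describe(&(45..60)), "data page (row group 1, column 0)");
    assert_eq!(layout.describe(&(200..210)), "unknown range 200..210");
}
```

The real code must derive the layout from the file's parsed metadata rather than hardcode it, which is exactly why this part of the PR amounts to a partial second parser, as the next comment observes.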

A contributor commented:

So we basically have a 2nd file parser here 🤔 I'm wondering if instead of creating yet-another-parser -- even though it's partial -- we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests. For example, we could extend AsyncFileReader with an implemented-by-default method:

trait AsyncFileReader {
   // all the current methods stay!

   fn get_bytes_with_trace(&mut self, range: Range<u64>, trace: Trace) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
        // ignore trace by default
        self.get_bytes(range)
   }

   // same for the other two methods...
}

// bikeshed whatever `Trace` is, maybe use http::Extensions?

alamb (PR author), Aug 8, 2025:

we could ask the decoder to provide us a "reason" or a "trace" on the individual read requests.

This is a neat idea.

I definitely don't want to make AsyncFileReader any more complicated than it currently is 🤮

However, this could be fairly easily added to the Push decoder API here:

Maybe instead of

/// This is used to communicate between the decoder and the caller
/// to indicate what data is needed next, or what the result of decoding is.
#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between the minimum needed to make progress and what could be used?
    NeedsData(Vec<Range<u64>>),
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

We could add something like

pub enum DataRequest {
  /// The last 8 bytes of the file
  Footer,
  /// Metadata at the end of the file
  Metadata,
  PageIndex,
  DataPage { row_group_index: usize, page_index: usize },
  Unknown,
  // ...
}


#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between the minimum needed to make progress and what could be used?
    NeedsData {
      trace: DataRequest,
      ranges: Vec<Range<u64>>,
    },
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

🤔
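What this buys the caller can be shown with a self-contained toy: a decoder whose `NeedsData` carries a `trace` explaining *why* the range is needed. All types and names here are hypothetical stand-ins for the sketch above, not the actual crate API:

```rust
use std::fmt::Debug;
use std::ops::Range;

// Hypothetical stand-ins for the `DataRequest` / `DecodeResult` sketch above
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum DataRequest {
    Footer,
    Metadata,
}

#[derive(Debug)]
#[allow(dead_code)]
enum DecodeResult<T: Debug> {
    /// The decoder needs these ranges, and `trace` says why
    NeedsData { trace: DataRequest, ranges: Vec<Range<u64>> },
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

/// A toy decoder that requests the 8-byte footer once, then finishes
struct ToyDecoder {
    asked_footer: bool,
    file_len: u64,
}

impl ToyDecoder {
    fn next(&mut self) -> DecodeResult<Vec<u8>> {
        if !self.asked_footer {
            self.asked_footer = true;
            DecodeResult::NeedsData {
                trace: DataRequest::Footer,
                ranges: vec![self.file_len - 8..self.file_len],
            }
        } else {
            DecodeResult::Finished
        }
    }
}

fn main() {
    let mut decoder = ToyDecoder { asked_footer: false, file_len: 1024 };
    // The caller can now log *why* each read happens, not just its byte range
    match decoder.next() {
        DecodeResult::NeedsData { trace, ranges } => {
            assert_eq!(trace, DataRequest::Footer);
            assert_eq!(ranges, vec![1016..1024]);
        }
        _ => panic!("expected a data request"),
    }
    assert!(matches!(decoder.next(), DecodeResult::Finished));
}
```

Because the trace rides on the push-decoder result rather than on AsyncFileReader, existing IO implementations need no changes to benefit from it.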

A contributor commented:

Yes, that's better. You're right that instead of using the async interface -- which is just one way to drive the push decoder -- we should use the push decoder interface directly to convey the trace/intent 👍

alamb (PR author), Aug 15, 2025:

I copied this idea into its own ticket so it doesn't get lost when we merge this PR

@XiangpengHao XiangpengHao left a comment

Thank you @alamb, the tests look good to me. I like that we have a human-readable trace of what's going on with the reader's IO patterns!


alamb commented Aug 15, 2025

Thank you for the review @XiangpengHao -- I plan to merge this PR once it passes CI, as I want to use it as a way to test out using the push decoder as the guts of the async reader


alamb commented Aug 15, 2025

🚀 hooray for testing

@alamb alamb merged commit d7d847a into apache:main Aug 15, 2025
17 checks passed
Labels
parquet Changes to the parquet crate
3 participants