[SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering
### What changes were proposed in this pull request?
The `FileFormatWriter` materializes an `AdaptiveQueryPlan` before accessing the plan's `outputOrdering`. This is required for Spark 3.0 to 3.3. Spark 3.4 does not need this because `FileFormatWriter` gets the final plan.
### Why are the changes needed?
`FileFormatWriter` enforces an ordering if the written plan does not provide that ordering. In Spark 3.0 to 3.3, an `AdaptiveQueryPlan` does not know its final ordering before it is materialized, so `FileFormatWriter` enforces its required ordering (e.g. by column `"a"`) even if the plan already provides a compatible ordering (e.g. by columns `"a", "b"`). When that enforced sort spills, the plan's original order (e.g. by columns `"a", "b"`) gets broken (see SPARK-40588).
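The decision described above can be illustrated with a small, self-contained model. This is only a sketch in plain Python, not Spark code: `AdaptivePlan`, `ordering_satisfied` and `needs_extra_sort` are hypothetical names, orderings are reduced to lists of column names, and the compatibility check is simplified to a prefix comparison.

```python
from dataclasses import dataclass
from typing import List


def ordering_satisfied(required: List[str], actual: List[str]) -> bool:
    """Simplified check: the required ordering must be a prefix of the actual one."""
    return len(required) <= len(actual) and actual[:len(required)] == required


@dataclass
class AdaptivePlan:
    """Toy stand-in for an adaptive plan (hypothetical, not a Spark class)."""
    final_ordering: List[str]
    materialized: bool = False

    @property
    def output_ordering(self) -> List[str]:
        # Assumption of this model: before materialization the final ordering
        # is unknown, so the plan reports no ordering at all.
        return self.final_ordering if self.materialized else []

    def materialize(self) -> "AdaptivePlan":
        self.materialized = True
        return self


def needs_extra_sort(plan: AdaptivePlan, required: List[str],
                     materialize_first: bool) -> bool:
    """Return True if the writer would add its own sort on top of the plan."""
    if materialize_first:
        plan.materialize()
    return not ordering_satisfied(required, plan.output_ordering)


if __name__ == "__main__":
    # Plan is sorted by ("a", "b"); the writer only requires ordering by "a".
    print(needs_extra_sort(AdaptivePlan(["a", "b"]), ["a"], materialize_first=False))
    print(needs_extra_sort(AdaptivePlan(["a", "b"]), ["a"], materialize_first=True))
```

In this model, reading the ordering without materializing makes the writer add a redundant sort by `"a"`, while materializing first lets it recognise that the ordering by `"a", "b"` already satisfies the requirement.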
### Does this PR introduce _any_ user-facing change?
This fixes SPARK-40588, which was introduced in 3.0. This restores behaviour from Spark 2.4.
### How was this patch tested?
The final plan that is written to files cannot be extracted from `FileFormatWriter`. The bug explained in [SPARK-40588](https://issues.apache.org/jira/browse/SPARK-40588) can only be asserted on the result files when spilling occurs. This is very hard to control in a unit test scenario.
Therefore, this was tested manually. The [example to reproduce this issue](https://issues.apache.org/jira/browse/SPARK-40588?focusedCommentId=17621032&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17621032) given in SPARK-40588 now produces sorted files.
The actual plan written into the files changed from
```
Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0
+- AdaptiveSparkPlan isFinalPlan=false
   +- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(day#2L, 2), REPARTITION_BY_NUM, [id=#30]
         +- BroadcastNestedLoopJoin BuildLeft, Inner
            :- BroadcastExchange IdentityBroadcastMode, [id=#28]
            :  +- Project [id#0L AS day#2L]
            :     +- Range (0, 2, step=1, splits=2)
            +- Range (0, 10000000, step=1, splits=2)
```
where `FileFormatWriter` enforces order with `Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0`, to
```
*(3) Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0
+- AQEShuffleRead coalesced
   +- ShuffleQueryStage 1
      +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, [id=#68]
         +- *(2) BroadcastNestedLoopJoin BuildLeft, Inner
            :- BroadcastQueryStage 0
            :  +- BroadcastExchange IdentityBroadcastMode, [id=#42]
            :     +- *(1) Project [id#0L AS day#2L]
            :        +- *(1) Range (0, 2, step=1, splits=2)
            +- *(2) Range (0, 1000000, step=1, splits=2)
```
where the sort given by the user is the outermost sort now.
Closes #38358 from EnricoMi/branch-3.3-materialize-aqe-plan.
Authored-by: Enrico Minack <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>