
[WIP] Fix problem in NestedColumnAliasing.scala, replaceWithAliases in connection with Generate plan node #49061


Closed
wants to merge 4 commits into from

Conversation


@trohwer trohwer commented Dec 4, 2024

When one uses replaceWithChildren, one has to be careful with Generate plan nodes. Generate keeps a list, unrequiredChildIndex, of the positions of child outputs that are not needed in the Generate output. When the child's output is replaced, these indices have to be adjusted accordingly; otherwise an incorrect plan may be generated during optimisation. Here is an example (tested with Spark 3.5.3):

from pyspark.sql import SparkSession

session = SparkSession.builder.master("local").getOrCreate()

session.sql("""
select
    named_struct(
          'b', '',
          'c', '',
          'd', array(named_struct('f', '', 'g', '')),
          'e', ''
    ) as a
""").write.mode("overwrite").parquet("tmp")

df = session.read.parquet("tmp")
df.createOrReplaceTempView("tmp")

sql = """
SELECT
a.b f1, a.c f2, x.f,
STACK(1, y) as (z)
FROM tmp
LATERAL VIEW POSEXPLODE_OUTER(a.d) as y, x
"""

session.sql(sql).explain()

#== Physical Plan ==                                                             
#*(1) !Project [_extract_b#21 AS f1#5, _extract_c#19 AS f2#6, _extract_f#20 AS f#12, z#13]
#+- *(1) Generate stack(1, y#8), [_extract_b#21, _extract_f#20], false, [z#13]
#   +- *(1) Project [_extract_b#21, y#8, x#9 AS _extract_f#20]
#      +- *(1) Generate posexplode(_extract_f#26), [_extract_b#21], true, [y#8, x#9]
#         +- *(1) Project [a#3.b AS _extract_b#21, a#3.d.f AS _extract_f#26]
#            +- *(1) ColumnarToRow
#               +- FileScan parquet [a#3] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/pa/test/spark-bug/tmp], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:struct<b:string,d:array<struct<f:string>>>>

session.sql(sql).show()

# java.lang.IllegalStateException: Couldn't find _extract_c#54 in [_extract_b#56,_extract_f#55,z#36]

One can see that the generated plan is invalid (_extract_c#19 is missing from the preceding Project), which yields the error above during execution. With this fix, the problem no longer occurs.
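To illustrate the bookkeeping involved, here is a minimal sketch in plain Python (not the actual Spark code; the function name and representation are hypothetical). unrequiredChildIndex stores positional indices into the child's output, so when nested-column pruning replaces that output with a different list of aliases, the indices must be recomputed rather than copied verbatim:

```python
# Hypothetical sketch of the index adjustment a Generate node needs when its
# child's output changes. Attributes are modeled as plain names; in Spark
# they are Attribute objects with expression IDs.

def remap_unrequired_child_index(old_output, unrequired_idx, new_output):
    """Recompute positional 'unrequired' indices after the child's output
    changed from old_output to new_output (both lists of attribute names)."""
    # Resolve the old indices to the attributes they referred to...
    unrequired_names = {old_output[i] for i in unrequired_idx}
    # ...then find where (if anywhere) those attributes sit in the new output.
    return [i for i, name in enumerate(new_output) if name in unrequired_names]

# Before pruning: the child emits four attributes; 'c' and 'e' are unrequired.
old_output = ["b", "c", "d", "e"]
unrequired = [1, 3]            # positions of 'c' and 'e'

# After pruning, only 'c' and 'b' survive, and 'c' has moved to position 0.
new_output = ["c", "b"]

print(remap_unrequired_child_index(old_output, unrequired, new_output))
# -> [0] : 'c' is still unrequired but now lives at index 0, not 1
```

Reusing the stale indices [1, 3] against the new output would mark the wrong attributes as unrequired, which is exactly how the Project above ends up dropping _extract_c.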


We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 16, 2025
@github-actions github-actions bot closed this Mar 17, 2025