[SPARK-6550][SQL] Use analyzed plan in DataFrame #5217

Closed
marmbrus wants to merge 3 commits

marmbrus
Contributor

This is based on the bug and test case proposed by @viirya. See #5203 for an excellent description of the problem.

TL;DR: The problem occurs because `groupBy(String)` calls `resolve`, which returns an `AttributeReference`. That `AttributeReference` comes from an analyzed plan that is then thrown away, and the plan is analyzed again at execution time. In the case of self-joins, each call to analyze produces a new tree for the left side of the join, so the previously returned `AttributeReference` is no longer valid.

As a fix, I propose that a `DataFrame` keep the analyzed plan instead of the unresolved plan.
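
The failure mode described above can be sketched with a toy model. This is not Spark code: `analyze`, `UnresolvedFrame`, `AnalyzedFrame`, and `group_by_is_valid` are hypothetical names invented for illustration, and the toy analyzer mints fresh attribute IDs on every call, which exaggerates what the PR describes as happening only for the left side of a self-join.

```python
import itertools
from dataclasses import dataclass

# Toy stand-in for expression IDs; NOT Spark's implementation.
_next_id = itertools.count(1)


@dataclass(frozen=True)
class AttributeReference:
    name: str
    expr_id: int


@dataclass(frozen=True)
class AnalyzedPlan:
    output: tuple

    def resolve(self, name):
        for attr in self.output:
            if attr.name == name:
                return attr
        raise KeyError(name)


def analyze(columns):
    """Hypothetical analyzer: every call produces a new tree with new IDs."""
    return AnalyzedPlan(tuple(AttributeReference(c, next(_next_id)) for c in columns))


class UnresolvedFrame:
    """Keeps only the unresolved plan and re-analyzes on every use."""

    def __init__(self, columns):
        self.columns = list(columns)

    def resolve(self, name):
        # The analyzed plan built here is discarded after the lookup.
        return analyze(self.columns).resolve(name)

    def group_by_is_valid(self, attr):
        # "Execution" analyzes again, so the IDs no longer match.
        return attr in analyze(self.columns).output


class AnalyzedFrame:
    """Keeps the analyzed plan, in the spirit of the proposed fix."""

    def __init__(self, columns):
        self.plan = analyze(columns)

    def resolve(self, name):
        return self.plan.resolve(name)

    def group_by_is_valid(self, attr):
        # The same plan is used for resolution and execution.
        return attr in self.plan.output


before = UnresolvedFrame(["key", "value"])
print(before.group_by_is_valid(before.resolve("key")))  # False: stale reference

after = AnalyzedFrame(["key", "value"])
print(after.group_by_is_valid(after.resolve("key")))  # True: reference still valid
```

In the toy, keeping one analyzed plan means the attribute handed back by `resolve` and the attributes seen at execution time come from the same tree, which is the property the proposed change relies on.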

SparkQA commented Mar 26, 2015

Test build #29258 has finished for PR 5217 at commit dd4dec1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 27, 2015

Test build #29261 has finished for PR 5217 at commit 1f98e2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 27, 2015
Author: Michael Armbrust

Closes #5217 from marmbrus/preanalyzer and squashes the following commits:

1f98e2d [Michael Armbrust] revert change
dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
089c52e [Michael Armbrust] WIP

(cherry picked from commit 5d9c37c)
Signed-off-by: Michael Armbrust
asfgit closed this in 5d9c37c Mar 27, 2015
marmbrus deleted the preanalyzer branch August 3, 2015 22:55