Skip to content

[SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences #2109

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

marmbrus
Copy link
Contributor

It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting AttributeReference objects into a standard Scala Set result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.

In this PR I introduce a new abstraction, an AttributeSet, which performs all comparisons using the globally unique ExpressionId instead of case class equality. (There is already a related class, AttributeMap) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.

I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the references of an Expression were the same. Thus, I moved this logic into the base class.

For query plans the semantics of the references method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it.

TODO:

  • Finish scala doc for AttributeSet
  • Scan the code for other instances of Set[Attribute] and refactor them.
  • Finish removing references from QueryPlan

@SparkQA
Copy link

SparkQA commented Aug 24, 2014

QA tests have started for PR 2109 at commit fc26b49.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 24, 2014

QA tests have finished for PR 2109 at commit fc26b49.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$
    • class AttributeEquals(val a: Attribute)
    • class AttributeSet(val baseSet: Set[AttributeEquals]) extends Traversable[Attribute]

@SparkQA
Copy link

SparkQA commented Aug 24, 2014

QA tests have started for PR 2109 at commit d6e16be.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 24, 2014

QA tests have finished for PR 2109 at commit d6e16be.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AttributeEquals(val a: Attribute)
    • class AttributeSet(val baseSet: Set[AttributeEquals]) extends Traversable[Attribute]

@chuxi
Copy link

chuxi commented Aug 25, 2014

I think references is used for transformExpression in QueryPlan to make sure the same attribute just traversed once when Analyzer is explaining the Attributes in a TreeNode (especially start to recursive from root treenode).

= = .....

@marmbrus marmbrus changed the title [WIP][SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences Aug 26, 2014
@marmbrus
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2109 at commit 40ce7f6.

  • This patch merges cleanly.

@@ -39,6 +39,7 @@ case class UnresolvedRelation(
alias: Option[String] = None) extends LeafNode {
override def output = Nil
override lazy val resolved = false
def reference = Set.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

references?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this should just be removed.

@rxin
Copy link
Contributor

rxin commented Aug 26, 2014

Hey @marmbrus,

Two questions:

  1. What if we just override equals in Attribute?
  2. Does AttributeSet provide anything that Scala's Set doesn't provide?

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

Tests timed out after a configured wait of 120m.

@marmbrus
Copy link
Contributor Author

What if we just override equals in Attribute?

This is actually how things were implemented originally. However, its really weird when AttributeReference("a"...) == AttrributeReference("b", ...). It breaks tests, and also makes doing transformations hard (we always try keep older trees instead of new ones when the transformation was a no-op).

Does AttributeSet provide anything that Scala's Set doesn't provide?

No, it is actually a subset, this class is just encapsulating a very common pattern where you create sets that contain expression ids. Furthermore, since an AttributeSet is not a Set[Attribute] you get nice type errors in many cases where you are trying to do semantically invalid things.

@marmbrus
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2109 at commit 40ce7f6.

  • This patch merges cleanly.

@rxin
Copy link
Contributor

rxin commented Aug 26, 2014

LGTM. Please add the answers to my questions directly into the code documentation :)

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2109 at commit 40ce7f6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class AttributeEquals(val a: Attribute)
    • class AttributeSet protected (val baseSet: Set[AttributeEquals]) extends Traversable[Attribute]

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2109 at commit 1c0dae5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2109 at commit 1c0dae5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class AttributeEquals(val a: Attribute)

@rxin
Copy link
Contributor

rxin commented Aug 26, 2014

Merging in master & branch-1.1.

@asfgit asfgit closed this in c4787a3 Aug 26, 2014
@marmbrus marmbrus deleted the attributeSets branch August 27, 2014 20:46
asfgit pushed a commit that referenced this pull request Aug 29, 2014
When tests time out we should link to the Jenkins console output for easy review. We already do this for when tests start or complete normally.

Here's [a recent example](#2109 (comment)) of where this would be helpful.

Author: nchammas <[email protected]>

Closes #2140 from nchammas/patch-1 and squashes the following commits:

3b26c8d [nchammas] [Spark QA] Link to console output on test time out
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…ns of AttributeReferences

It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.

In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.

I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` were the same.  Thus, I moved this logic into the base class.

For query plans the semantics of  the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?).  As a result, this method wasn't really used very much.  So, I removed it.

TODO:
 - [x] Finish scala doc for `AttributeSet`
 - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
 - [x] Finish removing `references` from `QueryPlan`

Author: Michael Armbrust <[email protected]>

Closes apache#2109 from marmbrus/attributeSets and squashes the following commits:

1c0dae5 [Michael Armbrust] work on serialization bug.
9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
3ae5288 [Michael Armbrust] review comments
40ce7f6 [Michael Armbrust] style
d577cc7 [Michael Armbrust] Scaladoc
cae5d22 [Michael Armbrust] remove more references implementations
d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
When tests time out we should link to the Jenkins console output for easy review. We already do this for when tests start or complete normally.

Here's [a recent example](apache#2109 (comment)) of where this would be helpful.

Author: nchammas <[email protected]>

Closes apache#2140 from nchammas/patch-1 and squashes the following commits:

3b26c8d [nchammas] [Spark QA] Link to console output on test time out
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants