-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences #2109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c27f33f
to
fc26b49
Compare
QA tests have started for PR 2109 at commit
|
QA tests have finished for PR 2109 at commit
|
QA tests have started for PR 2109 at commit
|
QA tests have finished for PR 2109 at commit
|
I think references is used for transformExpression in QueryPlan to make sure the same attribute just traversed once when Analyzer is explaining the Attributes in a TreeNode (especially start to recursive from root treenode). = = ..... |
Jenkins, test this please. |
QA tests have started for PR 2109 at commit
|
@@ -39,6 +39,7 @@ case class UnresolvedRelation( | |||
alias: Option[String] = None) extends LeafNode { | |||
override def output = Nil | |||
override lazy val resolved = false | |||
def reference = Set.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
references?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this should just be removed.
Hey @marmbrus, Two questions:
|
Tests timed out after a configured wait of |
This is actually how things were implemented originally. However, its really weird when
No, it is actually a subset, this class is just encapsulating a very common pattern where you create sets that contain expression ids. Furthermore, since an AttributeSet is not a Set[Attribute] you get nice type errors in many cases where you are trying to do semantically invalid things. |
Jenkins, test this please. |
QA tests have started for PR 2109 at commit
|
LGTM. Please add the answers to my questions directly into the code documentation :) |
QA tests have finished for PR 2109 at commit
|
QA tests have started for PR 2109 at commit
|
QA tests have finished for PR 2109 at commit
|
Merging in master & branch-1.1. |
When tests time out we should link to the Jenkins console output for easy review. We already do this for when tests start or complete normally. Here's [a recent example](#2109 (comment)) of where this would be helpful. Author: nchammas <[email protected]> Closes #2140 from nchammas/patch-1 and squashes the following commits: 3b26c8d [nchammas] [Spark QA] Link to console output on test time out
…ns of AttributeReferences It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal. In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality. (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32)) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators. I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the `references` of an `Expression` were the same. Thus, I moved this logic into the base class. For query plans the semantics of the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it. TODO: - [x] Finish scala doc for `AttributeSet` - [x] Scan the code for other instances of `Set[Attribute]` and refactor them. - [x] Finish removing `references` from `QueryPlan` Author: Michael Armbrust <[email protected]> Closes apache#2109 from marmbrus/attributeSets and squashes the following commits: 1c0dae5 [Michael Armbrust] work on serialization bug. 9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets 3ae5288 [Michael Armbrust] review comments 40ce7f6 [Michael Armbrust] style d577cc7 [Michael Armbrust] Scaladoc cae5d22 [Michael Armbrust] remove more references implementations d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes. fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
When tests time out we should link to the Jenkins console output for easy review. We already do this for when tests start or complete normally. Here's [a recent example](apache#2109 (comment)) of where this would be helpful. Author: nchammas <[email protected]> Closes apache#2140 from nchammas/patch-1 and squashes the following commits: 3b26c8d [nchammas] [Spark QA] Link to console output on test time out
It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting
AttributeReference
objects into a standard ScalaSet
result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.In this PR I introduce a new abstraction, an
AttributeSet
, which performs all comparisons using the globally uniqueExpressionId
instead of case class equality. (There is already a related class,AttributeMap
) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the
references
of anExpression
were the same. Thus, I moved this logic into the base class.For query plans the semantics of the
references
method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it.TODO:
AttributeSet
Set[Attribute]
and refactor them.references
fromQueryPlan