[SPARK-8432] [SQL] fix hashCode() and equals() of BinaryType in Row #6876
Test build #35112 has finished for PR 6876 at commit
for (int i = 0; i < arr.length; i++) {
  hash = hash * 37 + (int)arr[i];
}
return hash;
Is this the same as java.util.Arrays.hashCode?
Good idea, we should use that.
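To illustrate why the suggestion matters: Java arrays inherit identity-based equals() and hashCode() from Object, so two byte[] values with the same contents do not compare equal by default. The following is a minimal sketch, using a hypothetical BinaryHashDemo class that is not part of this PR, contrasting that behavior with the content-based helpers in java.util.Arrays. Note that Arrays.hashCode uses a multiplier of 31 and a seed of 1, so its values generally differ from those of a hand-written loop with multiplier 37, even though both depend only on the array contents.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration; not part of the PR.
final class BinaryHashDemo {
    // Content-based hash: arrays with equal contents always get equal hashes.
    static int contentHash(byte[] arr) {
        return Arrays.hashCode(arr);
    }

    // Content-based equality: compares the lengths and every element.
    static boolean contentEquals(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};
        // Object.equals on arrays compares references, not contents.
        System.out.println(a.equals(b));
        System.out.println(contentEquals(a, b));
        System.out.println(contentHash(a) == contentHash(b));
    }
}
```

Running main should print false for the identity comparison and true for the two content-based checks.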
Test build #35114 has finished for PR 6876 at commit
Test build #35144 has finished for PR 6876 at commit
Test build #35146 has finished for PR 6876 at commit
@marmbrus Could you help review this one?
/**
 * A generic version of Row.equals(Row), which is used for tests.
 */
@Override
Existing: can you add some javadoc to this class to explain what it's used for and why it's in Java?
Because Row is a trait and UnsafeRow and SpecificRow are both in Java, they cannot inherit the default implementations from Row, so BaseRow was created in Java for them. Now that we have InternalRow, this will be cleaned up in another PR.
@davies, thanks for working on this! I'm okay with this approach, but did you consider the alternative, where we instead change the internal type of
@marmbrus We're working toward a more efficient representation in Catalyst, and putting Array[Byte] inside a wrapper doesn't seem to go in that direction. I'd like to stick with this approach.
I think using a wrapper might be necessary for efficiency. For example, we will want to reuse the same byte array when reading from something like parquet, instead of needing to allocate one of the exact size each time (think
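For readers following the trade-off, here is a rough sketch of what such a wrapper could look like. Everything in it is an assumption made for illustration: the class name BinaryValue, its set method, and the choice to track a logical length over a possibly larger buffer do not come from this PR, which keeps raw byte arrays. The point is only that a wrapper can define content-based equals() and hashCode() over the used prefix of a reusable buffer.

```java
// Hypothetical wrapper for illustration; the PR itself keeps raw byte arrays.
final class BinaryValue {
    private byte[] buffer = new byte[0];
    private int length = 0;

    // Point this value at the first `length` bytes of `buffer` without copying,
    // so a reader could keep refilling one oversized buffer.
    void set(byte[] buffer, int length) {
        if (length < 0 || length > buffer.length) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        this.buffer = buffer;
        this.length = length;
    }

    int length() {
        return length;
    }

    // Equality looks only at the used prefix, not at the buffer capacity.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BinaryValue)) {
            return false;
        }
        BinaryValue other = (BinaryValue) o;
        if (length != other.length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buffer[i] != other.buffer[i]) {
                return false;
            }
        }
        return true;
    }

    // Same recurrence as java.util.Arrays.hashCode(byte[]), restricted to the prefix.
    @Override
    public int hashCode() {
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buffer[i];
        }
        return h;
    }
}
```

A mutable value like this would have to be copied before being stored as a key in a hash-based collection, since refilling the buffer changes its hash.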
Test build #35189 timed out for PR 6876 at commit
override def copy(): InternalRow = this

override def equals(o: Any): Boolean = {
  if (!o.isInstanceOf[Row]) {
Will we change it to isInstanceOf[InternalRow] after #6869?
yes
@@ -31,8 +45,16 @@ class LiteralExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
  }

  test("int literals") {
    checkEvaluation(Literal(1), 1)
    checkEvaluation(Literal(0L), 0L)
    List(0, 1, Int.MinValue, Int.MaxValue).foreach {
This is a pretty weird way of indenting. You can do
List(0, 1, Int.MinValue, Int.MaxValue).foreach { d =>
  ...
}
or
for (d <- List(0, 1, Int.MinValue, Int.MaxValue)) {
  ...
}
Test build #35225 has finished for PR 6876 at commit
Test build #35283 has finished for PR 6876 at commit
Test build #937 has finished for PR 6876 at commit
Test build #35284 has finished for PR 6876 at commit
test this please
Test build #939 has finished for PR 6876 at commit
Test build #35307 has finished for PR 6876 at commit
Test build #35322 has finished for PR 6876 at commit
@@ -127,6 +127,7 @@ object GenerateProjection extends CodeGenerator[Seq[Expression], Projection] {
      case FloatType => s"Float.floatToIntBits($col)"
      case DoubleType =>
        s"(int)(Double.doubleToLongBits($col) ^ (Double.doubleToLongBits($col) >>> 32))"
      case BinaryType => s"java.util.Arrays.hashCode($col)"
Should we also update equals for generated code?
That's already done
Ah, I see: genEqual already handles BinaryType.
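As a rough illustration of what the per-type expressions in this thread compute, the sketch below writes them out as plain Java. The class ColumnHashDemo, the three-column row shape, and the seed and multiplier used to combine column hashes are all assumptions made for the example; they are not taken from the generated projection code. Only the three per-type hash expressions mirror the diff above. Equality here compares floating-point columns by their bit patterns so that it stays consistent with the hashes, which is also an assumption rather than a description of genEqual.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration; not the code Spark generates.
final class ColumnHashDemo {
    static int hashFloat(float v) {
        return Float.floatToIntBits(v);
    }

    static int hashDouble(double v) {
        long bits = Double.doubleToLongBits(v);
        // Fold the upper 32 bits into the lower 32 bits.
        return (int) (bits ^ (bits >>> 32));
    }

    static int hashBinary(byte[] v) {
        // Content-based, unlike the identity hash a byte[] inherits from Object.
        return Arrays.hashCode(v);
    }

    // Combine the column hashes; the seed 17 and multiplier 37 are arbitrary choices.
    static int rowHash(float f, double d, byte[] b) {
        int result = 17;
        result = 37 * result + hashFloat(f);
        result = 37 * result + hashDouble(d);
        result = 37 * result + hashBinary(b);
        return result;
    }

    // Column-wise equality that agrees with rowHash: equal rows get equal hashes.
    static boolean rowEquals(float f1, double d1, byte[] b1,
                             float f2, double d2, byte[] b2) {
        return Float.floatToIntBits(f1) == Float.floatToIntBits(f2)
            && Double.doubleToLongBits(d1) == Double.doubleToLongBits(d2)
            && Arrays.equals(b1, b2);
    }
}
```

The design point is that equals and hashCode must agree: if the binary column were compared with Arrays.equals but hashed by identity, two equal rows could end up in different hash buckets.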
Test build #946 has finished for PR 6876 at commit
Conflicts: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Test build #35483 has finished for PR 6876 at commit
Thanks, merging to master.
Also added more tests in LiteralExpressionSuite