[SPARK-10466][SQL] UnsafeRow SerDe exception with data spill #8635

Conversation
```diff
@@ -72,7 +72,6 @@ private class UnsafeRowSerializerInstance(numFields: Int) extends SerializerInstance
   override def writeKey[T: ClassTag](key: T): SerializationStream = {
     // The key is only needed on the map side when computing partition ids. It does not need to
     // be shuffled.
-    assert(key.isInstanceOf[Int])
```
`key` may be null, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/UnsafeRowSerializer.scala#L146. This happens with external sorting (with data spill).
Wouldn't the right thing to do here be to allow nulls as well? In general it's a bad idea to remove assertions:

`assert(key == null || key.isInstanceOf[Int])`
How about changing the dummy value to a number (-1) instead of null?
Yeah, I like that better. We can't have a partition ID of -1, whereas `null.asInstanceOf[Int]` may be confused with the partition ID of 0.

cc @rxin
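For illustration only (not from the thread), a minimal Scala sketch of why a null dummy key is ambiguous while -1 is not:

```scala
// null.asInstanceOf[Int] unboxes to Int's default value 0, which is
// indistinguishable from a legitimate partition ID of 0; -1 can never
// collide with a real partition ID, which is always >= 0.
object DummyKeyDemo {
  def main(args: Array[String]): Unit = {
    val fromNull: Int = null.asInstanceOf[Int]
    println(fromNull)        // prints 0 -- looks like partition 0
    println(fromNull == 0)   // true: ambiguous with a real partition ID
    println(-1 >= 0)         // false: -1 is safely outside the partition ID range
  }
}
```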
Test build #42079 has finished for PR 8635 at commit
Force-pushed from a99683b to 229ce8a
Test build #42090 has finished for PR 8635 at commit
@chenghao-intel Can you add a unit test?

By the way, I was able to come up with a smaller reproduction (it is quoted in the merged commit message at the end of this thread):

Yes, that's simpler for a unit test, I will steal it. :)
Force-pushed from 229ce8a to c47c53c
```scala
import org.apache.spark.{SparkFunSuite, SparkContext, SparkConf}

class MiniSparkSQLClusterSuite extends SparkFunSuite {
```
Ideally we wouldn't need to create a special unit test just for this bug fix; however, there are some other issues that probably require re-creating the SparkContext with a different SparkConf. For example: https://issues.apache.org/jira/browse/SPARK-10474
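As an illustration only (not code from this PR), a minimal sketch of the kind of suite described here, re-creating SparkContext with a spill-friendly SparkConf; the config values come from the reproduction in the commit message, and the helper name is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}

class MiniSparkSQLClusterSuite extends SparkFunSuite {
  // Hypothetical helper: builds a SparkContext whose conf forces early spills,
  // then tears it down so other suites can create their own contexts.
  private def withSpillingContext(body: SparkContext => Unit): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("unsafe-row-spill-test")
      .set("spark.shuffle.memoryFraction", "0.005")         // tiny shuffle memory => spill early
      .set("spark.shuffle.sort.bypassMergeThreshold", "0")  // always use sort-based shuffle
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }
}
```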
@chenghao-intel thanks for adding the test. When I posted the code reproduction it wasn't meant as unit test code, but as a way for those following this issue to reproduce it. Given that we understand the root cause of this issue, I would prefer to have a finer-grained test that doesn't rely on thresholds.
Thank you @andrewor14, I agree, it's too tricky to write a unit test like that; I will follow your idea and re-write the unit test.
Test build #42178 has finished for PR 8635 at commit
Force-pushed from c47c53c to 7f09a62
```scala
// Make sure it spilled
assert(sc.env.blockManager.diskBlockManager.getAllFiles().length > 0)

assert(sorter.writePartitionedFile(shuffleBlockId, taskContext, outputFile).sum > 0)
```
An exception would be thrown here if we didn't change the UnsafeRowSerializer as above.
@andrewor14 it seems very difficult to write a very simple unit test. Those mock utilities should be helpful, as I found some other interesting bugs, and I can continue fixing them once this is merged.
Test build #42202 has finished for PR 8635 at commit
LGTM
@chenghao-intel thanks for taking the time to write the test. However I think it is much more complicated than necessary. I was able to add the same test to the existing suite.
Force-pushed from 7f09a62 to b8dd7eb
Thank you @andrewor14, your code is much simpler, I took it already. :)
Test build #1734 has finished for PR 8635 at commit

Test build #1733 has finished for PR 8635 at commit

Test build #42235 has finished for PR 8635 at commit

retest this please

Test build #42236 has finished for PR 8635 at commit

Test build #42241 has finished for PR 8635 at commit

retest this please

Test build #42263 has finished for PR 8635 at commit
The latest commit actually already passed tests:

LGTM, I'm merging this into master and 1.5. Thanks @chenghao-intel.
```scala
  converter(row)
}

private def unsafeRowConverter(schema: Array[DataType]): Row => UnsafeRow = {
```
this method seems strictly unnecessary... we can just remove it in the future.
Actually `UnsafeProjection.create(schema)` will do the codegen stuff, and this takes a long time if we have to generate a large amount of `UnsafeRow`s.
I mean we can just inline it in `toUnsafeRow`. There's no reason why it needs to be its own method.
Yes, I see what you mean. If we inline that in `toUnsafeRow`, then for every call of `toUnsafeRow` we will get a new converter instance for the schema, and that is actually very expensive, since the converter instance is created through codegen internally.

Probably we'd better remove the function `toUnsafeRow` in the future, since it always causes a performance problem and people may not even notice that.
Data Spill with UnsafeRow causes assert failure.

```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```

To reproduce that with code (thanks andrewor14):

```scala
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```

Author: Cheng Hao <[email protected]>

Closes #8635 from chenghao-intel/unsafe_spill.

(cherry picked from commit e048111)
Signed-off-by: Andrew Or <[email protected]>
Data Spill with UnsafeRow causes assert failure.
To reproduce that with code (thanks @andrewor14):
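(Snippet quoted from the merged commit message above.)

```scala
// Copied from the merged commit message: tiny shuffle memory and
// sort-based shuffle force the aggregation to spill, triggering the assert.
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```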