[SPARK-10466][SQL] UnsafeRow SerDe exception with data spill #8635

Conversation
```diff
@@ -72,7 +72,6 @@ private class UnsafeRowSerializerInstance(numFields: Int) extends SerializerInstance
   override def writeKey[T: ClassTag](key: T): SerializationStream = {
     // The key is only needed on the map side when computing partition ids. It does not need to
     // be shuffled.
-    assert(key.isInstanceOf[Int])
```
`key` may be null, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/UnsafeRowSerializer.scala#L146. This happens with external sorting (with data spill).
Wouldn't the right thing to do here be to allow nulls as well? In general it's a bad idea to remove assertions:

`assert(key == null || key.isInstanceOf[Int])`
How about changing the dummy value to a number (-1) instead of null?
Yeah, I like that better. We can't have a partition ID of -1, whereas `null.asInstanceOf[Int]` may be confused with the partition ID of 0.

cc @rxin
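For illustration only (not from the thread), a minimal Scala sketch of why a null dummy key is ambiguous while -1 is not:

```scala
// null.asInstanceOf[Int] unboxes to Int's default value 0, which is
// indistinguishable from a legitimate partition ID of 0; -1 can never
// collide with a real partition ID, which is always >= 0.
object DummyKeyDemo {
  def main(args: Array[String]): Unit = {
    val fromNull: Int = null.asInstanceOf[Int]
    println(fromNull)        // prints 0 -- looks like partition 0
    println(fromNull == 0)   // true: ambiguous with a real partition ID
    println(-1 >= 0)         // false: -1 is safely outside the partition ID range
  }
}
```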
Test build #42079 has finished for PR 8635 at commit
Force-pushed from a99683b to 229ce8a
Test build #42090 has finished for PR 8635 at commit
@chenghao-intel Can you add a unit test?

By the way, I was able to come up with a smaller reproduction (it is quoted in the merged commit message at the end of this thread):

Yes, that's simpler for a unit test, I will steal it. :)
Force-pushed from 229ce8a to c47c53c
```scala
import org.apache.spark.{SparkFunSuite, SparkContext, SparkConf}

class MiniSparkSQLClusterSuite extends SparkFunSuite {
```
Ideally we wouldn't need to create a special unit test just for this bug fix; however, there are some other issues that probably require re-creating the SparkContext with a different SparkConf. For example: https://issues.apache.org/jira/browse/SPARK-10474
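As an illustration only (not code from this PR), a minimal sketch of the kind of suite described here, re-creating SparkContext with a spill-friendly SparkConf; the config values come from the reproduction in the commit message, and the helper name is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}

class MiniSparkSQLClusterSuite extends SparkFunSuite {
  // Hypothetical helper: builds a SparkContext whose conf forces early spills,
  // then tears it down so other suites can create their own contexts.
  private def withSpillingContext(body: SparkContext => Unit): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("unsafe-row-spill-test")
      .set("spark.shuffle.memoryFraction", "0.005")         // tiny shuffle memory => spill early
      .set("spark.shuffle.sort.bypassMergeThreshold", "0")  // always use sort-based shuffle
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }
}
```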
@chenghao-intel thanks for adding the test. When I posted the code reproduction it wasn't meant as unit test code, but as a way for those following this issue to reproduce it. Given that we understand the root cause of this issue, I would prefer to have a finer-grained test that doesn't rely on thresholds.
Thank you @andrewor14, I agree, it's too tricky to write a unit test like that; I will follow your idea and re-write the unit test.
Test build #42178 has finished for PR 8635 at commit
Force-pushed from c47c53c to 7f09a62
```scala
// Make sure it spilled
assert(sc.env.blockManager.diskBlockManager.getAllFiles().length > 0)

assert(sorter.writePartitionedFile(shuffleBlockId, taskContext, outputFile).sum > 0)
```
An exception would be thrown here if we didn't change the UnsafeRowSerializer as above.
@andrewor14 it seems very difficult to write a very simple unit test. Those mock utilities should be helpful, as I found some other interesting bugs, and I can continue fixing them once this is merged.
Test build #42202 has finished for PR 8635 at commit
LGTM
@chenghao-intel thanks for taking the time to write the test. However I think it is much more complicated than necessary. I was able to add the same test to the existing suite.
Force-pushed from 7f09a62 to b8dd7eb
Thank you @andrewor14, your code is much simpler, I took it already. :)
Test build #1734 has finished for PR 8635 at commit

Test build #1733 has finished for PR 8635 at commit

Test build #42235 has finished for PR 8635 at commit

retest this please

Test build #42236 has finished for PR 8635 at commit

Test build #42241 has finished for PR 8635 at commit

retest this please

Test build #42263 has finished for PR 8635 at commit
The latest commit actually already passed tests:

LGTM, I'm merging this into master and 1.5. Thanks @chenghao-intel.
```scala
  converter(row)
}

private def unsafeRowConverter(schema: Array[DataType]): Row => UnsafeRow = {
```
this method seems strictly unnecessary... we can just remove it in the future.
Actually `UnsafeProjection.create(schema)` will do the codegen stuff, and this takes a long time if we have to generate a large amount of `UnsafeRow`s.
I mean we can just inline it in `toUnsafeRow`. There's no reason why it needs to be its own method.
Yes, I see what you mean. If we inline that in `toUnsafeRow`, then for every call of `toUnsafeRow` we will get a new converter instance for the schema, and that is actually very expensive, since the converter instance is created through codegen internally.

Probably we'd better remove the function `toUnsafeRow` in the future, since it always causes a performance problem and people may not even notice that.
Data Spill with UnsafeRow causes assert failure.

```
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```

To reproduce that with code (thanks andrewor14):

```scala
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```

Author: Cheng Hao <[email protected]>

Closes #8635 from chenghao-intel/unsafe_spill.

(cherry picked from commit e048111)
Signed-off-by: Andrew Or <[email protected]>
Data Spill with UnsafeRow causes assert failure.
To reproduce that with code (thanks @andrewor14):
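(Snippet quoted from the merged commit message above.)

```scala
// Copied from the merged commit message: tiny shuffle memory and
// sort-based shuffle force the aggregation to spill, triggering the assert.
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```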