
[SPARK-18890][CORE] Move task serialization from the TaskSetManager to the CoarseGrainedSchedulerBackend #15505


Closed
wants to merge 14 commits

Conversation

witgo
Contributor

@witgo witgo commented Oct 16, 2016

What changes were proposed in this pull request?

Performance Testing:

The code:

// Create an RDD with 100000 partitions so that per-task scheduling and
// serialization overhead dominates the job's run time.
val rdd = sc.parallelize(0 until 100).repartition(100000)
rdd.localCheckpoint().count()
rdd.sum() // warm-up run
// Time ten runs of a job over the checkpointed partitions.
(1 to 10).foreach { i =>
  val start = System.currentTimeMillis()
  rdd.sum()
  val finish = System.currentTimeMillis()
  println(f"Test $i: ${(finish - start) / 1000D}%1.2f s")
}

and spark-defaults.conf file:

spark.master                                      yarn-client
spark.executor.instances                          20
spark.driver.memory                               64g
spark.executor.memory                             30g
spark.executor.cores                              5
spark.default.parallelism                         100 
spark.sql.shuffle.partitions                      100
spark.serializer                                  org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize                        0
spark.ui.enabled                                  false 
spark.driver.extraJavaOptions                     -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M 
spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M 
spark.cleaner.referenceTracking.blocking          true
spark.cleaner.referenceTracking.blocking.shuffle  true

The test results are as follows:

           SPARK-17931   db0ddce
Run time   9.427 s       9.566 s

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Oct 16, 2016

Test build #67033 has finished for PR 15505 at commit b51d00c.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2016

Test build #67035 has finished for PR 15505 at commit 771949e.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [WIP][SPARK-17931]taskScheduler has some unneeded serialization [SPARK-17931]taskScheduler has some unneeded serialization Oct 17, 2016
@SparkQA

SparkQA commented Oct 17, 2016

Test build #67062 has finished for PR 15505 at commit bee165a.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [SPARK-17931]taskScheduler has some unneeded serialization [SPARK-17931][CORE] taskScheduler has some unneeded serialization Oct 17, 2016
@SparkQA

SparkQA commented Oct 17, 2016

Test build #67056 has finished for PR 15505 at commit 8a6062d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor

wzhfy commented Oct 18, 2016

There are many unnecessary changes; can you revert them to minimize the diff? That'll be easier for others to review. :)

@witgo
Contributor Author

witgo commented Oct 18, 2016

@wzhfy
OK, the code has been modified.

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67109 has finished for PR 15505 at commit d956ff5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67110 has finished for PR 15505 at commit ca9da40.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67126 has finished for PR 15505 at commit 80eed8f.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2016

Test build #67159 has finished for PR 15505 at commit 589f3bb.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor Author

witgo commented Oct 19, 2016

cc @rxin

@witgo witgo changed the title [SPARK-17931][CORE] taskScheduler has some unneeded serialization [WIP][SPARK-17931][CORE] taskScheduler has some unneeded serialization Oct 19, 2016
@SparkQA

SparkQA commented Oct 19, 2016

Test build #67179 has finished for PR 15505 at commit 84488c4.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2016

Test build #67184 has finished for PR 15505 at commit 8a6b37e.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67242 has finished for PR 15505 at commit 07b0581.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo witgo changed the title [WIP][SPARK-17931][CORE] taskScheduler has some unneeded serialization [SPARK-17931][CORE] taskScheduler has some unneeded serialization Nov 4, 2016
@SparkQA

SparkQA commented Nov 4, 2016

Test build #68141 has finished for PR 15505 at commit ace0114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 4, 2016

This is pretty big and will miss 2.1.

But cc @kayousterhout / @squito / @JoshRosen

@@ -486,7 +481,7 @@ private[spark] class Executor(
* Download any missing dependencies if we receive a new set of files and JARs from the
* SparkContext. Also adds any new JARs we fetched to the class loader.
*/
-  private def updateDependencies(newFiles: HashMap[String, Long], newJars: HashMap[String, Long]) {
+  private def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]) {
Contributor

Is this change necessary?

Contributor Author

This is not necessary, I will change it back.

Contributor Author

@witgo witgo Nov 7, 2016

@wzhfy These parameters come from sc.addedFiles and sc.addedJars, whose types are mutable.Map[String, Long], so this change is reasonable.

@witgo
Contributor Author

witgo commented Nov 21, 2016

ping @kayousterhout / @squito / @JoshRosen

Contributor

@squito squito left a comment


Just did a very brief review -- the idea here makes a lot of sense. The only big problem I see with the patch now is the tests that have been eliminated, some of those we definitely need to bring back. But I need to do a longer pass to think about how the pieces fit together.

_serializedTask: ByteBuffer)
extends Serializable {
private[spark] class TaskDescription private(
val taskId: Long,
Contributor

nit: double indent (4 spaces) for constructor params

Contributor Author

ok

private var taskProps: Properties) {

def this(taskId: Long,
attemptNumber: Int,
Contributor

nit: each parameter on its own line (even the first one), and double indent all params

Contributor Author

ok

@@ -139,29 +139,6 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
assert(!failedTaskSet)
}

test("Scheduler does not crash when tasks are not serializable") {
Contributor

doesn't look like there is any replacement for this test, right? We certainly want to keep this check in some form.

Contributor Author

Yes.
For the RDD and closures, there are already related test cases; see DAGSchedulerSuite.scala#L506.
For the task itself, a user can use a custom partition, and the partition instance may not be serializable. There is no test case for this; I will add one.
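A minimal, Spark-free sketch of the failure mode such a test would exercise (all names here are illustrative, not Spark's): a partition-like object that is itself marked Serializable still fails Java serialization when it holds a non-serializable field.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Deliberately not Serializable: stands in for e.g. a file handle or connection.
class NonSerializableHandle

// Partition-like class: marked Serializable itself, but poisoned by its field.
class PartitionLike(val index: Int) extends Serializable {
  val handle = new NonSerializableHandle
}

// Returns true iff the object survives Java serialization.
def serializes(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

A test along these lines would build a task set over such a partition and assert that the scheduler aborts the task set instead of crashing.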

Contributor Author

The test case has been added.

Contributor

@squito squito Dec 19, 2016

If I understand right, you are saying there are now test cases which cover all of the little individual pieces inside a Task, so we don't need an explicit test for serializing the entire task.

I'd still prefer a test for serializing the entire task (to prevent future regressions), but I guess it's hard to do because now the task actually gets serialized by the SchedulerBackends.

@@ -592,47 +579,6 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
assert(manager.resourceOffer("execB", "host2", RACK_LOCAL).get.index === 1)
}

test("do not emit warning when serialized task is small") {
Contributor

same thing on replacements for these tests. we definitely want to keep the test on not-serializability. The others should probably stay as well, unless there is some reason why its extremely hard to do.

Contributor Author

For ShuffleMapTask and ResultTask, since they do not contain an RDD instance, they are very small.
Checking the size of the serialized RDD should be more reasonable. I added the corresponding code in the DAGScheduler class, but did not add a test case.
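As a rough illustration of the check being debated here (names and the threshold are assumptions for this sketch; the 100 KB value mirrors Spark's default task-size warning), the logic is just a comparison of the serialized size against a warn threshold:

```scala
import java.nio.ByteBuffer

// Hypothetical sketch of a serialized-task-size warning check; the names
// here are illustrative, not Spark's actual code.
val TaskSizeWarnKb = 100

def taskSizeWarning(taskId: Long, serialized: ByteBuffer): Option[String] = {
  val kb = serialized.limit() / 1024
  if (kb > TaskSizeWarnKb)
    Some(s"Task $taskId is $kb KB, which exceeds the $TaskSizeWarnKb KB warning threshold.")
  else
    None
}
```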

Contributor

This may be a stretch, but Task could be big, right? E.g. if there were a long chain of partitions, and for some reason they had a lot of data? That seems unlikely, but I also have no idea whether it ever happens. It seems easy enough to add the check back in -- is there any reason not to?

Contributor

Yeah I agree with @squito that we should keep this test / check. I've seen many huge task sizes due to a developer mistake, so think we should absolutely keep warning about it.

Contributor Author

Ok, I will add this test case back.

taskMetrics.incMemoryBytesSpilled(10)
override def runTask(tc: TaskContext): Int = 0
}
val taskDesc = new TaskDescription(1L, 0, "s1", "n1", 0,
Contributor

nit: with so many args, all with pretty generic types, it's really helpful to name each one, e.g.

TaskDescription(
  taskId = 1L,
  attemptNumber = 0,
  ...

(here and elsewhere)

Contributor Author

Okay, I'll change it.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73694 has finished for PR 15505 at commit 335b7b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73721 has finished for PR 15505 at commit b2b1eec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor Author

witgo commented Mar 2, 2017

@kayousterhout It will take me some time to update the test report.

@kayousterhout
Contributor

@witgo OK I'll hold off on doing another pass on the code until you have the test results.

@witgo
Contributor Author

witgo commented Mar 2, 2017

@kayousterhout
Test results have been updated:

           SPARK-17931   db0ddce
Run time   9.427 s       9.566 s

@witgo
Contributor Author

witgo commented Mar 2, 2017

I don't know which PR caused the run time of this test case to drop from 21.764 s to 9.566 s.

@kayousterhout
Contributor

@witgo I don't think the ~1.5% improvement in runtime merits the added complexity of this change. I could be convinced to merge this if it simplified the code or the ability to reason about the code, but unfortunately this makes things somewhat more complicated because of the new logic about aborting task sets.

@mridulm
Contributor

mridulm commented Mar 3, 2017

@kayousterhout I am surprised it is not more, but I agree that the added complexity for such low returns is not worth it.

@witgo
Contributor Author

witgo commented Mar 3, 2017

Yes; maybe multithreaded task serialization would perform better. Let me close the PR.

@witgo witgo closed this Mar 3, 2017
@witgo
Contributor Author

witgo commented Mar 3, 2017

The SPARK-18890_20170303 branch's code is older, but the test case running time is 5.2 s.

@libratiger

libratiger commented May 16, 2017

I agree with Kay that putting in a smaller change first is better, assuming it still has the performance gains. That doesn't preclude any further optimizations that are bigger changes.

I'm a little surprised that serializing tasks has much of an impact, given how little data is getting serialized. But if it really does, I feel like there is a much bigger optimization we're completely missing. Why are we repeating the work of serialization for each task in a taskset? The serialized data is almost exactly the same for every task. They only differ in the partition id (an int) and the preferred locations (which aren't even used by the executor at all).

Task serialization already leverages the idea of having info across all the tasks in the Broadcast for the task binary. We just need to use that same idea for all the rest of the task data that is sent to the executor. Then the only difference between the serialized task data sent to executors is the int for the partitionId. You'd serialize into a bytebuffer once, and then your per-task "serialization" becomes copying the buffer and modifying that int directly.
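A sketch of that idea in plain Scala (java.nio only; the fixed layout and names are assumptions for illustration, not Spark's actual wire format): serialize the shared payload once with a reserved slot for the partition id, then "serialize" each task by copying the template and patching that one int.

```scala
import java.nio.ByteBuffer

// Hypothetical sketch: serialize the common task payload once, reserve a
// fixed 4-byte slot for the partition id, and per task only copy the
// template and overwrite that int.
object TaskTemplate {
  // Layout assumption: bytes 0..3 hold the partition id, the rest is the
  // payload shared by every task in the task set.
  def build(sharedPayload: Array[Byte]): ByteBuffer = {
    val buf = ByteBuffer.allocate(4 + sharedPayload.length)
    buf.putInt(0)          // placeholder for the partition id
    buf.put(sharedPayload) // shared serialized data, written exactly once
    buf.flip()
    buf
  }

  // Per-task "serialization": duplicate the template and patch the int in place.
  def forPartition(template: ByteBuffer, partitionId: Int): ByteBuffer = {
    val copy = ByteBuffer.allocate(template.remaining())
    copy.put(template.duplicate())  // duplicate() leaves the template untouched
    copy.putInt(0, partitionId)     // absolute put: overwrite bytes 0..3
    copy.flip()
    copy
  }
}
```

The per-task cost then becomes a buffer copy plus a 4-byte write, independent of how expensive the original serialization was.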

@squito I like this idea very much. I just encountered a case where the deserialization time is too long (more than 10 s for some tasks). Is there any PR trying to solve this? If there is no related PR, I would like to open an issue and try to solve it. :-D

@squito
Contributor

squito commented May 30, 2017

@djvulee this is https://issues.apache.org/jira/browse/SPARK-19108. Note that there is some discussion there about this being a bit harder than what I originally thought, though I think it's still worth exploring where task serialization is an issue.
