[SPARK-8881][SPARK-9260] Fix algorithm for scheduling executors on workers #7274

Closed
wants to merge 13 commits into from

Conversation

nishkamravi2
Contributor

The current scheduling algorithm allocates one core at a time and, in doing so, ends up ignoring spark.executor.cores. As a result, when spark.cores.max / spark.executor.cores (i.e., num_executors) < num_workers, executors are not launched and the app hangs. This PR fixes and refactors the scheduling algorithm.
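The fix can be sketched roughly as follows. This is a minimal, hypothetical Scala sketch of allocating a whole executor's worth of cores at a time; the names `Worker`, `scheduleExecutors`, and the parameters are illustrative only and are not the actual Master.scala API.

```scala
// Minimal sketch (assumed/simplified, NOT the actual Master.scala code):
// allocate a whole executor's worth of cores at a time, so that
// spark.executor.cores is respected and the loop terminates once no
// worker can host another full executor.
case class Worker(coresFree: Int, memoryFree: Int)

def scheduleExecutors(
    workers: IndexedSeq[Worker],
    coresMax: Int,          // spark.cores.max
    coresPerExecutor: Int,  // spark.executor.cores
    memoryPerExecutor: Int): Array[Int] = {
  val assignedCores  = Array.fill(workers.length)(0)
  val assignedMemory = Array.fill(workers.length)(0)
  var coresToAssign  = coresMax

  // A worker can host another executor only if it has a full executor's
  // worth of cores and memory still unassigned.
  def canLaunch(pos: Int): Boolean =
    workers(pos).coresFree  - assignedCores(pos)  >= coresPerExecutor &&
    workers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor

  // Keep scheduling only while a full executor still fits somewhere.
  while (coresToAssign >= coresPerExecutor && workers.indices.exists(canLaunch)) {
    val pos = workers.indices.find(canLaunch).get
    assignedCores(pos)  += coresPerExecutor
    assignedMemory(pos) += memoryPerExecutor
    coresToAssign       -= coresPerExecutor
  }
  assignedCores
}
```

With 4 workers of 16 cores each, coresMax = 48, and coresPerExecutor = 16, this assigns 16 cores to each of three workers and leaves the fourth idle, instead of spreading 12 cores everywhere and launching nothing.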

@andrewor14

while (toAssign > 0) {
  if (usableWorkers(pos).coresFree - assignedCores(pos) >= coresPerExecutor &&
      usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor) {
    toAssign -= coresPerExecutor
Member

If I understand what you're trying to change, this won't help. If there aren't enough cores on any worker, then this becomes an infinite loop.

Contributor Author

Please read the code carefully.

Member

Eh, sorry, that is not addressing my question, but I think I see the situation now: I need 16 cores for two 8-core executors and I have 4 workers, so each worker fails to have enough cores to launch an executor anywhere? An example or a test would be really helpful.

If so, I think this is also fixable by just considering no more workers than executors.

Member

Also, yes, I see you still have the filtering on cores available, so this shouldn't keep looping over workers, right? Unless the available count can drop while this is in progress, but that is either not a problem or already a problem, so it's not directly relevant.

Contributor Author

Consider the following: 4 workers each with 16 cores, spark.cores.max=48, spark.executor.cores = 16. When we spread out, we allocate one core at a time and in doing so end up allocating 12 cores from each worker. First, we ended up ignoring spark.executor.cores during allocation, which isn't right. Second, when the following condition is checked: while (coresLeft >= coresPerExecutor), coresLeft is 12 and coresPerExecutor is 16. As a result, executors don't launch.
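The arithmetic in this example can be checked directly. This is a hypothetical standalone snippet, not Spark code; it just replays the numbers above.

```scala
// Reproducing the scenario above: 4 workers x 16 cores,
// spark.cores.max = 48, spark.executor.cores = 16.
val numWorkers = 4
val coresMax = 48
val coresPerExecutor = 16

// Old algorithm (spread-out mode): hand out one core at a time,
// round-robin across all workers.
val coresLeftPerWorker = coresMax / numWorkers // 12 cores end up on each worker

// Per-worker launch check: while (coresLeft >= coresPerExecutor) ...
val executorsLaunched = coresLeftPerWorker / coresPerExecutor // 12 / 16 == 0

// No worker ever holds 16 assigned cores, so nothing launches and the
// app hangs waiting for resources.
```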

Contributor Author

Yep, that's right.

Member

Yes, makes sense. Is it maybe more direct to never spread the allocation over more than 3 workers in this case, since only 3 executors are needed? Same effect, but I also see the value in allocating whole executors' worth of cores at a time for clarity.

Contributor Author

Yep. Allocating spark.executor.cores at a time is cleaner and directly enforces semantics.

@nishkamravi2
Contributor Author

Overview of changes:

  1. scheduleExecutorsOnWorkers rewritten and separated out (so it can be unit tested)
  2. allocateWorkerResourceToExecutors modified accordingly and simplified

Comments:

  1. The two while loops in scheduleExecutorsOnWorkers can potentially be fused into one
  2. Would be good to add a couple of unit tests (we don't have any for executor scheduling at the moment)

@andrewor14
Contributor

add to whitelist

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36810 has finished for PR 7274 at commit 66362d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36831 has finished for PR 7274 at commit 2d6371c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nishkamravi2
Contributor Author

Can this be retested please?

@SparkQA

SparkQA commented Jul 9, 2015

Test build #1016 has finished for PR 7274 at commit 2d6371c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nishkamravi2
Contributor Author

Hey @andrewor14, not sure what to make of these test results. Are you able to see which tests failed?

@andrewor14
Contributor

Not sure... retest this please

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36897 has finished for PR 7274 at commit 2d6371c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nishkamravi2
Contributor Author

Can this be retested please?

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36923 has finished for PR 7274 at commit 5d6a19c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jul 9, 2015

EventLoggingListenerSuite seems to be failing regularly. Does it pass when you run locally?

@nishkamravi2
Contributor Author

Thanks Imran. I thought my local run had gone through; will check again. Btw, were you able to make this out from the test results tab or by scanning the console output?

@SparkQA

SparkQA commented Jul 10, 2015

Test build #37005 has finished for PR 7274 at commit c11c689.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2015

Test build #37007 has finished for PR 7274 at commit 40c8f9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2015

Test build #37032 has finished for PR 7274 at commit a06da76.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 14, 2015

Test build #1064 has finished for PR 7274 at commit a06da76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 14, 2015

This looks pretty good to my eyes, but it would be good to get another pair of eyes on it. @squito @andrewor14 and maybe @sryza, what do you think of the logic refactoring? I think it preserves the original behavior and fixes the issue at hand.

@andrewor14
Contributor

retest this please

var coresLeft = coresToAllocate
while (coresLeft >= coresPerExecutor && worker.memoryFree >= memoryPerExecutor) {

var numExecutors = assignedCores/coresPerExecutor
Contributor

please add spaces around /

Contributor

also, this can be a val

Contributor

can you add a comment here stating your implicit assumptions:

// If cores per executor is specified, then this division should have a remainder of zero

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37684 has finished for PR 7274 at commit f279cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

usableWorkers(pos).memoryFree - assignedMemory(pos) >= memoryPerExecutor) {
coresToAssign -= coresPerExecutor
assignedCores(pos) += coresPerExecutor
assignedMemory(pos) += memoryPerExecutor
Contributor

So I stared at this loop for a little bit and I think it could bring us into an infinite loop.

E.g. We have 3 workers, with 3, 3, and 4 free cores respectively, so that coresToAssign == 10. Now let's say coresPerExecutor == 3, so after allocating 3 executors we end up with coresToAssign == 1. What happens next? Well, none of the usable workers can accommodate a new executor, and coresToAssign > 0 is still true, so this loop never exits.

Let me know if I'm missing something.
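The termination concern can be sketched like this. A hypothetical, simplified snippet (not the PR's actual loop) for the 3/3/4-core scenario described above, with a guard that also requires some worker to still fit a full executor:

```scala
// Sketch of the 3/3/4-core scenario above (hypothetical, simplified).
val coresPerExecutor = 3
val freeCores = Array(3, 3, 4) // free cores on each of 3 workers
var coresToAssign = 10

def canLaunch(pos: Int): Boolean = freeCores(pos) >= coresPerExecutor

// Guarding only on (coresToAssign > 0) would spin forever once no worker
// fits another executor. Also requiring that some worker can still launch,
// and that a full executor's worth of cores remains, guarantees exit.
while (coresToAssign >= coresPerExecutor && freeCores.indices.exists(canLaunch)) {
  val pos = freeCores.indices.find(canLaunch).get
  freeCores(pos) -= coresPerExecutor
  coresToAssign -= coresPerExecutor
}
// Exits with coresToAssign == 1: the one leftover core simply goes unserved.
```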

Contributor

(same for the non spread out case)

Member

Wouldn't app.coresLeft be a multiple of 3 in this case? So 9 or 12 rather than 10? But yeah, it still raises the question of what happens if there simply aren't enough cores on one worker: I want 4x 3-core executors, and I have 3x 4-core workers. It will never schedule. Previously, I think we'd just manage to schedule 3x 3-core executors, but I think this would keep looping. There needs to be some logic for detecting when no worker is left that could possibly fit another.

I haven't thought this through either, but are there race condition problems here too? As long as the worst case is just that resources that looked available aren't anymore and we fail to schedule, that's fine.

Contributor Author

"resources that looked available aren't anymore and fail to schedule, that's fine." This is the assumption being made here. If the user didn't care about the size of the executor, they would skip executor.cores and the algorithm would proceed as before (best-effort: one-core at a time). If they do, we should either schedule as requested or not at all. If we care to be extra-friendly, we could add a check to log a message from within the loop: "Not enough resources, please check spark.cores.max and spark.executor.cores" ?

Member

Seems OK but I think there is an infinite loop problem here still?

Contributor Author

We could potentially return assignedCores that we have thus far and proceed with scheduling. But as discussed earlier, we are better off failing than scheduling incorrectly. Do you feel otherwise?

Contributor Author

Didn't see your note. I would think that by failing and allowing the user to reconfigure, we would be doing them a favor. But I can see the value in scheduling whatever we can as well.

Contributor Author

Now we have both versions. We can choose to keep this or revert to the previous one.

Member

I think the previous behavior was to schedule as much as possible, since before it would only try to assign as many cores as are available, not necessarily as many as are requested. If so, I think it's best to retain that behavior.

Contributor

Yeah, I don't think it's really a user error. The contract of setting spark.executor.cores is that every executor has exactly that many cores. If the total number of cores across the cluster is not a multiple of that, then there will be some unused cores, but scheduling should still work.

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37934 has finished for PR 7274 at commit 79084e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -533,6 +533,7 @@ private[master] class Master(

/**
* Schedule executors to be launched on the workers.
* Returns an array containing number of cores assigned to each worker (None if scheduling fails)
Member

Nit, for if another change is needed: this could be a @return tag

Contributor

also, this shouldn't say None anymore

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38019 has finished for PR 7274 at commit da0f491.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

freeWorkers = freeWorkers.filter(canLaunchExecutor)
freeWorkers.foreach { pos =>
  var keepScheduling = true
  while (keepScheduling && canLaunchExecutor(pos) && coresToAssign > 0) {
Contributor

I just tested this out locally myself and found a bug. The comparison here should be coresToAssign >= coresPerExecutor. Otherwise, we could end up allocating more than spark.cores.max. Same in L573.

E.g. spark.executor.cores == 3, and spark.cores.max == 10, then we'll allocate 12 cores because in the last iteration coresToAssign == 1 > 0.
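The effect of the corrected guard can be sketched with those same numbers (hypothetical standalone snippet, not the actual Master.scala loop):

```scala
// Illustrating the guard discussed above (hypothetical values):
// spark.executor.cores = 3, spark.cores.max = 10.
val coresPerExecutor = 3
var coresToAssign = 10
var assigned = 0

// With the loose guard (coresToAssign > 0), the final iteration would still
// grant a full 3-core executor, assigning 12 cores and overshooting the cap.
// The corrected guard stops one executor earlier:
while (coresToAssign >= coresPerExecutor) {
  assigned += coresPerExecutor
  coresToAssign -= coresPerExecutor
}
// assigned == 9 <= 10; the one leftover core stays unallocated.
```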

Contributor Author

Good catch @andrewor14. Hopefully we've covered everything now.

@andrewor14
Contributor

Hey @nishkamravi2 the latest changes look great other than the one bug I pointed out. By the way, I think this also happens to solve SPARK-9260, which is a completely separate issue. Would you mind adding that JIRA to the title of this patch as well?

@andrewor14
Contributor

Proof that this fixes SPARK-9260:

Before: [screenshot "bad"]

After: [screenshot "good"]

@nishkamravi2
Contributor Author

Sure, thanks.

@nishkamravi2 nishkamravi2 changed the title [SPARK-8881] Fix algorithm for scheduling executors on workers [SPARK-8881][SPARK-9260] Fix algorithm for scheduling executors on workers Jul 22, 2015
@nishkamravi2
Contributor Author

Can this be retested please?

@squito
Contributor

squito commented Jul 23, 2015

Jenkins add to whitelist

@SparkQA

SparkQA commented Jul 23, 2015

Test build #1186 has finished for PR 7274 at commit b998097.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #1187 has finished for PR 7274 at commit b998097.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

LGTM, this is mergeable as is, but I will wait for some unit tests before doing so. Thanks for following up on the comments promptly @nishkamravi2.

@andrewor14
Contributor

@nishkamravi2 I'm going to go ahead and merge this patch since it's blocking development in other patches. I have written the unit tests locally and will push a PR for it immediately after this is merged. Thanks everyone for your input.

asfgit pushed a commit that referenced this pull request Jul 26, 2015
[SPARK-8881][SPARK-9260] Fix algorithm for scheduling executors on workers

The current scheduling algorithm allocates one core at a time and, in doing so, ends up ignoring spark.executor.cores. As a result, when spark.cores.max / spark.executor.cores (i.e., num_executors) < num_workers, executors are not launched and the app hangs. This PR fixes and refactors the scheduling algorithm.

andrewor14

Author: Nishkam Ravi <[email protected]>
Author: nishkamravi2 <[email protected]>

Closes #7274 from nishkamravi2/master_scheduler and squashes the following commits:

b998097 [nishkamravi2] Update Master.scala
da0f491 [Nishkam Ravi] Update Master.scala
79084e8 [Nishkam Ravi] Update Master.scala
1daf25f [Nishkam Ravi] Update Master.scala
f279cdf [Nishkam Ravi] Update Master.scala
adec84b [Nishkam Ravi] Update Master.scala
a06da76 [nishkamravi2] Update Master.scala
40c8f9f [nishkamravi2] Update Master.scala (to trigger retest)
c11c689 [nishkamravi2] Update EventLoggingListenerSuite.scala
5d6a19c [nishkamravi2] Update Master.scala (for the purpose of issuing a retest)
2d6371c [Nishkam Ravi] Update Master.scala
66362d5 [nishkamravi2] Update Master.scala
ee7cf0e [Nishkam Ravi] Improved scheduling algorithm for executors

(cherry picked from commit 41a7cdf)
Signed-off-by: Andrew Or <[email protected]>
@asfgit asfgit closed this in 41a7cdf Jul 26, 2015
@andrewor14
Contributor

#7668

@nishkamravi2
Contributor Author

Hey @andrewor14, thanks for taking care of this! Sorry, couldn't respond sooner, was out for a couple of days.
