SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors #1360

Closed
markhamstra wants to merge 4 commits into apache:master from markhamstra:SPARK-2425

Conversation

markhamstra
Contributor

Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since the last time any Executor successfully began running the Application; 2) Don't remove the Application while the Master still thinks that there are RUNNING Executors.

This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not. I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that. This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.
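(For readers skimming the thread, here is a minimal, self-contained sketch of the two changes. Names and state handling are simplified and hypothetical; the real logic lives in Master.scala and ApplicationInfo.)

```scala
import scala.collection.mutable

// Simplified model of the Master-side decision, for illustration only.
object ExecutorFailurePolicy {
  val MaxNumRetry = 10 // mirrors ApplicationState.MAX_NUM_RETRY

  sealed trait ExecState
  case object LOADING extends ExecState
  case object RUNNING extends ExecState
  case object FAILED  extends ExecState

  final class AppInfo {
    var retryCount = 0
    val executors = mutable.Map.empty[Int, ExecState]
  }

  /** Returns true if the Master should remove the application. */
  def onExecutorStateChanged(app: AppInfo, execId: Int, state: ExecState): Boolean = {
    app.executors(execId) = state
    state match {
      case RUNNING =>
        // Change 1: an Executor successfully began running the Application,
        // so reset the abnormal-exit budget instead of counting over the app's lifetime.
        app.retryCount = 0
        false
      case FAILED =>
        app.retryCount += 1
        // Change 2: even past the retry limit, keep the Application alive as long as
        // the Master still believes some Executor is RUNNING.
        app.retryCount >= MaxNumRetry && !app.executors.values.exists(_ == RUNNING)
      case _ =>
        false
    }
  }
}
```

In words: a failed exit only triggers removal once the retry budget is exhausted and no RUNNING Executor remains.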

@markhamstra
Contributor Author

@aarondav

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull

@SparkQA

SparkQA commented Jul 10, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1360:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull

@markhamstra
Contributor Author

ping

Probably too late for a 1.0.2-rc, but this should go into 1.0.3 and 1.1.0.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull

@markhamstra
Contributor Author

@pwendell Still should go into 1.1.0... The change is fairly small, and the unpatched behavior is pretty nasty for long-running applications.

@SparkQA

SparkQA commented Aug 12, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull

@SparkQA

SparkQA commented Aug 12, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 1360 at commit 206a448.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 1360 at commit 206a448.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Aug 22, 2014

Wouldn't blacklisting executors with an appropriate timeout value solve this? (For the case where the executor is still usable, but some tasks fail on that specific executor for whatever reason.)

The more general problem of detecting stragglers and/or "misbehaving" executors is something we have yet to do in Spark.

@markhamstra
Contributor Author

I'm not sure I'm following, @mridulm. The problem is not one of removing Executors, but rather of removing Applications that could and should still be left running even though some (but not all) Executors assigned to an Application are dying.

@mridulm
Contributor

mridulm commented Aug 23, 2014

@markhamstra In our cluster, this usually happens because one or more executors is in a bad state: either there is insufficient disk to finish a task, or the executor is in the process of cleaning up and exiting.
When a task fails, it usually gets re-assigned to the executor where it just failed because of the locality match. This repeated reschedule-and-fail loop on the failing executor causes the application to fail once it hits the maximum number of failed tasks for the application, or the maximum number of failures for a specific task (IIRC there are two params).

We alleviate this by setting the blacklist timeout to a suitably large value: this prevents rapid rescheduling of a failing task on the same executor, and usually some other executor picks up the task (the timeout is chosen so that this is possible).
If the executor is healthy but can't execute this specific task, then blacklisting works fine.
If the executor is unhealthy and going to exit, then we will still see rapid task failures until the executor notifies the master on exit, but the per-task failure count is not hit (IIRC the number of failed tasks allowed per application is much higher than the number of failed attempts per task).

Of course, I'm not sure whether this is completely applicable in this case.

@markhamstra
Contributor Author

@mridulm Is this blacklisting behavior a customization that you have made to Spark? If not, could you point me to where and how it is implemented?

What you are describing seems orthogonal and probably complementary to this PR: yours is a means to prevent rescheduling a task on an Executor where it cannot run successfully, while this one is a means to prevent killing a running Application when some Executors die but others are still running the Application successfully. Sounds to me like we want both.

@mridulm
Contributor

mridulm commented Aug 23, 2014

Take a look at 'spark.scheduler.executorTaskBlacklistTime' in TaskSetManager.
Since I run mostly in yarn-cluster mode, where there is only a single application, I was not sure how relevant blacklisting actually is in your case (multiple apps via standalone, I guess?).

Note that we actually need to handle a third case, which is not yet addressed: slow executors/stragglers. Particularly for low-latency stages, they really hurt execution times for some of our ML jobs (a 50x speedup becomes much, much lower because of them).
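(For context, a hedged example of how an application could set the blacklist timeout mridulm mentions; the values are illustrative, not recommendations, and `spark.task.maxFailures` is the per-task failure limit he alludes to.)

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune to your cluster.
val conf = new SparkConf()
  .setAppName("blacklist-example")
  // Milliseconds during which a failed task will not be re-offered to the executor
  // on which it just failed, so another executor can pick it up.
  .set("spark.scheduler.executorTaskBlacklistTime", "60000")
  // Per-task failure limit before the job is aborted (default 4).
  .set("spark.task.maxFailures", "8")
```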

@@ -295,28 +295,34 @@ private[spark] class Master(
    val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
    execOption match {
      case Some(exec) => {
        val appInfo = idToApp(appId)
Contributor

This will crash if idToApp doesn't already contain appId. Maybe we should log a warning instead of throwing a key not found exception.

Contributor

Ah, I realize you moved this from elsewhere. It would still be good to add a safeguard

Contributor Author

That's not necessary. If idToApp does not contain appId, then execOption will be None and we'll never end up in this case Some(exec).
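
(A tiny standalone illustration of the point, using made-up values rather than the real Master state:)

```scala
// idToApp maps appId -> (execId -> executor state); values here are made up.
val idToApp = Map("app-1" -> Map("exec-0" -> "RUNNING"))

val appId = "app-1"
val execId = "exec-0"
val execOption = idToApp.get(appId).flatMap(app => app.get(execId))

execOption match {
  case Some(exec) =>
    // Reaching this branch means idToApp.get(appId) was Some(...), so the
    // direct lookup below cannot throw a NoSuchElementException.
    val appInfo = idToApp(appId)
    println(s"executor $execId is $exec; app has ${appInfo.size} executor(s)")
  case None =>
    println(s"unknown app $appId or executor $execId; nothing to do")
}
```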

Contributor

Ah you're right

and prevent Master from removing Application with RUNNING Executors
@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have started for PR 1360 at commit f099c0b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have finished for PR 1360 at commit f099c0b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have started for PR 1360 at commit f099c0b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have finished for PR 1360 at commit f099c0b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Thanks, I merged this.

asfgit closed this in 092e2f1 on Sep 9, 2014
asfgit pushed a commit that referenced this pull request Sep 9, 2014
SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since any Executor successfully began running the Application; 2) Don't remove the Application while Master still thinks that there are RUNNING Executors.

This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not.  I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that.  This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.

Author: Mark Hamstra <[email protected]>

Closes #1360 from markhamstra/SPARK-2425 and squashes the following commits:

f099c0b [Mark Hamstra] Reuse appInfo
b2b7b25 [Mark Hamstra] Moved 'Application failed' logging
bdd0928 [Mark Hamstra] switched to string interpolation
1dd591b [Mark Hamstra] SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors
markhamstra added a commit to markhamstra/spark that referenced this pull request Sep 10, 2014
SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
asfgit pushed a commit that referenced this pull request Aug 9, 2016
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980d (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <[email protected]>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <[email protected]>
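
(A hedged sketch of how this setting would be configured on the standalone Master; the property name comes from the commit message above, while the value and the placement in spark-defaults.conf are illustrative.)

```
# conf/spark-defaults.conf on the standalone Master host (illustrative placement; the
# property can also be passed via SPARK_MASTER_OPTS as a -D system property).
# Default is 10, matching the old hardcoded ApplicationState.MAX_NUM_RETRY.
# A negative value disables application-killing on repeated executor failures entirely.
spark.deploy.maxExecutorRetries   -1
```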
asfgit pushed a commit that referenced this pull request Aug 9, 2016
asfgit pushed a commit that referenced this pull request Aug 9, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 10, 2016