SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors #1360

Closed
markhamstra wants to merge 4 commits into apache:master from markhamstra:SPARK-2425

Conversation

markhamstra
Contributor

Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since the last time any Executor successfully began running the Application; 2) Don't remove the Application while the Master still thinks that there are RUNNING Executors.

This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not. I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that. This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.
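(For readers skimming the thread, here is a minimal, self-contained sketch of the two changes. Names and state handling are simplified and hypothetical; the real logic lives in Master.scala and ApplicationInfo.)

```scala
import scala.collection.mutable

// Simplified model of the Master-side decision, for illustration only.
object ExecutorFailurePolicy {
  val MaxNumRetry = 10 // mirrors ApplicationState.MAX_NUM_RETRY

  sealed trait ExecState
  case object LOADING extends ExecState
  case object RUNNING extends ExecState
  case object FAILED  extends ExecState

  final class AppInfo {
    var retryCount = 0
    val executors = mutable.Map.empty[Int, ExecState]
  }

  /** Returns true if the Master should remove the application. */
  def onExecutorStateChanged(app: AppInfo, execId: Int, state: ExecState): Boolean = {
    app.executors(execId) = state
    state match {
      case RUNNING =>
        // Change 1: an Executor successfully began running the Application,
        // so reset the abnormal-exit budget instead of counting over the app's lifetime.
        app.retryCount = 0
        false
      case FAILED =>
        app.retryCount += 1
        // Change 2: even past the retry limit, keep the Application alive as long as
        // the Master still believes some Executor is RUNNING.
        app.retryCount >= MaxNumRetry && !app.executors.values.exists(_ == RUNNING)
      case _ =>
        false
    }
  }
}
```

In words: a failed exit only triggers removal once the retry budget is exhausted and no RUNNING Executor remains.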

@markhamstra
Contributor Author

@aarondav

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 10, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull

@SparkQA

SparkQA commented Jul 10, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/consoleFull

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16508/

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1360:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16571/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull

@SparkQA

SparkQA commented Jul 13, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull

@markhamstra
Contributor Author

ping

Probably too late for a 1.0.2-rc, but this should go into 1.0.3 and 1.1.0.

@SparkQA

SparkQA commented Aug 1, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull

@SparkQA

SparkQA commented Aug 1, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17686/consoleFull

@markhamstra
Contributor Author

@pwendell Still should go into 1.1.0... The change is fairly small, and the unpatched behavior is pretty nasty for long-running applications.

@SparkQA

SparkQA commented Aug 12, 2014

QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull

@SparkQA

SparkQA commented Aug 12, 2014

QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18382/consoleFull

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 1360 at commit 206a448.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 1360 at commit 206a448.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Aug 22, 2014

Wouldn't blacklisting executors with an appropriate timeout value solve this? (For the case where the executor is still usable, but some tasks fail on that specific executor for whatever reason.)

The more general problem of detecting stragglers and/or "misbehaving" executors is something we have yet to do in Spark.

@markhamstra
Contributor Author

I'm not sure I'm following, @mridulm. The problem is not one of removing Executors, but rather of removing Applications that could and should still be left running even though some (but not all) Executors assigned to an Application are dying.

@mridulm
Contributor

mridulm commented Aug 23, 2014

@markhamstra In our cluster, this usually happens because one or more executors is in a bad state: either there is insufficient disk to finish a task, or the executor is in the process of cleaning up and exiting.
When a task fails, it usually gets re-assigned to the executor where it just failed because of the locality match. This repeated reschedule-and-fail loop on the failing executor causes the application to fail once it hits the maximum number of failed tasks for the application, or the maximum number of failures for a specific task (IIRC there are two params).

We alleviate this by setting the blacklist timeout to a suitably large value: this prevents rapid rescheduling of a failing task on the same executor, and usually some other executor picks up the task (the timeout is chosen so that this is possible).
If the executor is healthy but can't execute this specific task, then blacklisting works fine.
If the executor is unhealthy and going to exit, then we will still see rapid task failures until the executor notifies the master on exit, but the per-task failure count is not hit (IIRC the number of failed tasks allowed per application is much higher than the number of failed attempts per task).

Of course, I'm not sure whether this is completely applicable in this case.

@markhamstra
Contributor Author

@mridulm Is this blacklisting behavior a customization that you have made to Spark? If not, could you point me to where and how it is implemented?

What you are describing seems orthogonal and probably complementary to this PR: yours is a means to prevent rescheduling a task on an Executor where it cannot run successfully, while this one is a means to prevent killing a running Application when some Executors die but others are still running the Application successfully. Sounds to me like we want both.

@mridulm
Contributor

mridulm commented Aug 23, 2014

Take a look at 'spark.scheduler.executorTaskBlacklistTime' in TaskSetManager.
Since I run mostly in yarn-cluster mode, where there is only a single application, I was not sure how relevant blacklisting actually is in your case (multiple apps via standalone, I guess?).

Note that we actually need to handle a third case, which is not yet addressed: slow executors/stragglers. Particularly for low-latency stages, they really hurt execution times for some of our ML jobs (a 50x speedup becomes much, much lower because of them).
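(For context, a hedged example of how an application could set the blacklist timeout mridulm mentions; the values are illustrative, not recommendations, and `spark.task.maxFailures` is the per-task failure limit he alludes to.)

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune to your cluster.
val conf = new SparkConf()
  .setAppName("blacklist-example")
  // Milliseconds during which a failed task will not be re-offered to the executor
  // on which it just failed, so another executor can pick it up.
  .set("spark.scheduler.executorTaskBlacklistTime", "60000")
  // Per-task failure limit before the job is aborted (default 4).
  .set("spark.task.maxFailures", "8")
```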

@@ -295,28 +295,34 @@ private[spark] class Master(
    val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
    execOption match {
      case Some(exec) => {
        val appInfo = idToApp(appId)
Contributor

This will crash if idToApp doesn't already contain appId. Maybe we should log a warning instead of throwing a key not found exception.

Contributor

Ah, I realize you moved this from elsewhere. It would still be good to add a safeguard

Contributor Author

That's not necessary. If idToApp does not contain appId, then execOption will be None and we'll never end up in this case Some(exec).
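
(A tiny standalone illustration of the point, using made-up values rather than the real Master state:)

```scala
// idToApp maps appId -> (execId -> executor state); values here are made up.
val idToApp = Map("app-1" -> Map("exec-0" -> "RUNNING"))

val appId = "app-1"
val execId = "exec-0"
val execOption = idToApp.get(appId).flatMap(app => app.get(execId))

execOption match {
  case Some(exec) =>
    // Reaching this branch means idToApp.get(appId) was Some(...), so the
    // direct lookup below cannot throw a NoSuchElementException.
    val appInfo = idToApp(appId)
    println(s"executor $execId is $exec; app has ${appInfo.size} executor(s)")
  case None =>
    println(s"unknown app $appId or executor $execId; nothing to do")
}
```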

Contributor

Ah you're right

and prevent Master from removing Application with RUNNING Executors
@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have started for PR 1360 at commit f099c0b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have finished for PR 1360 at commit f099c0b.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have started for PR 1360 at commit f099c0b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have finished for PR 1360 at commit f099c0b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Thanks, I merged this.

asfgit closed this in 092e2f1 on Sep 9, 2014
asfgit pushed a commit that referenced this pull request Sep 9, 2014
SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since any Executor successfully began running the Application; 2) Don't remove the Application while Master still thinks that there are RUNNING Executors.

This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not.  I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that.  This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.

Author: Mark Hamstra <[email protected]>

Closes #1360 from markhamstra/SPARK-2425 and squashes the following commits:

f099c0b [Mark Hamstra] Reuse appInfo
b2b7b25 [Mark Hamstra] Moved 'Application failed' logging
bdd0928 [Mark Hamstra] switched to string interpolation
1dd591b [Mark Hamstra] SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors
markhamstra added a commit to markhamstra/spark that referenced this pull request Sep 10, 2014
SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
asfgit pushed a commit that referenced this pull request Aug 9, 2016
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980d (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <[email protected]>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <[email protected]>
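
(A hedged sketch of how this setting would be configured on the standalone Master; the property name comes from the commit message above, while the value and the placement in spark-defaults.conf are illustrative.)

```
# conf/spark-defaults.conf on the standalone Master host (illustrative placement; the
# property can also be passed via SPARK_MASTER_OPTS as a -D system property).
# Default is 10, matching the old hardcoded ApplicationState.MAX_NUM_RETRY.
# A negative value disables application-killing on repeated executor failures entirely.
spark.deploy.maxExecutorRetries   -1
```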
asfgit pushed a commit that referenced this pull request Aug 9, 2016
asfgit pushed a commit that referenced this pull request Aug 9, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 10, 2016