Skip to content

[SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor #3550

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

markhamstra
Copy link
Contributor

The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

@JoshRosen

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24033 has started for PR 3550 at commit 8f543b1.

  • This patch merges cleanly.

@JoshRosen
Copy link
Contributor

I considered something like this, but I think that this re-introduces cases where a single bad host can cause the entire application to fail. Imagine that I have a cluster where all but one of the hosts are functioning correctly; I'll register executors on the good hosts once at the beginning of the app and can then experience an infinite number of executor launch failures on the buggy host since we don't have a blacklist. So, we might have a case where the application is able to make progress with the executors that it has but is killed due to failed attempts to acquire more executors, since all of the resets/decrements to the "progress towards failure" counter only occurred at the beginning of the app, while the increments occurred continuously.

@markhamstra
Copy link
Contributor Author

The application won't be killed if an executor has been recognized by master as RUNNING (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L328). The buggy host will just keep trying and failing to launch executors.

Detecting and blacklisting buggy hosts seems like a separable and complex issue. It would also be a new feature that maybe we don't want to add to 1.2 at the last minute.

@SparkQA
Copy link

SparkQA commented Dec 2, 2014

Test build #24033 has finished for PR 3550 at commit 8f543b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24033/
Test PASSed.

@andrewor14
Copy link
Contributor

Yeah, this seems safe to me. Even if the Master doesn't know that the driver has exited for some reason (i.e. if the finishApplication was somehow not triggered from the DisassociatedEvent here), this will still fail the application correctly because all existing executors will have exited and new executors will fail immediately because they can't connect to the driver. I agree that this is a better fix for 1.2. Eventually it would be good to get Josh's changes in too because it's easier to test things there. LGTM

@markhamstra
Copy link
Contributor Author

It's worth spending a little time checking that any executors that are RUNNING for an application definitely will transition to a Finished state and be removed from the master's accounting if the application dies. If we are certain that all the running executors will finish after application death and that repeatedly failing executors from a bad node while a running executor remains on master's books will not progressively consume resources, then I think this PR solves the problems. The only sort-of negative that I am seeing is that there can be an arbitrarily large number of failed executor launch attempts while at least one executor remains running, which will at least fill up error logs; but that is arguably not an all bad thing and is something whose proper resolution can be better handled (at least for now) by a system administrator than by an attempt to automate resolution.

@JoshRosen
Copy link
Contributor

One idea for testing this: comment out the line in the DisassociationEvent handler that removes the application then check that a killed application is eventually removed via this mechanism.

@JoshRosen
Copy link
Contributor

I tested this locally by commenting out the DisassociationEvent application cleanup logic and can confirm that my exited driver's application eventually experienced enough failures to cause the application to be removed.

This fix looks good to me (and it's been tested externally, too), so I'm going to merge this commit into master and branch-1.2. Thanks!

@JoshRosen
Copy link
Contributor

Oh, and I'll also cherry pick to branch-1.1.

@asfgit asfgit closed this in 96b2785 Dec 3, 2014
asfgit pushed a commit that referenced this pull request Dec 3, 2014
…ver adds Executor

The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master.  Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

Author: Mark Hamstra <[email protected]>

Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:

8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
asfgit pushed a commit that referenced this pull request Dec 3, 2014
…ver adds Executor

The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master.  Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

Author: Mark Hamstra <[email protected]>

Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:

8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants