[SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor #3550

markhamstra · 2014-12-02T07:41:07Z

The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to the master. Otherwise, appInfo.resetRetryCount() is never called, and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, which results in the application being removed from the master's accounting.

@JoshRosen
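To make the mechanism in the description concrete, here is a minimal toy model in Python. It is not Spark's code: the `AppInfo` class, both helper functions, and the threshold value of 10 are assumptions made for illustration. The model also leaves out any check for other executors that are still running. It only shows that the retry count can reach the limit when no live driver ever reports an executor as RUNNING.

```python
MAX_NUM_RETRY = 10  # assumed threshold for this sketch


class AppInfo:
    """Hypothetical stand-in for the master's per-application bookkeeping."""

    def __init__(self):
        self.retry_count = 0
        self.removed = False

    def reset_retry_count(self):
        self.retry_count = 0

    def increment_retry_count(self):
        self.retry_count += 1
        return self.retry_count


def launch_failing_executor(app, driver_alive):
    """Model one executor launch that ends in an abnormal exit.

    The executor is marked RUNNING, which resets the retry count, only when
    a live driver reports it to the master.
    """
    if driver_alive:
        app.reset_retry_count()
    if app.increment_retry_count() >= MAX_NUM_RETRY:
        app.removed = True


def launches_until_removed(driver_alive, max_launches=100):
    """Return how many launches it took to remove the app, or None if never."""
    app = AppInfo()
    for attempt in range(1, max_launches + 1):
        launch_failing_executor(app, driver_alive)
        if app.removed:
            return attempt
    return None


print(launches_until_removed(driver_alive=False))
print(launches_until_removed(driver_alive=True))
```

In this model an application whose driver has exited is removed after `MAX_NUM_RETRY` failed launches, while an application with a live driver keeps resetting its count and is never removed by this path.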


SparkQA · 2014-12-02T07:44:59Z

Test build #24033 has started for PR 3550 at commit 8f543b1.

This patch merges cleanly.

JoshRosen · 2014-12-02T07:53:26Z

I considered something like this, but I think that this re-introduces cases where a single bad host can cause the entire application to fail. Imagine that I have a cluster where all but one of the hosts are functioning correctly; I'll register executors on the good hosts once at the beginning of the app and can then experience an infinite number of executor launch failures on the buggy host since we don't have a blacklist. So, we might have a case where the application is able to make progress with the executors that it has but is killed due to failed attempts to acquire more executors, since all of the resets/decrements to the "progress towards failure" counter only occurred at the beginning of the app, while the increments occurred continuously.

markhamstra · 2014-12-02T08:02:07Z

The application won't be killed if an executor has been recognized by master as RUNNING (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L328). The buggy host will just keep trying and failing to launch executors.

Detecting and blacklisting buggy hosts seems like a separable and complex issue. It would also be a new feature that maybe we don't want to add to 1.2 at the last minute.
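The disagreement above comes down to one condition: whether exhausted retries alone remove the application, or whether the master also requires that no executor is RUNNING. The Python sketch below models the second reading, which is the one described in the reply. The function names, the executor id, and the threshold of 10 are hypothetical and are not taken from Master.scala.

```python
MAX_NUM_RETRY = 10  # assumed threshold for this sketch


def should_remove_app(retry_count, executor_states):
    """Remove the app only if retries are exhausted AND no executor is RUNNING."""
    retries_exhausted = retry_count >= MAX_NUM_RETRY
    any_running = any(state == "RUNNING" for state in executor_states.values())
    return retries_exhausted and not any_running


def simulate_bad_host(failures, healthy_executor_running):
    """Feed repeated launch failures from one bad host into the check above."""
    states = {"exec-on-good-host": "RUNNING"} if healthy_executor_running else {}
    retry_count = 0
    for _ in range(failures):
        retry_count += 1  # each abnormal exit counts against the application
        if should_remove_app(retry_count, states):
            return True
    return False


print(simulate_bad_host(1000, healthy_executor_running=True))
print(simulate_bad_host(1000, healthy_executor_running=False))
```

Under these assumptions, a single executor that stays RUNNING on a good host keeps the application alive no matter how many launches fail on the bad host. Once no executor is RUNNING, the exhausted retry count removes the application.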

SparkQA · 2014-12-02T09:04:32Z

Test build #24033 has finished for PR 3550 at commit 8f543b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-02T09:04:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24033/
Test PASSed.

andrewor14 · 2014-12-02T16:58:12Z

Yeah, this seems safe to me. Even if the Master doesn't know that the driver has exited for some reason (e.g. if finishApplication was somehow not triggered from the DisassociatedEvent handler), this will still fail the application correctly, because all existing executors will have exited and new executors will fail immediately since they can't connect to the driver. I agree that this is a better fix for 1.2. Eventually it would be good to get Josh's changes in too, because it's easier to test things there. LGTM

markhamstra · 2014-12-02T17:28:11Z

It's worth spending a little time checking that any executors that are RUNNING for an application will definitely transition to a finished state and be removed from the master's accounting if the application dies. If we are certain that all the running executors will finish after application death, and that repeatedly failing executors from a bad node will not progressively consume resources while a running executor remains on the master's books, then I think this PR solves the problems. The only sort-of negative I see is that there can be an arbitrarily large number of failed executor launch attempts while at least one executor remains running, which will at least fill up error logs. That is arguably not entirely a bad thing, and it is something that (at least for now) a system administrator can resolve better than an automated mechanism.

JoshRosen · 2014-12-02T21:23:57Z

One idea for testing this: comment out the line in the DisassociatedEvent handler that removes the application, then check that a killed application is eventually removed via this mechanism.

JoshRosen · 2014-12-03T23:07:40Z

I tested this locally by commenting out the application cleanup logic in the DisassociatedEvent handler, and I can confirm that my exited driver's application eventually experienced enough executor failures to cause the application to be removed.

This fix looks good to me (and it's been tested externally, too), so I'm going to merge this commit into master and branch-1.2. Thanks!

JoshRosen · 2014-12-03T23:09:00Z

Oh, and I'll also cherry pick to branch-1.1.

…ver adds Executor

The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

Author: Mark Hamstra <[email protected]>

Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:

8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver

Don't transition ExecutorInfo to RUNNING until Executor is added by Driver

8f543b1

markhamstra mentioned this pull request Dec 2, 2014

[SPARK-4498][SPARK-2424] [WIP] Add driver -> master heartbeat to detect exited applications and fix executor failure detection logic #3548

Closed

asfgit closed this in 96b2785 Dec 3, 2014
