-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor #3550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #24033 has started for PR 3550 at commit
|
I considered something like this, but I think that this re-introduces cases where a single bad host can cause the entire application to fail. Imagine that I have a cluster where all but one of the hosts are functioning correctly; I'll register executors on the good hosts once at the beginning of the app and can then experience an infinite number of executor launch failures on the buggy host since we don't have a blacklist. So, we might have a case where the application is able to make progress with the executors that it has but is killed due to failed attempts to acquire more executors, since all of the resets/decrements to the "progress towards failure" counter only occurred at the beginning of the app, while the increments occurred continuously. |
The application won't be killed if an executor has been recognized by master as RUNNING (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L328). The buggy host will just keep trying and failing to launch executors. Detecting and blacklisting buggy hosts seems like a separable and complex issue. It would also be a new feature that maybe we don't want to add to 1.2 at the last minute. |
Test build #24033 has finished for PR 3550 at commit
|
Test PASSed. |
Yeah, this seems safe to me. Even if the Master doesn't know that the driver has exited for some reason (i.e. if the |
It's worth spending a little time checking that any executors that are RUNNING for an application definitely will transition to a Finished state and be removed from the master's accounting if the application dies. If we are certain that all the running executors will finish after application death and that repeatedly failing executors from a bad node while a running executor remains on master's books will not progressively consume resources, then I think this PR solves the problems. The only sort-of negative that I am seeing is that there can be an arbitrarily large number of failed executor launch attempts while at least one executor remains running, which will at least fill up error logs; but that is arguably not an all bad thing and is something whose proper resolution can be better handled (at least for now) by a system administrator than by an attempt to automate resolution. |
One idea for testing this: comment out the line in the |
I tested this locally by commenting out the This fix looks good to me (and it's been tested externally, too), so I'm going to merge this commit into |
Oh, and I'll also cherry pick to |
…ver adds Executor The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting. Author: Mark Hamstra <[email protected]> Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits: 8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
…ver adds Executor The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting. Author: Mark Hamstra <[email protected]> Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits: 8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.
@JoshRosen