
Commit 44115e9

[SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980d (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running, then the Spark master would kill that application. This behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances, then it is possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <[email protected]>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <[email protected]>
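To make the new behavior concrete, the following is a minimal Scala sketch of the policy described above. The object and method names (`RetryPolicy`, `shouldRemoveApplication`) are made up for illustration; only the property name, its default of 10, and the "negative value disables removal" rule come from the patch itself.

```scala
// Illustrative sketch only; not the actual Master internals.
import org.apache.spark.SparkConf

object RetryPolicy {
  // The master resolves the limit from its own configuration; the default of 10
  // matches the previously hardcoded ApplicationState.MAX_NUM_RETRY constant.
  def maxExecutorRetries(conf: SparkConf): Int =
    conf.getInt("spark.deploy.maxExecutorRetries", 10)

  // An application is only a removal candidate after an abnormal executor exit,
  // once the back-to-back failure count reaches the limit, and only if the limit
  // is non-negative (a negative value disables removal entirely) and no executor
  // is currently running.
  def shouldRemoveApplication(
      normalExit: Boolean,
      consecutiveFailures: Int,
      maxRetries: Int,
      anyExecutorRunning: Boolean): Boolean = {
    !normalExit &&
      maxRetries >= 0 &&
      consecutiveFailures >= maxRetries &&
      !anyExecutorRunning
  }
}
```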
1 parent: 41d9dca

File tree: 3 files changed (+21, -3 lines)

core/src/main/scala/org/apache/spark/deploy/master/ApplicationState.scala (0 additions, 2 deletions)

@@ -22,6 +22,4 @@ private[master] object ApplicationState extends Enumeration {
   type ApplicationState = Value
 
   val WAITING, RUNNING, FINISHED, FAILED, KILLED, UNKNOWN = Value
-
-  val MAX_NUM_RETRY = 10
 }

core/src/main/scala/org/apache/spark/deploy/master/Master.scala (6 additions, 1 deletion)

@@ -58,6 +58,7 @@ private[deploy] class Master(
   private val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
   private val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
   private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
+  private val MAX_EXECUTOR_RETRIES = conf.getInt("spark.deploy.maxExecutorRetries", 10)
 
   val workers = new HashSet[WorkerInfo]
   val idToApp = new HashMap[String, ApplicationInfo]

@@ -265,7 +266,11 @@ private[deploy] class Master(
 
             val normalExit = exitStatus == Some(0)
             // Only retry certain number of times so we don't go into an infinite loop.
-            if (!normalExit && appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY) {
+            // Important note: this code path is not exercised by tests, so be very careful when
+            // changing this `if` condition.
+            if (!normalExit
+                && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
+                && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
               val execs = appInfo.executors.values
               if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                 logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +

docs/spark-standalone.md (15 additions, 0 deletions)

@@ -195,6 +195,21 @@ SPARK_MASTER_OPTS supports the following system properties:
     the whole cluster by default. <br/>
   </td>
 </tr>
+<tr>
+  <td><code>spark.deploy.maxExecutorRetries</code></td>
+  <td>10</td>
+  <td>
+    Limit on the maximum number of back-to-back executor failures that can occur before the
+    standalone cluster manager removes a faulty application. An application will never be removed
+    if it has any running executors. If an application experiences more than
+    <code>spark.deploy.maxExecutorRetries</code> failures in a row, no executors
+    successfully start running in between those failures, and the application has no running
+    executors then the standalone cluster manager will remove the application and mark it as failed.
+    To disable this automatic removal, set <code>spark.deploy.maxExecutorRetries</code> to
+    <code>-1</code>.
+    <br/>
+  </td>
+</tr>
 <tr>
   <td><code>spark.worker.timeout</code></td>
   <td>60</td>
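The "in a row" wording in the new documentation is doing real work: per the text above, a successful executor launch breaks the failure streak. Below is a small Scala simulation of that semantics; it is illustrative only, with the reset mechanics paraphrased from the documentation text rather than taken from this diff.

```scala
// Toy simulation of the documented semantics: failures accumulate back to back,
// a successfully running executor resets the streak, and removal only becomes a
// candidate once the streak reaches the limit. Illustrative only.
object MaxRetrySemanticsDemo extends App {
  val maxExecutorRetries = 3
  var consecutiveFailures = 0

  // true = executor reached RUNNING, false = executor failed
  val events = Seq(false, false, true, false, false, false)

  for (executorRan <- events) {
    if (executorRan) consecutiveFailures = 0
    else consecutiveFailures += 1

    val removalCandidate = maxExecutorRetries >= 0 && consecutiveFailures >= maxExecutorRetries
    println(s"failuresInARow=$consecutiveFailures removalCandidate=$removalCandidate")
  }
  // The two failures before the successful launch never trigger removal; only the
  // later streak that reaches the limit does, and in the real Master removal also
  // requires that the application has no running executors at that point.
}
```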
