
Commit fb98488

Clean up and simplify Spark configuration
Over time, as we've added more deployment modes, this has gotten a bit unwieldy with user-facing configuration options in Spark. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch, but it makes the following improvements:

1. Improved `spark-env.sh.template`, which was missing a lot of things users now set in that file.
2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces the config variables spark.executor.extraJavaOptions, spark.executor.extraLibraryPath, and spark.executor.extraClassPath.
3. Adds the ability to set these same variables for the driver using `spark-submit`.
4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This allows setting both SparkConf options and other system properties utilized by `spark-submit`.
5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.

Author: Patrick Wendell <[email protected]>

Closes #299 from pwendell/config-cleanup and squashes the following commits:

127f301 [Patrick Wendell] Improvements to testing
a006464 [Patrick Wendell] Moving properties file template.
b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
0086939 [Patrick Wendell] Minor style fixes
af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
af0adf7 [Patrick Wendell] Automatically add user jar
a56b125 [Patrick Wendell] Responses to Tom's review
d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a762901 [Patrick Wendell] Fixing test failures
ffa00fe [Patrick Wendell] Review feedback
fda0301 [Patrick Wendell] Note
308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
c2a2909 [Patrick Wendell] Test compile fixes
4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
b08893b [Patrick Wendell] Additional improvements.
ace4ead [Patrick Wendell] Responses to review feedback.
b72d183 [Patrick Wendell] Review feedback for spark env file
46555c1 [Patrick Wendell] Review feedback and import clean-ups
437aed1 [Patrick Wendell] Small fix
761ebcd [Patrick Wendell] Library path and classpath for drivers
7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
5b0ba8e [Patrick Wendell] Don't ship executor envs
84cc5e5 [Patrick Wendell] Small clean-up
1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
6eaf7d0 [Patrick Wendell] executorJavaOpts
0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
1 parent 3a390bf commit fb98488
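As a rough sketch of the configuration surface this patch introduces (the property keys are the ones added here; the master URL, app name, and values are placeholders), the executor JVM options, classpath, and library path become ordinary SparkConf properties rather than environment variables shipped to the cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder application showing the new executor-side properties from this patch.
val conf = new SparkConf()
  .setMaster("local[2]")                    // stand-in; normally supplied by spark-submit
  .setAppName("ConfigCleanupSketch")
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")   // JVM flags for executors
  .set("spark.executor.extraClassPath", "/opt/deps/extra.jar")     // appended to executor classpath
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")       // native library path for executors

val sc = new SparkContext(conf)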


44 files changed (+886, −401 lines)

.rat-excludes

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ RELEASE
 control
 docs
 fairscheduler.xml.template
+spark-defaults.conf.template
 log4j.properties
 log4j.properties.template
 metrics.properties.template

bin/run-example

Lines changed: 0 additions & 1 deletion
@@ -75,7 +75,6 @@ fi
 
 # Set JAVA_OPTS to be able to load native libraries and to set heap size
 JAVA_OPTS="$SPARK_JAVA_OPTS"
-JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
 # Load extra JAVA_OPTS from conf/java-opts, if it exists
 if [ -e "$FWDIR/conf/java-opts" ] ; then
   JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`"

bin/spark-class

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ fi
 
 # Set JAVA_OPTS to be able to load native libraries and to set heap size
 JAVA_OPTS="$OUR_JAVA_OPTS"
-JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
+JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$_SPARK_LIBRARY_PATH"
 JAVA_OPTS="$JAVA_OPTS -Xms$OUR_JAVA_MEM -Xmx$OUR_JAVA_MEM"
 # Load extra JAVA_OPTS from conf/java-opts, if it exists
 if [ -e "$FWDIR/conf/java-opts" ] ; then

bin/spark-submit

Lines changed: 6 additions & 1 deletion
@@ -25,8 +25,13 @@ while (($#)); do
     DEPLOY_MODE=$2
   elif [ $1 = "--driver-memory" ]; then
     DRIVER_MEMORY=$2
+  elif [ $1 = "--driver-library-path" ]; then
+    export _SPARK_LIBRARY_PATH=$2
+  elif [ $1 = "--driver-class-path" ]; then
+    export SPARK_CLASSPATH="$SPARK_CLASSPATH:$2"
+  elif [ $1 = "--driver-java-options" ]; then
+    export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS $2"
   fi
-
   shift
 done
 

conf/spark-defaults.conf.template

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# Default system properties included when running spark-submit.
+# This is useful for setting default environmental settings.
+
+# Example:
+# spark.master            spark://master:7077
+# spark.eventLog.enabled  true
+# spark.eventLog.dir      hdfs://namenode:8021/directory
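The template above only fixes the file format; the loading logic lives in spark-submit and is not shown in this excerpt. A minimal sketch of the idea, assuming the file is parsed as ordinary Java properties with whitespace-separated keys and values (the helper name is made up and not part of this patch):

import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Minimal sketch: read key/value pairs from a defaults file and expose any key starting
// with "spark." as a system property, so a later `new SparkConf()` picks it up.
// The real spark-submit logic may differ.
def loadDefaults(path: String): Unit = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.stringPropertyNames().asScala
    .filter(_.startsWith("spark."))
    .foreach { k => sys.props.getOrElseUpdate(k, props.getProperty(k)) }
}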

conf/spark-env.sh.template

Lines changed: 31 additions & 12 deletions
@@ -1,22 +1,41 @@
 #!/usr/bin/env bash
 
-# This file contains environment variables required to run Spark. Copy it as
-# spark-env.sh and edit that to configure Spark for your site.
-#
-# The following variables can be set in this file:
+# This file is sourced when running various Spark programs.
+# Copy it as spark-env.sh and edit that to configure Spark for your site.
+
+# Options read when launching programs locally with
+# ./bin/run-example or ./bin/spark-submit
+# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
+# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
+# - SPARK_CLASSPATH, default classpath entries to append
+
+# Options read by executors and drivers running inside the cluster
 # - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
+# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
+# - SPARK_CLASSPATH, default classpath entries to append
+# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
 # - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos
-# - SPARK_JAVA_OPTS, to set node-specific JVM options for Spark. Note that
-#   we recommend setting app-wide options in the application's driver program.
-#     Examples of node-specific options : -Dspark.local.dir, GC options
-#     Examples of app-wide options : -Dspark.serializer
-#
-# If using the standalone deploy mode, you can also set variables for it here:
+
+# Options read in YARN client mode
+# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
+# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
+# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
+# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
+# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
+# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: 'default')
+# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
+# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.
+
+# Options for the daemons used in the standalone deploy mode:
 # - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
 # - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
+# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
-# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
+# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
 # - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
 # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
 # - SPARK_WORKER_DIR, to set the working directory of worker processes
-# - SPARK_PUBLIC_DNS, to set the public dns name of the master
+# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
+# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
+# - SPARK_DAEMON_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
+# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

core/src/main/scala/org/apache/spark/SparkConf.scala

Lines changed: 76 additions & 0 deletions
@@ -208,6 +208,82 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
     new SparkConf(false).setAll(settings)
   }
 
+  /** Checks for illegal or deprecated config settings. Throws an exception for the former. Not
+    * idempotent - may mutate this conf object to convert deprecated settings to supported ones. */
+  private[spark] def validateSettings() {
+    if (settings.contains("spark.local.dir")) {
+      val msg = "In Spark 1.0 and later spark.local.dir will be overridden by the value set by " +
+        "the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN)."
+      logWarning(msg)
+    }
+
+    val executorOptsKey = "spark.executor.extraJavaOptions"
+    val executorClasspathKey = "spark.executor.extraClassPath"
+    val driverOptsKey = "spark.driver.extraJavaOptions"
+    val driverClassPathKey = "spark.driver.extraClassPath"
+
+    // Validate spark.executor.extraJavaOptions
+    settings.get(executorOptsKey).map { javaOpts =>
+      if (javaOpts.contains("-Dspark")) {
+        val msg = s"$executorOptsKey is not allowed to set Spark options (was '$javaOpts)'. " +
+          "Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit."
+        throw new Exception(msg)
+      }
+      if (javaOpts.contains("-Xmx") || javaOpts.contains("-Xms")) {
+        val msg = s"$executorOptsKey is not allowed to alter memory settings (was '$javaOpts'). " +
+          "Use spark.executor.memory instead."
+        throw new Exception(msg)
+      }
+    }
+
+    // Check for legacy configs
+    sys.env.get("SPARK_JAVA_OPTS").foreach { value =>
+      val error =
+        s"""
+          |SPARK_JAVA_OPTS was detected (set to '$value').
+          |This has undefined behavior when running on a cluster and is deprecated in Spark 1.0+.
+          |
+          |Please instead use:
+          | - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
+          | - ./spark-submit with --driver-java-options to set -X options for a driver
+          | - spark.executor.extraJavaOptions to set -X options for executors
+          | - SPARK_DAEMON_OPTS to set java options for standalone daemons (i.e. master, worker)
+        """.stripMargin
+      logError(error)
+
+      for (key <- Seq(executorOptsKey, driverOptsKey)) {
+        if (getOption(key).isDefined) {
+          throw new SparkException(s"Found both $key and SPARK_JAVA_OPTS. Use only the former.")
+        } else {
+          logWarning(s"Setting '$key' to '$value' as a work-around.")
+          set(key, value)
+        }
+      }
+    }
+
+    sys.env.get("SPARK_CLASSPATH").foreach { value =>
+      val error =
+        s"""
+          |SPARK_CLASSPATH was detected (set to '$value').
+          | This has undefined behavior when running on a cluster and is deprecated in Spark 1.0+.
+          |
+          |Please instead use:
+          | - ./spark-submit with --driver-class-path to augment the driver classpath
+          | - spark.executor.extraClassPath to augment the executor classpath
+        """.stripMargin
+      logError(error)
+
+      for (key <- Seq(executorClasspathKey, driverClassPathKey)) {
+        if (getOption(key).isDefined) {
+          throw new SparkException(s"Found both $key and SPARK_CLASSPATH. Use only the former.")
+        } else {
+          logWarning(s"Setting '$key' to '$value' as a work-around.")
+          set(key, value)
+        }
+      }
+    }
+  }
+
   /**
    * Return a string listing all keys and values, one per line. This is useful to print the
    * configuration out for debugging.
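A hedged usage sketch of the new validation (validateSettings is private[spark], so this only compiles from inside the org.apache.spark package, e.g. in a test; the option values are illustrative only):

package org.apache.spark

// Illustrative-only check of the new validation behavior added by this patch.
object ValidateSettingsSketch extends App {
  val ok = new SparkConf(loadDefaults = false)
    .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
  ok.validateSettings() // fine: no Spark options, no -Xms/-Xmx flags

  val bad = new SparkConf(loadDefaults = false)
    .set("spark.executor.extraJavaOptions", "-Xmx4g")
  try {
    bad.validateSettings() // throws: heap size must be set via spark.executor.memory
  } catch {
    case e: Exception => println(e.getMessage)
  }
}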

core/src/main/scala/org/apache/spark/SparkContext.scala

Lines changed: 20 additions & 17 deletions
@@ -148,6 +148,7 @@ class SparkContext(config: SparkConf) extends Logging {
     this(master, appName, sparkHome, jars, Map(), Map())
 
   private[spark] val conf = config.clone()
+  conf.validateSettings()
 
   /**
    * Return a copy of this SparkContext's configuration. The configuration ''cannot'' be
@@ -159,7 +160,7 @@ class SparkContext(config: SparkConf) extends Logging {
     throw new SparkException("A master URL must be set in your configuration")
   }
   if (!conf.contains("spark.app.name")) {
-    throw new SparkException("An application must be set in your configuration")
+    throw new SparkException("An application name must be set in your configuration")
   }
 
   if (conf.getBoolean("spark.logConf", false)) {
@@ -170,11 +171,11 @@ class SparkContext(config: SparkConf) extends Logging {
   conf.setIfMissing("spark.driver.host", Utils.localHostName())
   conf.setIfMissing("spark.driver.port", "0")
 
-  val jars: Seq[String] = if (conf.contains("spark.jars")) {
-    conf.get("spark.jars").split(",").filter(_.size != 0)
-  } else {
-    null
-  }
+  val jars: Seq[String] =
+    conf.getOption("spark.jars").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
+
+  val files: Seq[String] =
+    conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
 
   val master = conf.get("spark.master")
   val appName = conf.get("spark.app.name")
@@ -235,6 +236,10 @@ class SparkContext(config: SparkConf) extends Logging {
     jars.foreach(addJar)
   }
 
+  if (files != null) {
+    files.foreach(addFile)
+  }
+
   private def warnSparkMem(value: String): String = {
     logWarning("Using SPARK_MEM to set amount of memory to use per executor process is " +
       "deprecated, please use spark.executor.memory instead.")
@@ -247,30 +252,28 @@ class SparkContext(config: SparkConf) extends Logging {
     .map(Utils.memoryStringToMb)
     .getOrElse(512)
 
-  // Environment variables to pass to our executors
-  private[spark] val executorEnvs = HashMap[String, String]()
-  for (key <- Seq("SPARK_CLASSPATH", "SPARK_LIBRARY_PATH", "SPARK_JAVA_OPTS");
-      value <- Option(System.getenv(key))) {
-    executorEnvs(key) = value
-  }
+  // Environment variables to pass to our executors.
+  // NOTE: This should only be used for test related settings.
+  private[spark] val testExecutorEnvs = HashMap[String, String]()
 
   // Convert java options to env vars as a work around
   // since we can't set env vars directly in sbt.
-  for { (envKey, propKey) <- Seq(("SPARK_HOME", "spark.home"), ("SPARK_TESTING", "spark.testing"))
+  for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
       value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
-    executorEnvs(envKey) = value
+    testExecutorEnvs(envKey) = value
   }
   // The Mesos scheduler backend relies on this environment variable to set executor memory.
   // TODO: Set this only in the Mesos scheduler.
-  executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
-  executorEnvs ++= conf.getExecutorEnv
+  testExecutorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
+  testExecutorEnvs ++= conf.getExecutorEnv
 
   // Set SPARK_USER for user who is running SparkContext.
   val sparkUser = Option {
     Option(System.getProperty("user.name")).getOrElse(System.getenv("SPARK_USER"))
   }.getOrElse {
     SparkContext.SPARK_UNKNOWN_USER
   }
-  executorEnvs("SPARK_USER") = sparkUser
+  testExecutorEnvs("SPARK_USER") = sparkUser
 
   // Create and start the scheduler
   private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
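One subtlety of the new jars/files handling above: getOption(...).map(...).toSeq.flatten never produces null, so an unset property becomes an empty Seq and the files != null guard is effectively always satisfied. A small stand-alone illustration of the same expression (the property values here are made up):

import org.apache.spark.SparkConf

// Stand-alone illustration of the comma-splitting used for spark.jars / spark.files.
// An unset property flattens to an empty Seq rather than null.
val conf = new SparkConf(false).set("spark.jars", "a.jar,,b.jar")

val jars: Seq[String] =
  conf.getOption("spark.jars").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
// jars == Seq("a.jar", "b.jar"); empty entries are dropped

val files: Seq[String] =
  conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
// files == Seq() because spark.files is not set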

core/src/main/scala/org/apache/spark/deploy/Client.scala

Lines changed: 14 additions & 1 deletion
@@ -54,8 +54,21 @@ private class ClientActor(driverArgs: ClientArguments, conf: SparkConf) extends
       System.getenv().foreach{case (k, v) => env(k) = v}
 
       val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"
+
+      val classPathConf = "spark.driver.extraClassPath"
+      val classPathEntries = sys.props.get(classPathConf).toSeq.flatMap { cp =>
+        cp.split(java.io.File.pathSeparator)
+      }
+
+      val libraryPathConf = "spark.driver.extraLibraryPath"
+      val libraryPathEntries = sys.props.get(libraryPathConf).toSeq.flatMap { cp =>
+        cp.split(java.io.File.pathSeparator)
+      }
+
+      val javaOptionsConf = "spark.driver.extraJavaOptions"
+      val javaOpts = sys.props.get(javaOptionsConf)
       val command = new Command(mainClass, Seq("{{WORKER_URL}}", driverArgs.mainClass) ++
-        driverArgs.driverOptions, env)
+        driverArgs.driverOptions, env, classPathEntries, libraryPathEntries, javaOpts)
 
       val driverDescription = new DriverDescription(
         driverArgs.jarUrl,

core/src/main/scala/org/apache/spark/deploy/Command.scala

Lines changed: 4 additions & 1 deletion
@@ -22,5 +22,8 @@ import scala.collection.Map
 private[spark] case class Command(
     mainClass: String,
     arguments: Seq[String],
-    environment: Map[String, String]) {
+    environment: Map[String, String],
+    classPathEntries: Seq[String],
+    libraryPathEntries: Seq[String],
+    extraJavaOptions: Option[String] = None) {
 }
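For reference, a sketch of what a fully populated Command now looks like (Command is private[spark], so this is written as if from inside the deploy package; the driver class and all values are placeholders, not part of this patch):

package org.apache.spark.deploy

// Sketch of the extended Command case class; field names come from this diff,
// values are illustrative only.
object CommandShapeExample {
  val command = Command(
    mainClass = "org.apache.spark.deploy.worker.DriverWrapper",
    arguments = Seq("{{WORKER_URL}}", "com.example.MyDriver"),
    environment = Map("SPARK_LOCAL_IP" -> "192.168.1.10"),
    classPathEntries = Seq("/opt/deps/extra.jar"),
    libraryPathEntries = Seq("/opt/native/lib"),
    extraJavaOptions = Some("-XX:+PrintGCDetails"))
}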
