
Commit c9914cf

srowen authored and HyukjinKwon committed
[MINOR][DOCS] Add note about Spark network security
## What changes were proposed in this pull request?

In response to a recent question, this reiterates that network access to a Spark cluster should be disabled by default, and that access to its hosts and services from outside a private network should be added back explicitly. Also, some minor touch-ups while I was at it.

## How was this patch tested?

N/A

Author: Sean Owen <[email protected]>

Closes #21947 from srowen/SecurityNote.
1 parent c5fe412 commit c9914cf

File tree

2 files changed: 29 additions & 9 deletions


docs/security.md

Lines changed: 18 additions & 5 deletions
@@ -278,7 +278,7 @@ To enable authorization in the SHS, a few extra options are used:
 <table class="table">
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
-  <td>spark.history.ui.acls.enable</td>
+  <td><code>spark.history.ui.acls.enable</code></td>
   <td>false</td>
   <td>
     Specifies whether ACLs should be checked to authorize users viewing the applications in
@@ -292,15 +292,15 @@ To enable authorization in the SHS, a few extra options are used:
   </td>
 </tr>
 <tr>
-  <td>spark.history.ui.admin.acls</td>
+  <td><code>spark.history.ui.admin.acls</code></td>
   <td>None</td>
   <td>
     Comma separated list of users that have view access to all the Spark applications in history
     server.
   </td>
 </tr>
 <tr>
-  <td>spark.history.ui.admin.acls.groups</td>
+  <td><code>spark.history.ui.admin.acls.groups</code></td>
   <td>None</td>
   <td>
     Comma separated list of groups that have view access to all the Spark applications in history
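To see the properties touched up above in use: enabling history-server ACLs is a matter of setting these options in spark-defaults.conf. A minimal sketch, in which the user and group names are placeholder assumptions, not part of this change:

    spark.history.ui.acls.enable        true
    spark.history.ui.admin.acls         alice,bob
    spark.history.ui.admin.acls.groups  admins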
@@ -501,6 +501,7 @@ can be accomplished by setting `spark.ssl.useNodeLocalConf` to `true`. In that c
 provided by the user on the client side are not used.
 
 ### Mesos mode
+
 Mesos 1.3.0 and newer supports `Secrets` primitives as both file-based and environment based
 secrets. Spark allows the specification of file-based and environment variable based secrets with
 `spark.mesos.driver.secret.filenames` and `spark.mesos.driver.secret.envkeys`, respectively.
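For illustration, these two options are ordinarily supplied at submission time. A sketch, in which the master URL, secret path, and environment key are all placeholder assumptions:

    ./bin/spark-submit \
      --master mesos://mesos-master:5050 \
      --conf spark.mesos.driver.secret.filenames=/mnt/secrets/db-password \
      --conf spark.mesos.driver.secret.envkeys=DB_PASSWORD \
      ...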
@@ -562,8 +563,12 @@ Security.
 
 # Configuring Ports for Network Security
 
-Spark makes heavy use of the network, and some environments have strict requirements for using tight
-firewall settings. Below are the primary ports that Spark uses for its communication and how to
+Generally speaking, a Spark cluster and its services are not deployed on the public internet.
+They are generally private services, and should only be accessible within the network of the
+organization that deploys Spark. Access to the hosts and ports used by Spark services should
+be limited to origin hosts that need to access the services.
+
+Below are the primary ports that Spark uses for its communication and how to
 configure those ports.
 
 ## Standalone mode only
@@ -597,6 +602,14 @@ configure those ports.
   <td><code>SPARK_MASTER_PORT</code></td>
   <td>Set to "0" to choose a port randomly. Standalone mode only.</td>
 </tr>
+<tr>
+  <td>External Service</td>
+  <td>Standalone Master</td>
+  <td>6066</td>
+  <td>Submit job to cluster via REST API</td>
+  <td><code>spark.master.rest.port</code></td>
+  <td>Use <code>spark.master.rest.enabled</code> to enable/disable this service. Standalone mode only.</td>
+</tr>
 <tr>
   <td>Standalone Master</td>
   <td>Standalone Worker</td>

docs/spark-standalone.md

Lines changed: 11 additions & 4 deletions
@@ -362,8 +362,15 @@ You can run Spark alongside your existing Hadoop cluster by just launching it as
 
 # Configuring Ports for Network Security
 
-Spark makes heavy use of the network, and some environments have strict requirements for using
-tight firewall settings. For a complete list of ports to configure, see the
+Generally speaking, a Spark cluster and its services are not deployed on the public internet.
+They are generally private services, and should only be accessible within the network of the
+organization that deploys Spark. Access to the hosts and ports used by Spark services should
+be limited to origin hosts that need to access the services.
+
+This is particularly important for clusters using the standalone resource manager, as they do
+not support fine-grained access control in a way that other resource managers do.
+
+For a complete list of ports to configure, see the
 [security page](security.html#configuring-ports-for-network-security).
 
 # High Availability
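The added guidance (restrict each service to the origin hosts that need it) maps directly onto ordinary firewall rules. A sketch using iptables, where the 10.0.0.0/8 subnet is an assumed example, 7077 is the standalone master's default port, and 6066 is the REST submission port from the table in security.md:

    # Accept Spark master traffic only from the internal network...
    iptables -A INPUT -p tcp --dport 7077 -s 10.0.0.0/8 -j ACCEPT
    iptables -A INPUT -p tcp --dport 6066 -s 10.0.0.0/8 -j ACCEPT
    # ...and drop those ports for everyone else.
    iptables -A INPUT -p tcp --dport 7077 -j DROP
    iptables -A INPUT -p tcp --dport 6066 -j DROP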
@@ -376,7 +383,7 @@ By default, standalone scheduling clusters are resilient to Worker failures (ins
 
 Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling _new_ applications -- applications that were already running during Master failover are unaffected.
 
-Learn more about getting started with ZooKeeper [here](http://zookeeper.apache.org/doc/current/zookeeperStarted.html).
+Learn more about getting started with ZooKeeper [here](https://zookeeper.apache.org/doc/current/zookeeperStarted.html).
 
 **Configuration**
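The ZooKeeper-backed mode described in this hunk is configured through SPARK_DAEMON_JAVA_OPTS in spark-env.sh. A sketch, with a placeholder ZooKeeper quorum:

    # spark-env.sh: enable ZooKeeper-based Master recovery (hosts are placeholders)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"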

@@ -419,6 +426,6 @@ In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spa
 
 **Details**
 
-* This solution can be used in tandem with a process monitor/manager like [monit](http://mmonit.com/monit/), or just to enable manual recovery via restart.
+* This solution can be used in tandem with a process monitor/manager like [monit](https://mmonit.com/monit/), or just to enable manual recovery via restart.
 * While filesystem recovery seems straightforwardly better than not doing any recovery at all, this mode may be suboptimal for certain development or experimental purposes. In particular, killing a master via stop-master.sh does not clean up its recovery state, so whenever you start a new Master, it will enter recovery mode. This could increase the startup time by up to 1 minute if it needs to wait for all previously-registered Workers/clients to timeout.
 * While it's not officially supported, you could mount an NFS directory as the recovery directory. If the original Master node dies completely, you could then start a Master on a different node, which would correctly recover all previously registered Workers/applications (equivalent to ZooKeeper recovery). Future applications will have to be able to find the new Master, however, in order to register.
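Tying these bullets back to configuration: single-node filesystem recovery is likewise enabled through SPARK_DAEMON_JAVA_OPTS, per the hunk header above. A sketch, with the recovery directory as a placeholder (mount it on NFS for the cross-node variant described in the last bullet):

    # spark-env.sh: enable single-node recovery via the local filesystem
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
      -Dspark.deploy.recoveryDirectory=/var/spark/recovery"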
