
[SPARK-10004] [shuffle] Perform auth checks when clients read shuffle data. #8218


Closed
vanzin wants to merge 9 commits from vanzin:SPARK-10004

Conversation

@vanzin (Contributor) commented Aug 15, 2015

To correctly isolate applications, when requests to read shuffle data
arrive at the shuffle service, proper authorization checks need to
be performed. This change makes sure that only the application that
created the shuffle data can read from it.

Such checks are only enabled when "spark.authenticate" is enabled,
otherwise there's no secure way to make sure that the client is really
who it says it is.

@SparkQA commented Aug 15, 2015

Test build #1618 has finished for PR 8218 at commit 17eb187.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 15, 2015

My understanding is that the network package is not a public library, so I added a MiMa exclude for the whole package to get the build going.

@SparkQA commented Aug 15, 2015

Test build #40971 has finished for PR 8218 at commit c68deab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 15, 2015

retest this please

@SparkQA commented Aug 16, 2015

Test build #40976 has finished for PR 8218 at commit c68deab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

client1 = clientFactory.createClient(TestUtils.getLocalHost(),
  blockServer.getPort());

final AtomicBoolean result = new AtomicBoolean(false);
Inline review comment (Contributor):

how about renaming this to gotException? I was momentarily confused, thinking this was the result of a successful request.

@squito (Contributor) commented Aug 17, 2015

I'm not an expert on this part of the code, but it looks sane. I just left a few minor comments.

@@ -19,17 +19,24 @@

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
Inline review comment (Contributor):

nit: the AtomicLong import looks unused; delete it.

@SparkQA commented Aug 17, 2015

Test build #41049 has finished for PR 8218 at commit 292a299.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 17, 2015

Test build #41057 has finished for PR 8218 at commit fadff27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 18, 2015

Test build #41156 has finished for PR 8218 at commit 3cc9321.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 19, 2015

Let's try pinging a couple of people. @rxin @aarondav

@SparkQA commented Aug 20, 2015

Test build #41329 has finished for PR 8218 at commit 4d19ed5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
@SparkQA commented Aug 22, 2015

Test build #41386 timed out for PR 8218 at commit 8153497 after a configured wait of 175m.

@vanzin (Contributor, author) commented Aug 24, 2015

retest this please

@SparkQA commented Aug 24, 2015

Test build #41460 timed out for PR 8218 at commit 8153497 after a configured wait of 175m.

@vanzin (Contributor, author) commented Aug 24, 2015

retest this please

@SparkQA commented Aug 25, 2015

Test build #41481 has finished for PR 8218 at commit 8153497.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala
@vanzin (Contributor, author) commented Aug 25, 2015

@pwendell here's an example of more timeouts: the last timed-out build took 154m just for the Java/Scala tests; the failed build above took 124m for the same tests.

@SparkQA commented Aug 25, 2015

Test build #41540 has finished for PR 8218 at commit d25c6cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 27, 2015

Ping?

@pwendell (Contributor) commented:

ping @aarondav

@@ -70,6 +70,7 @@

  private final Channel channel;
  private final TransportResponseHandler handler;
  private String clientId;
Inline review comment (Contributor):

Can we use an Optional here, or annotate it as @Nullable?
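
For example, with the JSR-305 annotation (a sketch; the wrapper class name is a stand-in, and whether the project pulls in the javax.annotation dependency is an assumption):

  import javax.annotation.Nullable;

  public class TransportClientSketch {
    // Set only after a successful SASL handshake; stays null when
    // authentication is disabled, which is what @Nullable documents.
    @Nullable
    private String clientId;

    @Nullable
    public String getClientId() {
      return clientId;
    }
  }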

@rxin (Contributor) commented Aug 28, 2015

Looks alright to me - would be good if @aarondav takes a look too.

@SparkQA commented Aug 29, 2015

Test build #41777 has finished for PR 8218 at commit b491ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 31, 2015

retest this please

@vanzin (Contributor, author) commented Aug 31, 2015

@aarondav do you have any comments to add? Otherwise I really want to merge this soon.

@SparkQA commented Aug 31, 2015

Test build #41835 has finished for PR 8218 at commit b491ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Aug 31, 2015

I triggered the tests again. I think for this one, since it is so early in the release cycle for 1.6, we can also optimistically merge it for now and do post-hoc reviews, provided that @vanzin is not going to disappear :)

@SparkQA commented Sep 1, 2015

Test build #1707 has finished for PR 8218 at commit b491ac7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Sep 2, 2015

These PySpark tests fail on every other PR. I'll merge this since I haven't seen any more feedback.

asfgit closed this in 2da3a9e on Sep 2, 2015
@@ -109,15 +111,34 @@ public void connectionTerminated(Channel channel) {
    }
  }

  @Override
  public void checkAuthorization(TransportClient client, long streamId) {
    if (client.getClientId() != null) {
Inline review comment (Contributor):

I'm confused about this: since getClientId returns null if the client did not enable spark.authenticate, does that mean any application that did not enable SASL can read my shuffle files?

vanzin (PR author) replied:

That would be true, but I don't believe you can actually set things up like that. Authentication is either enabled on the server or it's not, for all clients.

Reviewer replied:

I see, do you mean if the server enabled authentication, then any client that did not also enable it will fail the handshake in the first place?

vanzin (PR author) replied:

Correct.
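
Putting the two hunks quoted in this thread together, the check reads roughly as follows (assembled from the quoted fragments; the streams map lookup and the format arguments are assumed, not quoted):

  @Override
  public void checkAuthorization(TransportClient client, long streamId) {
    // A non-null client ID means SASL authentication succeeded; when
    // spark.authenticate is off, no identity exists and nothing is checked.
    if (client.getClientId() != null) {
      StreamState state = streams.get(streamId);  // assumed lookup
      Preconditions.checkArgument(state != null, "Unknown stream ID.");
      if (!client.getClientId().equals(state.appId)) {
        throw new SecurityException(String.format(
            "Client %s not authorized to read stream %d (app %s).",
            client.getClientId(), streamId, state.appId));
      }
    }
  }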

@andrewor14 (Contributor) commented:

@vanzin sorry I couldn't review this in time. While looking through the code I was wondering about two things:

(1) See my inline comment

(2) Looks like the app ID is relatively easy to spoof? Aren't app IDs listed in the standalone Master / RM UI? Should we use something more like a secret, as we do in YarnShuffleService? The only problem there is that in standalone mode it's relatively difficult to securely pass secrets from drivers to executors.

@vanzin (Contributor, author) commented Sep 2, 2015

Hi @andrewor14,

This patch is not very useful outside YARN. In standalone mode, all apps run as the same user and authenticate using the same user name and secret, so there's no way to prevent one app from reading another's shuffle files (either through the shuffle service or directly from disk).

On YARN, each app authenticates itself using the app's ID as the user name and a secure, per-app secret (see SecurityManager::generateSecretKey). Authentication is not based on the app simply saying who it is; the app needs to know that secret. After SASL auth occurs, we just match the app that owns the shuffle file against the app the connection was authenticated as. You can't spoof it.

@andrewor14 (Contributor) commented:

I see, so it seems there are two kinds of authentication: one during the handshake, where we use a secure secret (i.e. the one passed from ExecutorRunnable to YarnShuffleService), and another for reading each shuffle block, where we use the app ID.

For the latter, could we use the same shuffle secret instead of the app ID? That would require us to pass the secret to the executor JVM securely, which could be difficult. This patch is already a strict improvement as is, so I'm just wondering whether we could strengthen the security guarantees further. Maybe it's not worth it.

@vanzin (Contributor, author) commented Sep 2, 2015

> it seems there are two kinds of authentication

No, there's one kind of authentication. I don't know what "handshake" you're talking about. Whenever you open a connection to read shuffle blocks and spark.authenticate is enabled (whether you're connecting to a block manager directly or to an external shuffle service), you perform SASL authentication and provide the user name (= app ID) and secret for your application.

The secret is already distributed securely on YARN; it's stashed in the credentials held by the UserGroupInformation object used to start the application.
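
As a rough sketch of that mechanism (the lookup key and wrapper class are illustrative, not Spark's actual constants; the Hadoop Credentials calls themselves are standard API):

  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.security.Credentials;
  import org.apache.hadoop.security.UserGroupInformation;

  public class SecretStashSketch {
    // Illustrative lookup key; Spark uses its own constant for this.
    private static final Text LOOKUP_KEY = new Text("sparkCookie");

    // Stash a secret in the current user's credentials. Containers launched
    // with these credentials (the AM, the executors) can read it back later.
    public static void stashSecret(byte[] secret) throws IOException {
      UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
      Credentials creds = ugi.getCredentials();
      creds.addSecretKey(LOOKUP_KEY, secret);
      ugi.addCredentials(creds);
    }

    public static byte[] readSecret() throws IOException {
      return UserGroupInformation.getCurrentUser()
          .getCredentials().getSecretKey(LOOKUP_KEY);
    }
  }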

@andrewor14 (Contributor) commented:

> No, there's one kind of authentication. I don't know what "handshake" you're talking about.

What? I'm talking about the shuffle secret the executor container needs when it initially registers with the shuffle service:

  ctx.setServiceData(Collections.singletonMap("spark_shuffle", secretBytes))

The whole motivation for adding that in the first place was app-level authentication during registration.

@vanzin (Contributor, author) commented Sep 2, 2015

Yes, and if you look at that code, that secret is the return value of SecurityManager.getSecretKey(), which on YARN is stored in the UserGroupInformation object. Here's the code to make it clear:

  val secretKey = SparkHadoopUtil.get.getSecretKeyFromUserCredentials(sparkSecretLookupKey)
  if (secretKey != null) {
    logDebug("in yarn mode, getting secret from credentials")
    return new Text(secretKey).toString
  } else {
    logDebug("getSecretKey: yarn mode, secret key from credentials is null")
  }
  val cookie = akka.util.Crypt.generateSecureCookie
  // if we generated the secret then we must be the first, so let's set it so it
  // gets used by everyone else
  SparkHadoopUtil.get.addSecretKeyToUserCredentials(sparkSecretLookupKey, cookie)

If the secret doesn't yet exist (i.e. before the app is submitted), then a new one is created and stashed in the user's credentials. If it already exists (e.g. for the AM and all executors), then it's used. Authentication works fine just as it was originally designed. This patch is not about authentication. It's about authorization.

Preconditions.checkArgument(state != null, "Unknown stream ID.");
if (!client.getClientId().equals(state.appId)) {
  throw new SecurityException(String.format(
    "Client %s not authorized to read stream %d (app %s).",
Inline review comment (Contributor):

Should we not disclose the actual appId in the exception message?

vanzin (PR author) replied:

App IDs are not secret.

vanzin deleted the SPARK-10004 branch on September 9, 2015 at 23:07.
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
[SPARK-10004] [shuffle] Perform auth checks when clients read shuffle data.

To correctly isolate applications, when requests to read shuffle data
arrive at the shuffle service, proper authorization checks need to
be performed. This change makes sure that only the application that
created the shuffle data can read from it.

Such checks are only enabled when "spark.authenticate" is enabled,
otherwise there's no secure way to make sure that the client is really
who it says it is.

Author: Marcelo Vanzin <[email protected]>

Closes apache#8218 from vanzin/SPARK-10004.

(cherry picked from commit 2da3a9e)

Conflicts:
	core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala