
[SPARK-10004] [shuffle] Perform auth checks when clients read shuffle data. #8218


Closed
vanzin wants to merge 9 commits from vanzin:SPARK-10004

Conversation

@vanzin (Contributor) commented Aug 15, 2015

To correctly isolate applications, when requests to read shuffle data
arrive at the shuffle service, proper authorization checks need to
be performed. This change makes sure that only the application that
created the shuffle data can read from it.

Such checks are only enabled when "spark.authenticate" is enabled,
otherwise there's no secure way to make sure that the client is really
who it says it is.

@SparkQA commented Aug 15, 2015

Test build #1618 has finished for PR 8218 at commit 17eb187.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 15, 2015

My understanding is that the network package is not a public library, so I added a MiMa exclude for the whole package to get the build going.

@SparkQA commented Aug 15, 2015

Test build #40971 has finished for PR 8218 at commit c68deab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 15, 2015

retest this please

@SparkQA commented Aug 16, 2015

Test build #40976 has finished for PR 8218 at commit c68deab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

client1 = clientFactory.createClient(TestUtils.getLocalHost(),
  blockServer.getPort());

final AtomicBoolean result = new AtomicBoolean(false);
Inline review comment (Contributor):

how about renaming this to gotException? I was momentarily confused, thinking this was the result of a successful request.

@squito (Contributor) commented Aug 17, 2015

I'm not an expert on this part of the code, but it looks sane. I just left a few minor comments.

@@ -19,17 +19,24 @@

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
Inline review comment (Contributor):

nit: the AtomicLong import looks unused; delete it.

@SparkQA commented Aug 17, 2015

Test build #41049 has finished for PR 8218 at commit 292a299.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 17, 2015

Test build #41057 has finished for PR 8218 at commit fadff27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 18, 2015

Test build #41156 has finished for PR 8218 at commit 3cc9321.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 19, 2015

Let's try pinging a couple of people. @rxin @aarondav

@SparkQA commented Aug 20, 2015

Test build #41329 has finished for PR 8218 at commit 4d19ed5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
@SparkQA commented Aug 22, 2015

Test build #41386 timed out for PR 8218 at commit 8153497 after a configured wait of 175m.

@vanzin (Contributor, author) commented Aug 24, 2015

retest this please

@SparkQA commented Aug 24, 2015

Test build #41460 timed out for PR 8218 at commit 8153497 after a configured wait of 175m.

@vanzin (Contributor, author) commented Aug 24, 2015

retest this please

@SparkQA commented Aug 25, 2015

Test build #41481 has finished for PR 8218 at commit 8153497.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala
@vanzin (Contributor, author) commented Aug 25, 2015

@pwendell here's an example of more timeouts: the last timed-out build took 154m just for the Java/Scala tests; the failed build above took 124m for the same tests.

@SparkQA commented Aug 25, 2015

Test build #41540 has finished for PR 8218 at commit d25c6cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 27, 2015

Ping?

@pwendell (Contributor) commented:

ping @aarondav

@@ -70,6 +70,7 @@

  private final Channel channel;
  private final TransportResponseHandler handler;
  private String clientId;
Inline review comment (Contributor):

Can we use an Optional here, or annotate it as @Nullable?
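
For example, with the JSR-305 annotation (a sketch; the wrapper class name is a stand-in, and whether the project pulls in the javax.annotation dependency is an assumption):

  import javax.annotation.Nullable;

  public class TransportClientSketch {
    // Set only after a successful SASL handshake; stays null when
    // authentication is disabled, which is what @Nullable documents.
    @Nullable
    private String clientId;

    @Nullable
    public String getClientId() {
      return clientId;
    }
  }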

@rxin (Contributor) commented Aug 28, 2015

Looks alright to me - would be good if @aarondav takes a look too.

@SparkQA commented Aug 29, 2015

Test build #41777 has finished for PR 8218 at commit b491ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Aug 31, 2015

retest this please

@vanzin (Contributor, author) commented Aug 31, 2015

@aarondav do you have any comments to add? Otherwise I really want to merge this soon.

@SparkQA commented Aug 31, 2015

Test build #41835 has finished for PR 8218 at commit b491ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Aug 31, 2015

I triggered the tests again. I think for this one, since it is so early in the release cycle for 1.6, we can also optimistically merge it for now and do post-hoc reviews, provided that @vanzin is not going to disappear :)

@SparkQA commented Sep 1, 2015

Test build #1707 has finished for PR 8218 at commit b491ac7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, author) commented Sep 2, 2015

These PySpark tests fail on every other PR. I'll merge this since I haven't seen any more feedback.

asfgit closed this in 2da3a9e on Sep 2, 2015
@@ -109,15 +111,34 @@ public void connectionTerminated(Channel channel) {
    }
  }

  @Override
  public void checkAuthorization(TransportClient client, long streamId) {
    if (client.getClientId() != null) {
Inline review comment (Contributor):

I'm confused about this: since getClientId returns null if the client did not enable spark.authenticate, does that mean any application that did not enable SASL can read my shuffle files?

vanzin (PR author) replied:

That would be true, but I don't believe you can actually set things up like that. Authentication is either enabled on the server or it's not, for all clients.

Reviewer replied:

I see, do you mean if the server enabled authentication, then any client that did not also enable it will fail the handshake in the first place?

vanzin (PR author) replied:

Correct.
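
Putting the two hunks quoted in this thread together, the check reads roughly as follows (assembled from the quoted fragments; the streams map lookup and the format arguments are assumed, not quoted):

  @Override
  public void checkAuthorization(TransportClient client, long streamId) {
    // A non-null client ID means SASL authentication succeeded; when
    // spark.authenticate is off, no identity exists and nothing is checked.
    if (client.getClientId() != null) {
      StreamState state = streams.get(streamId);  // assumed lookup
      Preconditions.checkArgument(state != null, "Unknown stream ID.");
      if (!client.getClientId().equals(state.appId)) {
        throw new SecurityException(String.format(
            "Client %s not authorized to read stream %d (app %s).",
            client.getClientId(), streamId, state.appId));
      }
    }
  }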

@andrewor14 (Contributor) commented:

@vanzin sorry I couldn't review this in time. While looking through the code I was wondering about two things:

(1) See my inline comment

(2) Looks like the app ID is relatively easy to spoof? Aren't app IDs listed in the standalone Master / RM UI? Should we use something more like a secret, as we do in YarnShuffleService? The only problem there is that in standalone mode it's relatively difficult to securely pass secrets from drivers to executors.

@vanzin (Contributor, author) commented Sep 2, 2015

Hi @andrewor14,

This patch is not very useful outside YARN. In standalone mode, all apps run as the same user and authenticate using the same user name and secret, so there's no way to prevent one app from reading another's shuffle files (either through the shuffle service or directly from disk).

On YARN, each app authenticates itself using the app's ID as the user name and a secure, per-app secret (see SecurityManager::generateSecretKey). Authentication is not based on the app simply saying who it is; the app needs to know that secret. After SASL auth occurs, we just match the app that owns the shuffle file against the app the connection was authenticated as. You can't spoof it.

@andrewor14 (Contributor) commented:

I see, so it seems there are two kinds of authentication: one during the handshake, where we use a secure secret (i.e. the one passed from ExecutorRunnable to YarnShuffleService), and another for reading each shuffle block, where we use the app ID.

For the latter, could we use the same shuffle secret instead of the app ID? That would require us to pass the secret to the executor JVM securely, which could be difficult. This patch is already a strict improvement as is, so I'm just wondering whether we could strengthen the security guarantees further. Maybe it's not worth it.

@vanzin (Contributor, author) commented Sep 2, 2015

> it seems there are two kinds of authentication

No, there's one kind of authentication. I don't know what "handshake" you're talking about. Whenever you open a connection to read shuffle blocks and spark.authenticate is enabled (whether you're connecting to a block manager directly or to an external shuffle service), you perform SASL authentication and provide the user name (= app ID) and secret for your application.

The secret is already distributed securely on YARN; it's stashed in the credentials held by the UserGroupInformation object used to start the application.
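
As a rough sketch of that mechanism (the lookup key and wrapper class are illustrative, not Spark's actual constants; the Hadoop Credentials calls themselves are standard API):

  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.security.Credentials;
  import org.apache.hadoop.security.UserGroupInformation;

  public class SecretStashSketch {
    // Illustrative lookup key; Spark uses its own constant for this.
    private static final Text LOOKUP_KEY = new Text("sparkCookie");

    // Stash a secret in the current user's credentials. Containers launched
    // with these credentials (the AM, the executors) can read it back later.
    public static void stashSecret(byte[] secret) throws IOException {
      UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
      Credentials creds = ugi.getCredentials();
      creds.addSecretKey(LOOKUP_KEY, secret);
      ugi.addCredentials(creds);
    }

    public static byte[] readSecret() throws IOException {
      return UserGroupInformation.getCurrentUser()
          .getCredentials().getSecretKey(LOOKUP_KEY);
    }
  }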

@andrewor14 (Contributor) commented:

> No, there's one kind of authentication. I don't know what "handshake" you're talking about.

What? I'm talking about the shuffle secret the executor container needs when it initially registers with the shuffle service:

  ctx.setServiceData(Collections.singletonMap("spark_shuffle", secretBytes))

The whole motivation for adding that in the first place was app-level authentication during registration.

@vanzin (Contributor, author) commented Sep 2, 2015

Yes, and if you look at that code, that secret is the return value of SecurityManager.getSecretKey(), which on YARN is stored in the UserGroupInformation object. Here's the code to make it clear:

  val secretKey = SparkHadoopUtil.get.getSecretKeyFromUserCredentials(sparkSecretLookupKey)
  if (secretKey != null) {
    logDebug("in yarn mode, getting secret from credentials")
    return new Text(secretKey).toString
  } else {
    logDebug("getSecretKey: yarn mode, secret key from credentials is null")
  }
  val cookie = akka.util.Crypt.generateSecureCookie
  // if we generated the secret then we must be the first, so let's set it so it
  // gets used by everyone else
  SparkHadoopUtil.get.addSecretKeyToUserCredentials(sparkSecretLookupKey, cookie)

If the secret doesn't yet exist (i.e. before the app is submitted), then a new one is created and stashed in the user's credentials. If it already exists (e.g. for the AM and all executors), then it's used. Authentication works fine just as it was originally designed. This patch is not about authentication. It's about authorization.

Preconditions.checkArgument(state != null, "Unknown stream ID.");
if (!client.getClientId().equals(state.appId)) {
  throw new SecurityException(String.format(
    "Client %s not authorized to read stream %d (app %s).",
Inline review comment (Contributor):

Should we not disclose the actual appId in the exception message?

vanzin (PR author) replied:

App IDs are not secret.

vanzin deleted the SPARK-10004 branch on September 9, 2015 at 23:07.
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
[SPARK-10004] [shuffle] Perform auth checks when clients read shuffle data.

To correctly isolate applications, when requests to read shuffle data
arrive at the shuffle service, proper authorization checks need to
be performed. This change makes sure that only the application that
created the shuffle data can read from it.

Such checks are only enabled when "spark.authenticate" is enabled,
otherwise there's no secure way to make sure that the client is really
who it says it is.

Author: Marcelo Vanzin <[email protected]>

Closes apache#8218 from vanzin/SPARK-10004.

(cherry picked from commit 2da3a9e)

Conflicts:
	core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala