[SPARK-10004] [shuffle] Perform auth checks when clients read shuffle data. #8218
Conversation
To correctly isolate applications, when requests to read shuffle data arrive at the shuffle service, proper authorization checks need to be performed. This change makes sure that only the application that created the shuffle data can read from it. Such checks are only enabled when "spark.authenticate" is enabled; otherwise there's no secure way to make sure that the client is really who it says it is.
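The ownership check described above can be sketched as follows. This is a simplified, hypothetical illustration — the class and method names here are not from the actual patch (which adds the check to Spark's stream manager in the network package):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of per-application stream ownership checks.
// A stream is registered under the app that created it; readers must
// present a matching authenticated client id before data is served.
class StreamRegistry {
    private static final class StreamState {
        final String appId;
        StreamState(String appId) { this.appId = appId; }
    }

    private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();

    // Called when an app registers shuffle data for serving.
    void register(long streamId, String appId) {
        streams.put(streamId, new StreamState(appId));
    }

    // Called on every read. A null clientId means the connection was not
    // authenticated (spark.authenticate=false), so there is nothing to check.
    void checkAuthorization(String clientId, long streamId) {
        if (clientId == null) {
            return;
        }
        StreamState state = streams.get(streamId);
        if (state == null) {
            throw new IllegalArgumentException("Unknown stream ID.");
        }
        if (!clientId.equals(state.appId)) {
            throw new SecurityException(String.format(
                "Client %s not authorized to read stream %d.", clientId, streamId));
        }
    }
}
```

A reader presenting a client id that does not match the registering app gets a `SecurityException` instead of the data.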
Test build #1618 has finished for PR 8218 at commit

My understanding is that the network package is not a public library, so I added a mima exclude for the whole package to get the build going.

Test build #40971 has finished for PR 8218 at commit

retest this please

Test build #40976 has finished for PR 8218 at commit
client1 = clientFactory.createClient(TestUtils.getLocalHost(),
    blockServer.getPort());

final AtomicBoolean result = new AtomicBoolean(false);
how about renaming this to gotException? I was momentarily confused thinking this was the result of a successful request.

I'm not an expert on this part of the code, but it looks sane. I just left a few minor comments.
@@ -19,17 +19,24 @@

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
nit: unused, delete
Test build #41049 has finished for PR 8218 at commit

Test build #41057 has finished for PR 8218 at commit

Test build #41156 has finished for PR 8218 at commit

Test build #41329 has finished for PR 8218 at commit
Conflicts: network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
Test build #41386 timed out for PR 8218 at commit

retest this please

Test build #41460 timed out for PR 8218 at commit

retest this please

Test build #41481 has finished for PR 8218 at commit
Conflicts: core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala
@pwendell here's an example of more timeouts; the last timed-out build took 154m just for Java / Scala tests; the failed build above took 124m for the same tests.
Test build #41540 has finished for PR 8218 at commit

Ping?

ping @aarondav
@@ -70,6 +70,7 @@

private final Channel channel;
private final TransportResponseHandler handler;
private String clientId;
Can we use an Optional here, or annotate it as @Nullable?
Looks alright to me - would be good if @aarondav takes a look too.
Test build #41777 has finished for PR 8218 at commit

retest this please

@aarondav do you have any comments to add? Otherwise I really want to merge this soon.

Test build #41835 has finished for PR 8218 at commit
I triggered the tests again. I think for this one, since it is so early in the release cycle for 1.6, we can also optimistically merge it for now and do post-hoc reviews, provided that @vanzin is not going to disappear :)
Test build #1707 has finished for PR 8218 at commit

These pyspark tests fail on every other PR. I'll merge this since I haven't seen any more feedback.
@@ -109,15 +111,34 @@ public void connectionTerminated(Channel channel) {
    }
  }

  @Override
  public void checkAuthorization(TransportClient client, long streamId) {
    if (client.getClientId() != null) {
I'm confused about this: since getClientId returns null if the client did not enable spark.authenticate, does that mean any application that did not enable SASL can read my shuffle files?
That would be true, but I don't believe you can actually set things up like that. Authentication is either enabled on the server or it's not, for all clients.
I see, do you mean if the server enabled authentication, then any client that did not also enable it will fail the handshake in the first place?
Correct.
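The point settled above — that a server with authentication enabled rejects any client that skips or fails the handshake — can be modeled with a toy sketch. The names here are illustrative and are not Spark's real API (Spark implements this with SASL bootstraps installed on the server and client transports):

```java
// Toy model: when the server is started with a secret
// (spark.authenticate=true), every connection must pass the handshake
// before any request is served; with no secret, no client id is recorded.
class ToyShuffleServer {
    private final String serverSecret; // null => authentication disabled
    private String authenticatedClientId;

    ToyShuffleServer(String serverSecret) { this.serverSecret = serverSecret; }

    // Returns the client id the server will associate with the connection,
    // or null when authentication is disabled.
    String handshake(String appId, String clientSecret) {
        if (serverSecret == null) {
            return null;
        }
        if (!serverSecret.equals(clientSecret)) {
            throw new SecurityException("handshake failed");
        }
        authenticatedClientId = appId;
        return appId;
    }

    String readBlock(String blockId) {
        if (serverSecret != null && authenticatedClientId == null) {
            // A client that skipped or failed the handshake never gets here.
            throw new SecurityException("not authenticated");
        }
        return "data:" + blockId;
    }
}
```

A non-SASL client talking to an auth-enabled server is rejected before any block is read, which is why a null client id can never coexist with other clients' authenticated ids on the same server.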
@vanzin sorry I couldn't review this in time. While looking through the code I was wondering about two things: (1) See my inline comment (2) Looks like the app ID is relatively easy to spoof? Aren't they listed in the standalone Master / RM UI? Should we use something that's more like a secret, like what we do in
Hi @andrewor14, this patch is not very useful outside YARN. On Standalone, all apps run as the same user, and authenticate using the same user and secret. So there's no way to prevent one app from reading another's shuffle files (either through the shuffle service or reading them directly from disk). On YARN, each app authenticates itself using the app's ID as the user name, and a secure, per-app secret (see
I see, so it seems then that there are two kinds of authentication, one during the handshake where we use some secure secret (i.e. the one passed from

For the latter, could we use the same shuffle secret instead of the app ID? That would require us to pass the secret to the executor JVM securely, which could be difficult. This patch is already a strict improvement as is, so I'm just wondering whether we could strengthen the security guarantees further. Maybe it's not worth it.
No, there's only one kind of authentication. I don't know what this "handshake" you talk about is. Whenever you open a connection to read shuffle blocks and

The secret is already distributed securely on YARN; it's stashed in the credentials held by the
What? I'm talking about the shuffle secret the executor container needs when it initially registers with the shuffle service:
Yes, and if you look at that code, that secret is the return value of
If the secret doesn't yet exist (i.e. before the app is submitted), then a new one is created and stashed in the user's credentials. If it already exists (e.g. for the AM and all executors), then it's used. Authentication works fine just as it was originally designed. This patch is not about authentication. It's about authorization.
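The authentication/authorization distinction drawn above can be boiled down to a short sketch. The helper names are hypothetical, not Spark's real API; in Spark the per-app secret comparison happens in the SASL layer at connection setup, while the per-stream check is what this patch adds:

```java
import java.util.Map;

// Authentication proves *who* the client is (shared secret, checked once
// at connection setup); authorization decides *what* that proven identity
// may read (checked on every request). Illustrative sketch only.
final class AuthSketch {
    // Authentication: look up the app's secret and compare.
    static String authenticate(String appId, String secret,
                               Map<String, String> perAppSecrets) {
        if (!secret.equals(perAppSecrets.get(appId))) {
            throw new SecurityException("authentication failed for " + appId);
        }
        return appId; // becomes the connection's client id
    }

    // Authorization: the authenticated id must own the stream it reads.
    static void authorize(String clientId, String streamOwnerAppId) {
        if (!clientId.equals(streamOwnerAppId)) {
            throw new SecurityException(
                "client " + clientId + " may not read app " + streamOwnerAppId);
        }
    }

    private AuthSketch() {}
}
```

Spoofing the (public) app ID in the authorization step buys an attacker nothing, because the client id used there is the one established by the secret-based authentication step, not one the client can freely claim.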
Preconditions.checkArgument(state != null, "Unknown stream ID.");
if (!client.getClientId().equals(state.appId)) {
  throw new SecurityException(String.format(
      "Client %s not authorized to read stream %d (app %s).",
Should we not disclose the actual appId in the exception message ?
App IDs are not secret.
Author: Marcelo Vanzin <[email protected]>

Closes apache#8218 from vanzin/SPARK-10004.

(cherry picked from commit 2da3a9e)

Conflicts: core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala