[SPARK-4277] Support external shuffle service on Standalone Worker #3142
Conversation
require(appId == myAppId, s"SASL appId $appId did not match my appId ${myAppId}")
getSecretKey()
}
// Default SecurityManager only has a single secret key, so ignore appId.
I realized that these checks were not useful (since the secret key is already doing the security checks). Additionally, it doesn't make sense on the Worker.
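For illustration, a minimal sketch of what "ignore appId" means here. The class and method names below are assumptions, loosely modeled on a SecretKeyHolder-style interface; only the `getSecretKey()` call appears in the diff above:

```scala
// Hypothetical sketch (not the actual SecurityManager): a key holder that hands
// out its single shared secret regardless of the requesting appId, mirroring the
// "only has a single secret key, so ignore appId" comment above.
class SingleSecretKeyHolder(secret: String) {
  // The appId argument is accepted but not validated; possession of the secret
  // itself is what authenticates the peer during the SASL handshake.
  def getSaslUser(appId: String): String = "sparkSaslUser"
  def getSecretKey(appId: String): String = secret
}
```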
@pwendell Perhaps you could review this?
Test build #23009 has started for PR 3142 at commit
Test build #23012 has started for PR 3142 at commit
Test build #23009 has finished for PR 3142 at commit
Test FAILed.
Test build #23014 has started for PR 3142 at commit
Test build #23012 has finished for PR 3142 at commit
Test PASSed.
Test build #23014 has finished for PR 3142 at commit
Test FAILed.
retest this please
class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging {

  private val enabled = sparkConf.getBoolean("spark.shuffle.service.enabled", false)
  private val port = sparkConf.getInt("spark.shuffle.service.port", 7337)
Should we have some node-level environment variable in addition to this? The point is that if the Worker starts the shuffle service on a different port, it would be good if the executors could just pick it up. Otherwise the executors have no way of figuring out which port the service is running on, short of checking the worker logs. In YARN, for instance, there is a shared YARN config used by both the executors and the NM. Though this is probably fine for a first-cut implementation. Any thoughts, @pwendell?
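To illustrate the coordination concern, here is a minimal sketch. The config key and default come from the snippet above; the object and helper names are hypothetical, and the executor-side lookup only works if the worker and the executors actually share the same configuration value:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch only: both the worker-side service and the executor-side
// client resolve the shuffle service port from SparkConf. If the worker were
// started with a non-default port that the executors' conf does not contain,
// the executors would have no way to discover it (the concern raised above).
object ShuffleServicePortResolver {
  val DefaultPort = 7337

  // Same lookup the WorkerShuffleService snippet performs.
  def workerPort(conf: SparkConf): Int =
    conf.getInt("spark.shuffle.service.port", DefaultPort)

  // Hypothetical executor-side lookup; it relies on the executor seeing the
  // same "spark.shuffle.service.port" value as the worker.
  def executorPort(conf: SparkConf): Int =
    conf.getInt("spark.shuffle.service.port", DefaultPort)
}
```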
Test build #23022 has started for PR 3142 at commit
 * Optionally requires SASL authentication in order to read. See [[SecurityManager]].
 */
private[worker]
class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging {
I wouldn't be opposed to this being called StandaloneShuffleService now that you renamed the other one ExternalShuffleService. This name would match YarnShuffleService more closely. The existing name is also fine.
Addressed or responded to all comments, save the environment variable one.
Test build #23027 has started for PR 3142 at commit
Test build #23022 has finished for PR 3142 at commit
Test PASSed.
Test build #23027 has finished for PR 3142 at commit
Test PASSed.
Ok, LGTM. Now the dynamic allocation feature can be used in standalone mode too. Super cool. I'll merge.
Author: Aaron Davidson <[email protected]>

Closes #3142 from aarondav/worker and squashes the following commits:

3780bd7 [Aaron Davidson] Address comments
2dcdfc1 [Aaron Davidson] Add private[worker]
47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor

(cherry picked from commit 6e9ef10)
Signed-off-by: Andrew Or <[email protected]>
Hey Aaron - this LGTM and seems pretty straightforward. Just wondering, what is the behavior if it can't bind to that port? I'm thinking about the case where multiple workers are running on each machine. Will it crash the worker or just e.g. send a log message? If the latter, I think this would still work correctly... it would be a bit hackish, but one worker would grab the port first and then the others would not start a shuffle service.
The current behavior would crash the second worker. We could disable this behavior and revert to logging if SPARK_WORKER_INSTANCES is set and > 1, but I would be hesitant to just always catch errors from starting the server, as it would just continuously produce failing executors.
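A minimal sketch of the fallback being discussed. The helper name, the exception type caught, and the startup callback are all hypothetical; only the SPARK_WORKER_INSTANCES condition comes from the comment above:

```scala
import java.net.BindException

// Hypothetical sketch: only tolerate a failed port bind when multiple workers
// share the machine (SPARK_WORKER_INSTANCES > 1), so a single misconfigured
// worker still fails fast instead of silently producing executors that cannot
// fetch shuffle data.
object ShuffleServiceStartup {
  def maybeStart(startShuffleService: () => Unit): Unit = {
    val workerInstances =
      sys.env.get("SPARK_WORKER_INSTANCES").map(_.toInt).getOrElse(1)
    try {
      startShuffleService()
    } catch {
      case e: BindException if workerInstances > 1 =>
        // Another worker on this machine likely grabbed the port first;
        // warn and keep this worker alive without its own shuffle service.
        Console.err.println(
          s"Shuffle service port already in use; assuming another worker started it: $e")
      // Any other failure (or a single-worker setup) still crashes the worker.
    }
  }
}
```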