[SPARK-4277] Support external shuffle service on Standalone Worker #3142


Closed
wants to merge 4 commits

Conversation

aarondav
Contributor

@aarondav aarondav commented Nov 6, 2014

No description provided.

require(appId == myAppId, s"SASL appId $appId did not match my appId ${myAppId}")
getSecretKey()
}
// Default SecurityManager only has a single secret key, so ignore appId.
Contributor Author


I realized that these checks were not useful (since the secret key is already doing the security checks). Additionally, the check doesn't make sense on the Worker.
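
For illustration only, a minimal hypothetical sketch of a secret-key lookup that ignores the appId (the trait and class here are invented for the example; the shared secret itself carries the authentication):

    // Hypothetical sketch, not the PR's actual code: a key holder that ignores appId,
    // since the single shared secret already provides the security check.
    trait SecretKeyHolder {
      def getSaslUser(appId: String): String
      def getSecretKey(appId: String): String
    }

    class SingleSecretKeyHolder(secret: String) extends SecretKeyHolder {
      // Default SecurityManager only has a single secret key, so ignore appId.
      override def getSaslUser(appId: String): String = "sparkSaslUser"
      override def getSecretKey(appId: String): String = secret
    }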

@aarondav
Contributor Author

aarondav commented Nov 6, 2014

@pwendell Perhaps you could review this?

@aarondav aarondav changed the title [SPARK-4277] Support external shuffle service on executor [SPARK-4277] Support external shuffle service on worker Nov 6, 2014
@aarondav aarondav changed the title [SPARK-4277] Support external shuffle service on worker [SPARK-4277] Support external shuffle service on Standalone Worker Nov 6, 2014
@SparkQA

SparkQA commented Nov 6, 2014

Test build #23009 has started for PR 3142 at commit 258417c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23012 has started for PR 3142 at commit 47f49d3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23009 has finished for PR 3142 at commit 258417c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23009/
Test FAILed.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23014 has started for PR 3142 at commit 2dcdfc1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23012 has finished for PR 3142 at commit 47f49d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23012/
Test PASSed.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23014 has finished for PR 3142 at commit 2dcdfc1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23014/
Test FAILed.

@andrewor14
Contributor

retest this please

class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging {

  private val enabled = sparkConf.getBoolean("spark.shuffle.service.enabled", false)
  private val port = sparkConf.getInt("spark.shuffle.service.port", 7337)
Contributor


Should we have some node-level environment variable in addition to this? The point is that if the Worker is started on a different port, it would be good if the executors could just pick it up. Otherwise the executors have no way of figuring out which port the service is running on, unless they go check the worker logs or something. In YARN, for instance, we have a shared YARN config used by both the executors and the NodeManager. This is probably fine for a first-cut implementation, though. Any thoughts, @pwendell?
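
As a hedged sketch of how the two sides can agree on the port today (using only the config keys visible in the diff above): the Worker and the executors read the same SparkConf entry, so a shared configuration is enough.

    // Hedged sketch: worker and executors read the same SparkConf keys (shown in the
    // diff above), so they agree on the shuffle service port when the conf is shared.
    import org.apache.spark.SparkConf

    object ShuffleServicePortExample {
      def resolve(conf: SparkConf): (Boolean, Int) = {
        val enabled = conf.getBoolean("spark.shuffle.service.enabled", false)
        val port = conf.getInt("spark.shuffle.service.port", 7337)
        (enabled, port)
      }
    }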

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23022 has started for PR 3142 at commit 2dcdfc1.

  • This patch merges cleanly.

 * Optionally requires SASL authentication in order to read. See [[SecurityManager]].
 */
private[worker]
class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging {
Contributor


I wouldn't be opposed to this being called StandaloneShuffleService, now that you renamed the other one to ExternalShuffleService. That name would also match YarnShuffleService more closely. The existing name is also fine.

@aarondav
Contributor Author

aarondav commented Nov 6, 2014

Addressed or responded to all comments, save the environment variable one.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23027 has started for PR 3142 at commit 3780bd7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23022 has finished for PR 3142 at commit 2dcdfc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager) extends Logging

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23022/
Test PASSed.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23027 has finished for PR 3142 at commit 3780bd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StandaloneWorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23027/
Test PASSed.

@andrewor14
Contributor

Ok, LGTM. Now the dynamic allocation feature can be used in standalone mode too. Super cool. I'll merge this.
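
For anyone trying this out, a hedged configuration sketch; the spark.dynamicAllocation.enabled key is assumed here, so check the docs for your Spark version:

    // Hedged sketch: enabling the standalone shuffle service together with dynamic
    // allocation. spark.dynamicAllocation.enabled is an assumption in this example.
    import org.apache.spark.SparkConf

    object DynamicAllocationStandaloneExample {
      def conf(): SparkConf = new SparkConf()
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
    }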

asfgit pushed a commit that referenced this pull request Nov 7, 2014
Author: Aaron Davidson <[email protected]>

Closes #3142 from aarondav/worker and squashes the following commits:

3780bd7 [Aaron Davidson] Address comments
2dcdfc1 [Aaron Davidson] Add private[worker]
47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors)
258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor

(cherry picked from commit 6e9ef10)
Signed-off-by: Andrew Or <[email protected]>
@asfgit asfgit closed this in 6e9ef10 Nov 7, 2014
@pwendell
Contributor

pwendell commented Nov 7, 2014

Hey Aaron - this LGTM and seems pretty straightforward. Just wondering, what is the behavior if it can't bind to that port? I'm thinking about the case where multiple workers are running on each machine. Will it crash the worker or just e.g. send a log message? If the latter, I think this would still work correctly... it would be a bit hackish, but one worker would grab the port first and then the others would not start a shuffle service.

@aarondav
Contributor Author

aarondav commented Nov 7, 2014

The current behavior would crash the second worker. We could disable this behavior and revert to logging if SPARK_WORKER_INSTANCES is set and > 1, but I would be hesitant to just always catch errors from starting the server, as it would just continuously produce failing executors.
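
A purely hypothetical sketch of the fallback described above (not what this PR implements; the names here are invented for the example):

    // Hypothetical sketch, not this PR's behavior: tolerate a bind failure only when
    // SPARK_WORKER_INSTANCES suggests several workers share the machine, in which case
    // another worker presumably already owns the shuffle service port.
    import java.net.BindException

    object ShuffleServiceStartup {
      def startTolerantly(startService: () => Unit): Unit = {
        val instances = sys.env.get("SPARK_WORKER_INSTANCES").map(_.toInt).getOrElse(1)
        try {
          startService() // assumed to throw BindException when the port is already taken
        } catch {
          case e: BindException if instances > 1 =>
            // Log and continue instead of crashing this worker.
            System.err.println(s"Shuffle service port in use; continuing without it: ${e.getMessage}")
        }
      }
    }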
