Skip to content

[SPARK-4006] In long running contexts, we encountered the situation of d... #2915

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

tsliwowicz
Copy link
Contributor

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.1

…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache/spark#2854 except it's on master

Author: Tal Sliwowicz <[email protected]>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2915 at commit d122236.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

Tests timed out after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22087/
Test FAILed.

@andrewor14
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have started for PR 2915 at commit d122236.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have finished for PR 2915 at commit d122236.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22149/
Test PASSed.

@andrewor14
Copy link
Contributor

I'm merging this thanks.

asfgit pushed a commit that referenced this pull request Oct 24, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.1

Author: Tal Sliwowicz <[email protected]>

Closes #2915 from tsliwowicz/branch-1.1-block-mgr-removal and squashes the following commits:

d122236 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
@JoshRosen
Copy link
Contributor

Hi @tsliwowicz,

It looks like this pull request was merged into branch-1.1, but it won't automatically be closed by GitHub since that commit won't land in the master branch. Therefore, do you mind closing this PR? I'd do it myself, but I don't have the permissions. Thanks!

@asfgit asfgit closed this in 06dc1b1 Dec 1, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants