
[SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file. #3579

Closed
wants to merge 3 commits into from

rxin
Contributor

@rxin rxin commented Dec 3, 2014

cc @aarondav @kayousterhout @pwendell

This should go into 1.2?

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24087 has started for PR 3579 at commit 2afaf35.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24087 has finished for PR 3579 at commit 2afaf35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24087/
Test PASSed.


class ShuffleFaultToleranceSuite extends FunSuite {

  test("[SPARK-4085] hash shuffle manager recovers when local shuffle files get deleted") {
Contributor

Just to clarify for my own understanding: this issue is not specific to hash shuffle, right? You chose the hash shuffle manager only because it is the easiest one to delete files from?

Contributor Author

Yeah, this test case was written by @kayousterhout.

@aarondav
Contributor

aarondav commented Dec 3, 2014

It is plausible, though less likely, that the other un-Try'd sections of code such as wrapping with a compressed input stream or deserializer could fail as well. What would happen when this occurs? Does it hang Spark or fail the job?

@rxin
Contributor Author

rxin commented Dec 3, 2014

Spark fails the job there, which makes sense I think.

@aarondav
Contributor

aarondav commented Dec 3, 2014

Yeah, cool, just wanted to make sure we didn't enter into some unrecoverable state.
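
The distinction in the exchange above, between a read that is wrapped in a Try and later steps that are not, can be illustrated with a minimal sketch. This is not Spark's implementation: `BlockId`, `FetchFailed`, `LocalFetchSketch`, and the `read`/`decode` parameters are hypothetical names chosen for illustration, and the sketch only assumes that a captured read failure is later rethrown as a fetch failure while anything outside the Try surfaces as an ordinary exception.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for illustration; these are not Spark's classes.
final case class BlockId(name: String)

final class FetchFailed(val blockId: BlockId, cause: Throwable)
  extends Exception(s"Failed to fetch ${blockId.name}", cause)

object LocalFetchSketch {
  // Producer side: each local read is captured in a Try, so a missing
  // file becomes a Failure value instead of an exception thrown here.
  def fetchLocal(
      ids: Seq[BlockId],
      read: BlockId => Array[Byte]): Iterator[(BlockId, Try[Array[Byte]])] =
    ids.iterator.map(id => id -> Try(read(id)))

  // Consumer side: a captured read failure is rethrown as FetchFailed,
  // the signal a scheduler could use to rerun the map stage.
  def unwrap(result: (BlockId, Try[Array[Byte]])): Array[Byte] = result match {
    case (_, Success(bytes)) => bytes
    case (id, Failure(e))    => throw new FetchFailed(id, e)
  }

  // Not wrapped in Try: an exception thrown by `decode` (standing in for
  // decompression or deserialization) propagates unchanged to the caller.
  def consume[A](result: (BlockId, Try[Array[Byte]]), decode: Array[Byte] => A): A =
    decode(unwrap(result))
}
```

In this sketch a failing `read` ends up as `FetchFailed`, whereas a failing `decode` is seen by the caller as whatever exception it threw, which corresponds to the "fails the job" case described above.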

@aarondav
Contributor

aarondav commented Dec 3, 2014

LGTM


test("[SPARK-4085] hash shuffle manager recovers when local shuffle files get deleted") {
  val conf = new SparkConf(false)
  conf.set("spark.shuffle.manager", "hash")
Contributor

One more follow-up on Aaron's question: if this is an issue for sort too, could you add another test that covers the sort shuffle manager?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24107/
Test FAILed.

@rxin
Contributor Author

rxin commented Dec 3, 2014

I rewrote the test case to cover both sort and hash shuffle.
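
One way a single test body can cover both managers is to generate one case per `spark.shuffle.manager` value. The sketch below is not the rewritten test from this PR; `ShuffleManagerCases` and `cases` are hypothetical names, and a plain `Map` stands in for `SparkConf` so the example runs without Spark on the classpath.

```scala
object ShuffleManagerCases {
  // The two spark.shuffle.manager values discussed in this thread.
  val managers: Seq[String] = Seq("hash", "sort")

  // One (test name, settings) pair per manager, reusing the test-name
  // pattern from the original hash-only test.
  def cases: Seq[(String, Map[String, String])] = managers.map { manager =>
    val name =
      s"[SPARK-4085] $manager shuffle manager recovers when local shuffle files get deleted"
    name -> Map("spark.shuffle.manager" -> manager)
  }
}
```

In a real suite, each pair would presumably be registered with ScalaTest's `test(name) { ... }`, with the settings copied into a `SparkConf` before the context is created.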

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24114 has started for PR 3579 at commit 255b4fd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 4, 2014

Test build #24114 has finished for PR 3579 at commit 255b4fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24114/
Test PASSed.

@pwendell
Contributor

pwendell commented Dec 4, 2014

Thanks @rxin and @aarondav, I'm going to pull this in for the RC.

asfgit pushed a commit that referenced this pull request Dec 4, 2014
…local shuffle file.

cc aarondav kayousterhout pwendell

This should go into 1.2?

Author: Reynold Xin <[email protected]>

Closes #3579 from rxin/SPARK-4085 and squashes the following commits:

255b4fd [Reynold Xin] Updated test.
f9814d9 [Reynold Xin] Code review feedback.
2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.

(cherry picked from commit 1826372)
Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in 1826372 Dec 4, 2014
6 participants