
[SPARK-9591][CORE]Job may fail for exception during getting remote block #7927


Closed
jeanlyn wants to merge 7 commits into master from jeanlyn:catch_exception

Conversation

@jeanlyn (Contributor) commented Aug 4, 2015

SPARK-9591
When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one that sat idle long enough to be removed by the driver when dynamic resource allocation is enabled) causes the task to fail, and in the worst case the whole job fails.

blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case e: Throwable =>
    logWarning(s"Exception during getting remote block $blockId from $loc", e)
Member

I'm not sure it's valid to catch any throwable here and continue. This can be more targeted, and pushed down into fetchBlockSync?

Contributor

agree with @srowen, otherwise it might hide other evils....

Contributor Author

Thanks @srowen and @CodingCat for the comments!

  • If I am understanding correctly, doGetRemote here will return None when every fetched copy of the block is null, and all the methods that call doGetRemote handle the None case and throw an exception when necessary, so I think it's safe to catch the exception here.
  • fetchBlockSync just calls fetchBlocks to fetch the block, so I think it amounts to the same thing if we catch the exception here.

Contributor

I think the point is this: since you are expecting an IOException (or whatever it is) when one of the remotes goes down, then you should only catch that exception. If we get some other weird random exception, we should probably still throw it, since it might be a sign of a bigger problem.

Also, I don't think simply ignoring the exception is right. If you only get an exception from one location but another location is fine, sure, just forget the exception. But what if you get an exception from all locations? Then you should still throw an exception. You could do something like what is done in askWithRetry.
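For reference, this is roughly the pattern askWithRetry follows (a paraphrase from memory, not the actual Spark source): remember the most recent non-fatal failure, retry up to a limit, and only rethrow once every attempt has failed. The names maxAttempts and sendRequest are placeholders for illustration.

import scala.util.control.NonFatal

def askWithRetrySketch[T](maxAttempts: Int)(sendRequest: () => T): T = {
  var lastException: Throwable = null
  var attempt = 0
  while (attempt < maxAttempts) {
    attempt += 1
    try {
      return sendRequest()          // success: stop retrying
    } catch {
      case NonFatal(e) =>
        lastException = e           // remember the failure and try the next attempt
    }
  }
  // Every attempt failed: rethrow rather than silently returning nothing.
  throw new RuntimeException(s"Request failed after $maxAttempts attempts", lastException)
}

The fix discussed below applies the same idea across block locations rather than repeated attempts against a single endpoint.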

Contributor Author

Makes sense, I will fix it later. Thanks a lot, @squito!

Contributor

@squito So agreed, we should do it like askWithRetry. If we can get the block from any remote store successfully, the fetch succeeds; we should not break the working path the moment we hit the first exception.

So maybe we need to catch all kinds of exceptions (not only IOException). If some attempts fail, we log the exception information but continue fetching. When we reach the final location and it still throws an exception, we need to throw a NEW exception to signal that all attempts failed (i.e., no location was available), and perhaps include the last exception's information in this NEW exception.

But if we only handle IOException, then hitting some other type of exception at a particular location would still break the entire workflow (fetching data from the remaining locations where possible).

What do you think?

Contributor

Yes, agreed. Sorry for seeing this so late (I commented a little more down below).

@squito (Contributor) commented Aug 5, 2015

Thanks for updating, @jeanlyn. Sorry that I didn't fully understand the issue earlier and for potentially changing the desired outcome on you.

@jeanlyn (Contributor, Author) commented Aug 7, 2015

Thanks everyone for the review! I updated the code. Now doGetRemote tolerates the exception as long as there are still locations left from which to fetch the block, so the workflow is not broken, and a BlockFetchException (since this is behavior specific to doGetRemote, I use a new exception and wrap the last exception in it; any suggestions?) is thrown when no locations remain. A minimal sketch of the idea follows.
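A rough sketch (not the exact patch) of the behavior described above, assuming locations, blockTransferService, blockId and logWarning as they exist in BlockManager, and the one-argument BlockFetchException shown just below; the final version also carries a message string, per the test output further down.

import scala.util.control.NonFatal

def doGetRemoteSketch(): Option[java.nio.ByteBuffer] = {
  var attemptTimes = 0
  for (loc <- locations) {
    attemptTimes += 1
    try {
      val data = blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
      if (data != null) {
        return Some(data)                 // got the block; stop trying other locations
      }
    } catch {
      case NonFatal(e) if attemptTimes < locations.size =>
        // Other locations remain: log the failure and keep going instead of failing the task.
        logWarning(s"Failed to fetch remote block $blockId from $loc " +
          s"(attempt $attemptTimes of ${locations.size})", e)
      case NonFatal(e) =>
        // No locations left: surface the last failure instead of swallowing it.
        throw BlockFetchException(e)
    }
  }
  None                                    // every location returned null data
}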

package org.apache.spark.storage

private[spark]
case class BlockFetchException(throwable: Throwable) extends Exception(throwable)
Contributor

this should extend SparkException

@andrewor14 (Contributor) commented:

ok to test

blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case t: Throwable if attemptTimes < locations.size - 1 =>
Contributor

Please use scala.util.control.NonFatal(e) instead. We don't want to catch OOMs here.
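For context, scala.util.control.NonFatal is an extractor that matches ordinary exceptions but deliberately excludes fatal ones, which is why it addresses the OOM concern. A tiny illustration (the helper name is just for this example):

import scala.util.control.NonFatal

// NonFatal matches ordinary exceptions but not fatal JVM errors.
def wouldBeCaught(t: Throwable): Boolean = t match {
  case NonFatal(_) => true    // e.g. IOException, RuntimeException
  case _           => false   // e.g. OutOfMemoryError, InterruptedException
}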

@andrewor14 (Contributor) commented:

@jeanlyn can you rename the title of this patch and the issue to remove references to "broadcast"? I believe this is applicable to all blocks in general.

@SparkQA commented Sep 2, 2015

Test build #41899 has finished for PR 7927 at commit 75db334.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@jeanlyn jeanlyn changed the title [SPARK-9591][CORE]Job may fail for exception during getting broadcast variable [SPARK-9591][CORE]Job may fail for exception during getting remote block Sep 2, 2015
@SparkQA commented Sep 2, 2015

Test build #41907 has finished for PR 7927 at commit 6c6d53d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@jeanlyn (Contributor, Author) commented Sep 2, 2015

It seems that the failure is not related.

@andrewor14 (Contributor) commented:

retest this please

@SparkQA commented Sep 3, 2015

Test build #41960 has finished for PR 7927 at commit 6c6d53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@andrewor14 (Contributor) commented:

LGTM, merging into master. Thanks everyone.

@asfgit asfgit closed this in db4c130 Sep 3, 2015
@jeanlyn jeanlyn deleted the catch_exception branch September 5, 2015 10:07
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Oct 14, 2015
[SPARK-9591][CORE]Job may fail for exception during getting remote block

[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one that sat idle long enough to be removed by the driver when dynamic resource allocation is enabled) causes the task to fail, and in the worst case the whole job fails.

Author: jeanlyn <[email protected]>

Closes apache#7927 from jeanlyn/catch_exception.
@sprite311 commented:

I have this problem in Spark 1.3.0. Are there any other solutions? I can't upgrade Spark to 1.6.

@GraceH (Contributor) commented Aug 22, 2016

@sprite311 According to my understanding, this patch catches certain exceptions that show up when the user enables dynamic allocation. One quick solution is to disable dynamic allocation if possible, which avoids the exception (the downside is giving up that feature, introduced since 1.3). Another option is to catch the exception yourself (if you can modify your 1.3 deployment). I am not sure whether either solution works for you; a sketch of the first one follows.
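For reference, a minimal sketch of the first workaround, assuming you control how the SparkConf is built; the same flag can also go into spark-defaults.conf or be passed with --conf on spark-submit. The application name is just a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Disable dynamic allocation so executors are not removed for being idle,
// avoiding fetches from block managers the driver has already dropped.
val conf = new SparkConf()
  .setAppName("example-app")                        // placeholder name
  .set("spark.dynamicAllocation.enabled", "false")

val sc = new SparkContext(conf)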
