
[SPARK-9591][CORE]Job may fail for exception during getting remote block #7927


Closed
jeanlyn wants to merge 7 commits into master from jeanlyn:catch_exception

Conversation

@jeanlyn (Contributor) commented Aug 4, 2015

SPARK-9591
When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one that sat idle long enough to be removed by the driver when dynamic resource allocation is enabled) causes the task to fail, and in the worst case the whole job fails.

blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case e: Throwable =>
    logWarning(s"Exception during getting remote block $blockId from $loc", e)
Member

I'm not sure it's valid to catch any throwable here and continue. This can be more targeted, and pushed down into fetchBlockSync?

Contributor

agree with @srowen, otherwise it might hide other evils....

Contributor Author

Thanks @srowen and @CodingCat for the comments!

  • If I am understanding correctly, doGetRemote here will return None when every fetched copy of the block is null, and all the methods that call doGetRemote handle the None case and throw an exception when necessary, so I think it's safe to catch the exception here.
  • fetchBlockSync just calls fetchBlocks to fetch the block, so I think it amounts to the same thing if we catch the exception here.

Contributor

I think the point is this: since you are expecting an IOException (or whatever it is) when one of the remotes goes down, then you should only catch that exception. If we get some other weird random exception, we should probably still throw it, since it might be a sign of a bigger problem.

Also, I don't think simply ignoring the exception is right. If you only get an exception from one location but another location is fine, sure, just forget the exception. But what if you get an exception from all locations? Then you should still throw an exception. You could do something like what is done in askWithRetry.
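For reference, this is roughly the pattern askWithRetry follows (a paraphrase from memory, not the actual Spark source): remember the most recent non-fatal failure, retry up to a limit, and only rethrow once every attempt has failed. The names maxAttempts and sendRequest are placeholders for illustration.

import scala.util.control.NonFatal

def askWithRetrySketch[T](maxAttempts: Int)(sendRequest: () => T): T = {
  var lastException: Throwable = null
  var attempt = 0
  while (attempt < maxAttempts) {
    attempt += 1
    try {
      return sendRequest()          // success: stop retrying
    } catch {
      case NonFatal(e) =>
        lastException = e           // remember the failure and try the next attempt
    }
  }
  // Every attempt failed: rethrow rather than silently returning nothing.
  throw new RuntimeException(s"Request failed after $maxAttempts attempts", lastException)
}

The fix discussed below applies the same idea across block locations rather than repeated attempts against a single endpoint.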

Contributor Author

Makes sense, I will fix it later. Thanks a lot, @squito!

Contributor

@squito So agreed, we should do it like askWithRetry. If we can get the block from any remote store successfully, the fetch succeeds; we should not break the working path the moment we hit the first exception.

So maybe we need to catch all kinds of exceptions (not only IOException). If some attempts fail, we log the exception information but continue fetching. When we reach the final location and it still throws an exception, we need to throw a NEW exception to signal that all attempts failed (i.e., no location was available), and perhaps include the last exception's information in this NEW exception.

But if we only handle IOException, then hitting some other type of exception at a particular location would still break the entire workflow (fetching data from the remaining locations where possible).

What do you think?

Contributor

Yes, agreed. Sorry for seeing this so late (I commented a little more down below).

@squito (Contributor) commented Aug 5, 2015

Thanks for updating, @jeanlyn. Sorry that I didn't fully understand the issue earlier and for potentially changing the desired outcome on you.

@jeanlyn (Contributor, Author) commented Aug 7, 2015

Thanks everyone for the review! I updated the code. Now doGetRemote tolerates the exception as long as there are still locations left from which to fetch the block, so the workflow is not broken, and a BlockFetchException (since this is behavior specific to doGetRemote, I use a new exception and wrap the last exception in it; any suggestions?) is thrown when no locations remain. A minimal sketch of the idea follows.
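A rough sketch (not the exact patch) of the behavior described above, assuming locations, blockTransferService, blockId and logWarning as they exist in BlockManager, and the one-argument BlockFetchException shown just below; the final version also carries a message string, per the test output further down.

import scala.util.control.NonFatal

def doGetRemoteSketch(): Option[java.nio.ByteBuffer] = {
  var attemptTimes = 0
  for (loc <- locations) {
    attemptTimes += 1
    try {
      val data = blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
      if (data != null) {
        return Some(data)                 // got the block; stop trying other locations
      }
    } catch {
      case NonFatal(e) if attemptTimes < locations.size =>
        // Other locations remain: log the failure and keep going instead of failing the task.
        logWarning(s"Failed to fetch remote block $blockId from $loc " +
          s"(attempt $attemptTimes of ${locations.size})", e)
      case NonFatal(e) =>
        // No locations left: surface the last failure instead of swallowing it.
        throw BlockFetchException(e)
    }
  }
  None                                    // every location returned null data
}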

package org.apache.spark.storage

private[spark]
case class BlockFetchException(throwable: Throwable) extends Exception(throwable)
Contributor

this should extend SparkException

@andrewor14 (Contributor) commented:

ok to test

blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case t: Throwable if attemptTimes < locations.size - 1 =>
Contributor

Please use scala.util.control.NonFatal(e) instead. We don't want to catch OOMs here.
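For context, scala.util.control.NonFatal is an extractor that matches ordinary exceptions but deliberately excludes fatal ones, which is why it addresses the OOM concern. A tiny illustration (the helper name is just for this example):

import scala.util.control.NonFatal

// NonFatal matches ordinary exceptions but not fatal JVM errors.
def wouldBeCaught(t: Throwable): Boolean = t match {
  case NonFatal(_) => true    // e.g. IOException, RuntimeException
  case _           => false   // e.g. OutOfMemoryError, InterruptedException
}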

@andrewor14 (Contributor) commented:

@jeanlyn can you rename the title of this patch and the issue to remove references to "broadcast"? I believe this is applicable to all blocks in general.

@SparkQA commented Sep 2, 2015

Test build #41899 has finished for PR 7927 at commit 75db334.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@jeanlyn jeanlyn changed the title [SPARK-9591][CORE]Job may fail for exception during getting broadcast variable [SPARK-9591][CORE]Job may fail for exception during getting remote block Sep 2, 2015
@SparkQA commented Sep 2, 2015

Test build #41907 has finished for PR 7927 at commit 6c6d53d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@jeanlyn (Contributor, Author) commented Sep 2, 2015

It seems that the failure is not related.

@andrewor14 (Contributor) commented:

retest this please

@SparkQA commented Sep 3, 2015

Test build #41960 has finished for PR 7927 at commit 6c6d53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockFetchException(messages: String, throwable: Throwable)

@andrewor14 (Contributor) commented:

LGTM, merging into master. Thanks everyone.

@asfgit asfgit closed this in db4c130 Sep 3, 2015
@jeanlyn jeanlyn deleted the catch_exception branch September 5, 2015 10:07
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Oct 14, 2015
[SPARK-9591][CORE]Job may fail for exception during getting remote block

[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one that sat idle long enough to be removed by the driver when dynamic resource allocation is enabled) causes the task to fail, and in the worst case the whole job fails.

Author: jeanlyn <[email protected]>

Closes apache#7927 from jeanlyn/catch_exception.
@sprite311 commented:

I have this problem in Spark 1.3.0. Are there any other solutions? I can't upgrade Spark to 1.6.

@GraceH (Contributor) commented Aug 22, 2016

@sprite311 According to my understanding, this patch catches certain exceptions that show up when the user enables dynamic allocation. One quick solution is to disable dynamic allocation if possible, which avoids the exception (the downside is giving up that feature, introduced since 1.3). Another option is to catch the exception yourself (if you can modify your 1.3 deployment). I am not sure whether either solution works for you; a sketch of the first one follows.
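For reference, a minimal sketch of the first workaround, assuming you control how the SparkConf is built; the same flag can also go into spark-defaults.conf or be passed with --conf on spark-submit. The application name is just a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Disable dynamic allocation so executors are not removed for being idle,
// avoiding fetches from block managers the driver has already dropped.
val conf = new SparkConf()
  .setAppName("example-app")                        // placeholder name
  .set("spark.dynamicAllocation.enabled", "false")

val sc = new SparkContext(conf)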
