[SPARK-9591][CORE]Job may fail for exception during getting remote block #7927
Conversation
    loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case e: Throwable =>
    logWarning(s"Exception during getting remote block $blockId from $loc", e)
I'm not sure it's valid to catch any throwable here and continue. This can be more targeted, and pushed down into fetchBlockSync?
agree with @srowen, otherwise it might hide other evils....
Thanks @srowen and @CodingCat for the comments!
- If I understand correctly, doGetRemote here will return None when the block fetched from every location is null, and every method that calls doGetRemote handles the None case and throws an exception when necessary, so I think it's safe to catch the exception here.
- fetchBlockSync just calls fetchBlocks to fetch the block, so I think catching the exception here amounts to the same thing.
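For illustration only, a generic sketch of the Option contract described in the comment above (getRemoteBytes, the block id, and the error message are stand-ins, not the actual BlockManager API):

```scala
import java.nio.ByteBuffer

object OptionContractSketch {
  // A doGetRemote-style helper returns None when no location yielded the block.
  def getRemoteBytes(blockId: String): Option[ByteBuffer] =
    None // placeholder: a real implementation would try each remote location

  def main(args: Array[String]): Unit = {
    getRemoteBytes("broadcast_0") match {
      case Some(bytes) => println(s"got ${bytes.remaining()} bytes")
      case None =>
        // Callers decide whether a missing block is an error; here it is.
        throw new IllegalStateException("Failed to get remote block broadcast_0")
    }
  }
}
```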
I think the point is that: since you are expecting an IOException (or whatever it is) when one of the remotes goes down, then you should only catch that exception. If we get some other weird random exception, we should probably still throw it, since it might be some bigger problem.
Also, I don't think simply ignoring the exception is right. If you only get an exception from one location, but then another location is fine, sure, just forget the exception. But what if you get an exception from all locations? Then you should still throw an exception. You could do something like what is done in askWithRetry.
Makes sense, I will fix it later. Thanks a lot, @squito!
@squito So I agree with doing it like askWithRetry. If we can fetch the block from any remote store successfully, the fetch succeeds; we should not break the working path as soon as we hit the first exception.
So maybe we need to catch all kinds of exceptions (not only IOException). If some attempts fail, we log the exception information but continue fetching. When the final location still throws an exception, we throw a NEW exception to signal that all attempts failed (i.e., no location is available), and ideally attach the last exception as its cause.
But if we only handle IOException, hitting some other type of exception for a particular location would still break the whole workflow (fetching data from the remaining locations where possible).
What do you think?
Yes, agreed. Sorry for seeing this so late (commented a little more down below).
Thanks for updating, @jeanlyn. Sorry that I didn't fully understand the issue earlier and for potentially changing the desired outcome on you.
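As a rough illustration of the approach agreed on above, a minimal sketch that tries each location in turn, logs non-fatal failures, and only throws a wrapping exception once every location has failed (Location, fetchOnce, fetchFromAnyLocation, and this standalone BlockFetchException are illustrative names, not the actual Spark internals):

```scala
import java.nio.ByteBuffer

import scala.util.control.NonFatal

object RetryOverLocationsSketch {
  case class Location(host: String, port: Int, executorId: String)

  class BlockFetchException(message: String, cause: Throwable)
    extends Exception(message, cause)

  def fetchFromAnyLocation(
      blockId: String,
      locations: Seq[Location],
      fetchOnce: (Location, String) => ByteBuffer): ByteBuffer = {
    var lastFailure: Throwable = null
    val it = locations.iterator
    while (it.hasNext) {
      val loc = it.next()
      try {
        // The first successful fetch wins; earlier failures were only logged.
        return fetchOnce(loc, blockId)
      } catch {
        case NonFatal(e) =>
          // Remember the failure and move on to the next location.
          lastFailure = e
          println(s"Failed to fetch block $blockId from $loc, trying next location")
      }
    }
    // Every location failed: surface a new exception that carries the last cause.
    throw new BlockFetchException(s"Failed to fetch block $blockId from any location", lastFailure)
  }
}
```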
Force-pushed from 94118a6 to b23f39a.
Thanks everyone for the review! I updated the code.
package org.apache.spark.storage

private[spark]
case class BlockFetchException(throwable: Throwable) extends Exception(throwable)
this should extend SparkException
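A minimal sketch of what the suggested change might look like, assuming the SparkException(message, cause) constructor and keeping the field names from the snippet above plus a message field:

```scala
package org.apache.spark.storage

import org.apache.spark.SparkException

// Sketch only: wraps the underlying cause while reporting through SparkException.
private[spark]
case class BlockFetchException(messages: String, throwable: Throwable)
  extends SparkException(messages, throwable)
```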
ok to test
blockTransferService.fetchBlockSync(
  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
} catch {
  case t: Throwable if attemptTimes < locations.size - 1 =>
please use scala.util.control.NonFatal(e)
instead. We don't want to catch OOMs here.
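A small, self-contained illustration of scala.util.control.NonFatal, which matches ordinary exceptions but deliberately lets fatal errors propagate (the IOException and messages are made up for the demo):

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  def main(args: Array[String]): Unit = {
    try {
      throw new java.io.IOException("simulated remote fetch failure")
    } catch {
      // NonFatal matches ordinary exceptions such as IOException...
      case NonFatal(e) => println(s"caught non-fatal exception: ${e.getMessage}")
    }
    // ...but it does not match fatal errors such as OutOfMemoryError,
    // StackOverflowError, or control throwables, so those still propagate.
  }
}
```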
@jeanlyn can you rename the title of this patch and the issue to remove references to "broadcast"? I believe this is applicable to all blocks in general.
Test build #41899 has finished for PR 7927 at commit
Test build #41907 has finished for PR 7927 at commit
It seems that the failure is not related.
retest this please
Test build #41960 has finished for PR 7927 at commit
LGTM, merging into master. Thanks everyone.
[SPARK-9591][CORE] Job may fail for exception during getting remote block ([SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591))
When getting a broadcast variable, we can fetch the block from several locations; but currently, connecting to a lost block manager (for example, one idle long enough to be removed by the driver when dynamic resource allocation is used) causes the task to fail, and in the worst case the job to fail.
Author: jeanlyn <[email protected]>
Closes apache#7927 from jeanlyn/catch_exception.
I have this problem in Spark 1.3.0. Are there any other solutions? I can't upgrade Spark to 1.6.
@Sprite331 According to my understanding, this patch catches certain exceptions that appear when dynamic allocation is enabled. One quick solution is to disable dynamic allocation if possible, which avoids the exception (the downside is losing that feature, available since 1.3). Another option is to catch that exception yourself (if you can patch your 1.3 deployment). I am not sure whether either solution works for you.
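If the first workaround fits, a minimal sketch of disabling dynamic allocation programmatically (spark.dynamicAllocation.enabled is the standard Spark property; the application name and surrounding object are illustrative, and the master would come from spark-submit as usual):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DisableDynamicAllocationExample {
  def main(args: Array[String]): Unit = {
    // Turning off dynamic allocation keeps executors (and their block
    // managers) from being removed while idle.
    val conf = new SparkConf()
      .setAppName("disable-dynamic-allocation-example")
      .set("spark.dynamicAllocation.enabled", "false")
    val sc = new SparkContext(conf)
    try {
      // ... run the job as usual ...
    } finally {
      sc.stop()
    }
  }
}
```

The same setting can also be passed on submission, e.g. --conf spark.dynamicAllocation.enabled=false.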