[SPARK-1667] Jobs never finish successfully once bucket file missing occurred #1383
Conversation
Can one of the admins verify this patch?
@@ -1223,6 +1223,8 @@ private[spark] object Utils extends Logging {
   /** Returns true if the given exception was fatal. See docs for scala.util.control.NonFatal. */
   def isFatalError(e: Throwable): Boolean = {
     e match {
+      case _: IOException =>
Can you add an inline comment explaining why we are catching IOException here?
Thanks for submitting this. Is there any way we can construct a unit test for this as well?
OK. I will add a comment explaining my change.
That's a good place to add it. Thanks!
My PR treats IOException as fatal, but I don't think that's quite right, because IOException is not always fatal.
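For reference, `scala.util.control.NonFatal` does match `IOException`, which is why the patch has to special-case it to change the verdict. A minimal sketch demonstrating the default classification (the variable names here are illustrative, not from the patch):

```scala
import java.io.IOException
import scala.util.control.NonFatal

object NonFatalCheck extends App {
  // NonFatal matches IOException, so without the patch a missing
  // bucket file would be classified as non-fatal and the executor
  // would keep running with broken shuffle output.
  val treatedAsFatal = new IOException("bucket file missing") match {
    case NonFatal(_) => false
    case _           => true
  }
  println(s"IOException treated as fatal by default: $treatedAsFatal") // false
}
```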
@rxin, I noticed some issues related to this one. Should we exit the executor in this situation?
Sorry to come back to this after a while. Disk faults can be transient as well, right? I'm not sure we'd want to exit the executor simply because of one disk fault.
Thanks - do you mind closing this one?
OK. Please watch PR #1578 instead.
When jobs execute a shuffle, bucket files are created in a temporary directory (named like spark-local-*).
When those bucket files go missing, whether from disk failure or any other cause, jobs can no longer execute a shuffle that uses the same shuffle id as the missing bucket files.
I think that when executors cannot read bucket files from their local directory (spark-local-*), they should abort and be marked as lost.
In this case, the executor that holds the bucket files throws a FileNotFoundException, so I think we should treat IOException as fatal in Utils.scala so that the executor aborts.
After I modified the code as follows, an executor fetching bucket files from a failed executor could retry the fetch from another executor.
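A minimal sketch of what the patched method might look like. Only the `case _: IOException` arm is confirmed by the diff above; the remaining arms are a reconstruction based on the method's doc comment about `scala.util.control.NonFatal`, so treat them as assumptions rather than the exact upstream code:

```scala
import java.io.IOException
import scala.util.control.{ControlThrowable, NonFatal}

private[spark] object Utils {
  /** Returns true if the given exception was fatal. See docs for scala.util.control.NonFatal. */
  def isFatalError(e: Throwable): Boolean = {
    e match {
      // A lost bucket file surfaces as FileNotFoundException (a subclass of
      // IOException); treating it as fatal makes the executor abort, so the
      // missing shuffle output can be recomputed on another executor.
      case _: IOException =>
        true
      case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable =>
        false
      case _ =>
        true
    }
  }
}
```

As discussed above, the trade-off is that IOException also covers transient faults, which is why this approach was ultimately dropped in favor of PR #1578.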