
[SPARK-1667] Jobs never finish successfully once bucket file missing occurred #1383


Closed

sarutak wants to merge 2 commits from the SPARK-1667 branch

Conversation

@sarutak (Member) commented Jul 12, 2014

If a job executes a shuffle, bucket files are created in a temporary directory (named like spark-local-*).
When those bucket files go missing because of a disk failure or any other reason, jobs can no longer execute a shuffle that has the same shuffle id as the missing bucket files.

I think that when an Executor cannot read bucket files from its local directory (spark-local-*), it should abort and be marked as lost.
In this case, the Executor that owns the bucket files throws a FileNotFoundException, so I think we should handle IOException as fatal in Utils.scala so that the Executor aborts.

After I modified the code as follows, an Executor that fetches bucket files from the failed Executor could retry the fetch from another Executor.

def isFatalError(e: Throwable): Boolean = {
  e match {
    // Treat IOException as fatal: a missing bucket file (FileNotFoundException) most
    // likely means the local disk has failed, so the Executor should abort.
    case _: IOException =>
      true
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable =>
      false
    case _ =>
      true
  }
}
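
For context, here is a minimal, hedged sketch of how a check like this typically drives executor behaviour. The object and method names are illustrative assumptions, not Spark's actual executor code.

object TaskErrorHandling {
  // Hedged illustration (assumed names): a per-task error handler that uses a
  // fatality check such as isFatalError above.
  def handleTaskFailure(t: Throwable, isFatal: Throwable => Boolean): Unit = {
    if (isFatal(t)) {
      // With this PR's change, the FileNotFoundException thrown for a missing bucket
      // file is an IOException and therefore fatal: rethrowing lets the executor die,
      // so the master can mark it as lost and other executors stop fetching from it.
      throw t
    } else {
      // Non-fatal failures only fail the current task, which can then be retried.
      println(s"Task failed with a non-fatal error: ${t.getMessage}")
    }
  }
}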

@AmplabJenkins

Can one of the admins verify this patch?

@@ -1223,6 +1223,8 @@ private[spark] object Utils extends Logging {
   /** Returns true if the given exception was fatal. See docs for scala.util.control.NonFatal. */
   def isFatalError(e: Throwable): Boolean = {
     e match {
+      case _: IOException =>
Contributor (inline review comment on the line above):

Can you add some inline comment explaining why we are catching this IOException here?

@rxin (Contributor) commented Jul 15, 2014

Thanks for submitting this. Is there any way we can construct a unit test for this as well?

@sarutak (Member, Author) commented Jul 15, 2014

OK. I will add a comment explaining my change.
And I will also add a test case for this issue to FailureSuite.scala. Is that the proper place?

@rxin (Contributor) commented Jul 15, 2014

That's a good place to add it. Thanks!
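
As a rough illustration, a unit-level test of the changed behaviour might look like the sketch below. This is only an assumption of what such a test could check; it is not the test actually proposed for FailureSuite.scala, and the suite and test names are made up.

package org.apache.spark

import java.io.IOException

import org.scalatest.FunSuite

import org.apache.spark.util.Utils

// Hypothetical sketch, not the actual test added for SPARK-1667.
class IsFatalErrorSuite extends FunSuite {
  test("IOException is treated as fatal") {
    assert(Utils.isFatalError(new IOException("simulated missing bucket file")))
  }

  test("an ordinary RuntimeException is still non-fatal") {
    assert(!Utils.isFatalError(new RuntimeException("ordinary task failure")))
  }
}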

@sarutak (Member, Author) commented Jul 17, 2014

My PR handles IOException as fatal, but I don't think that is good because IOException is not always fatal.
The problem I want to solve is the IOException thrown when writing to or reading from the local directory managed by DiskBlockManager fails.
In such a case, the failure is almost always caused by a disk fault.
So, is it better to exit only when reading from or writing to the local directory fails?
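
To make the idea concrete, here is a minimal sketch of that narrower approach, assuming a hypothetical wrapper around only the reads and writes that touch the DiskBlockManager-managed spark-local-* directories. The wrapper name and the exit-on-failure handling are my assumptions, not code from this PR.

import java.io.IOException

object LocalDiskIO {
  // Hypothetical wrapper (assumption): only reads and writes against the
  // spark-local-* directories go through here, so an IOException can be treated as a
  // probable disk fault without making every IOException in the application fatal.
  def withLocalDiskIO[T](description: String)(body: => T): T = {
    try {
      body
    } catch {
      case e: IOException =>
        // Assumed handling, mirroring the suggestion above: treat this as a disk fault
        // and shut the executor down so the master marks it as lost and reschedules
        // its tasks on healthy executors.
        System.err.println(s"Local disk failure during $description: ${e.getMessage}")
        sys.exit(1)
    }
  }
}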

@sarutak (Member, Author) commented Jul 17, 2014

@rxin, I noticed some related issues.
In the following three situations, which are probably caused by disk faults, the executor doesn't stop, so tasks assigned to that executor always fail.

Should we exit the executor in those situations?

@rxin (Contributor) commented Jul 29, 2014

Sorry to come back to this after a while. Disk faults can be transient as well, right? I'm not sure we'd want to exit the executor simply because of one disk fault.

@sarutak (Member, Author) commented Jul 29, 2014

@rxin Thank you for your comment.
On second thought, it's not a good solution, and I noticed the root cause of this issue is that FetchFailedException is not thrown when a local fetch fails.
#1578 may be a better solution.
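
As a rough sketch of that idea (not the actual change in #1578; all names below are illustrative stand-ins rather than Spark's real classes): when reading a local shuffle block fails, the failure is reported as a fetch failure instead of a plain IOException, so the scheduler knows the map output is gone and re-runs the stage that produced it.

import java.io.IOException

// Illustrative stand-in for Spark's FetchFailedException (assumed shape, not the real class).
case class LocalFetchFailed(shuffleId: Int, mapId: Int, reduceId: Int, cause: Throwable)
  extends Exception(s"Failed to read local shuffle block $shuffleId/$mapId/$reduceId", cause)

object LocalShuffleRead {
  // Hypothetical helper: wrap the local read so an IOException (e.g. a missing bucket
  // file) becomes a fetch failure that triggers regeneration of the map output,
  // rather than an error that only fails the current reduce task over and over.
  def readLocalBlock(shuffleId: Int, mapId: Int, reduceId: Int)(read: => Array[Byte]): Array[Byte] = {
    try {
      read
    } catch {
      case e: IOException =>
        throw LocalFetchFailed(shuffleId, mapId, reduceId, e)
    }
  }
}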

@rxin (Contributor) commented Jul 29, 2014

Thanks - do you mind closing this one?

@sarutak (Member, Author) commented Jul 29, 2014

OK. Instead, please watch PR #1578.
It may be a solution for this issue.

@asfgit closed this in 2c35666 on Jul 30, 2014
@sarutak deleted the SPARK-1667 branch on April 11, 2015