
[SPARK-1667] Jobs never finish successfully once bucket file missing occurred #1383


Closed

sarutak wants to merge 2 commits from the SPARK-1667 branch

Conversation

@sarutak (Member) commented Jul 12, 2014

If a job executes a shuffle, bucket files are created in a temporary directory (named like spark-local-*).
When those bucket files go missing because of a disk failure or any other reason, jobs can no longer execute a shuffle that has the same shuffle id as the missing bucket files.

I think that when an Executor cannot read bucket files from its local directory (spark-local-*), it should abort and be marked as lost.
In this case, the Executor that owns the bucket files throws a FileNotFoundException, so I think we should handle IOException as fatal in Utils.scala so that the Executor aborts.

After I modified the code as follows, an Executor that fetches bucket files from the failed Executor could retry the fetch from another Executor.

def isFatalError(e: Throwable): Boolean = {
  e match {
    // Treat IOException as fatal: a missing bucket file (FileNotFoundException) most
    // likely means the local disk has failed, so the Executor should abort.
    case _: IOException =>
      true
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable =>
      false
    case _ =>
      true
  }
}
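
For context, here is a minimal, hedged sketch of how a check like this typically drives executor behaviour. The object and method names are illustrative assumptions, not Spark's actual executor code.

object TaskErrorHandling {
  // Hedged illustration (assumed names): a per-task error handler that uses a
  // fatality check such as isFatalError above.
  def handleTaskFailure(t: Throwable, isFatal: Throwable => Boolean): Unit = {
    if (isFatal(t)) {
      // With this PR's change, the FileNotFoundException thrown for a missing bucket
      // file is an IOException and therefore fatal: rethrowing lets the executor die,
      // so the master can mark it as lost and other executors stop fetching from it.
      throw t
    } else {
      // Non-fatal failures only fail the current task, which can then be retried.
      println(s"Task failed with a non-fatal error: ${t.getMessage}")
    }
  }
}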

@AmplabJenkins

Can one of the admins verify this patch?

@@ -1223,6 +1223,8 @@ private[spark] object Utils extends Logging {
   /** Returns true if the given exception was fatal. See docs for scala.util.control.NonFatal. */
   def isFatalError(e: Throwable): Boolean = {
     e match {
+      case _: IOException =>
Contributor (inline review comment on the line above):

Can you add some inline comment explaining why we are catching this IOException here?

@rxin (Contributor) commented Jul 15, 2014

Thanks for submitting this. Is there any way we can construct a unit test for this as well?

@sarutak (Member, Author) commented Jul 15, 2014

OK. I will add a comment explaining my change.
And I will also add a test case for this issue to FailureSuite.scala. Is that the proper place?

@rxin (Contributor) commented Jul 15, 2014

That's a good place to add it. Thanks!
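
As a rough illustration, a unit-level test of the changed behaviour might look like the sketch below. This is only an assumption of what such a test could check; it is not the test actually proposed for FailureSuite.scala, and the suite and test names are made up.

package org.apache.spark

import java.io.IOException

import org.scalatest.FunSuite

import org.apache.spark.util.Utils

// Hypothetical sketch, not the actual test added for SPARK-1667.
class IsFatalErrorSuite extends FunSuite {
  test("IOException is treated as fatal") {
    assert(Utils.isFatalError(new IOException("simulated missing bucket file")))
  }

  test("an ordinary RuntimeException is still non-fatal") {
    assert(!Utils.isFatalError(new RuntimeException("ordinary task failure")))
  }
}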

@sarutak (Member, Author) commented Jul 17, 2014

My PR handles IOException as fatal, but I don't think that is good because IOException is not always fatal.
The problem I want to solve is the IOException thrown when writing to or reading from the local directory managed by DiskBlockManager fails.
In such a case, the failure is almost always caused by a disk fault.
So, is it better to exit only when reading from or writing to the local directory fails?
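
To make the idea concrete, here is a minimal sketch of that narrower approach, assuming a hypothetical wrapper around only the reads and writes that touch the DiskBlockManager-managed spark-local-* directories. The wrapper name and the exit-on-failure handling are my assumptions, not code from this PR.

import java.io.IOException

object LocalDiskIO {
  // Hypothetical wrapper (assumption): only reads and writes against the
  // spark-local-* directories go through here, so an IOException can be treated as a
  // probable disk fault without making every IOException in the application fatal.
  def withLocalDiskIO[T](description: String)(body: => T): T = {
    try {
      body
    } catch {
      case e: IOException =>
        // Assumed handling, mirroring the suggestion above: treat this as a disk fault
        // and shut the executor down so the master marks it as lost and reschedules
        // its tasks on healthy executors.
        System.err.println(s"Local disk failure during $description: ${e.getMessage}")
        sys.exit(1)
    }
  }
}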

@sarutak (Member, Author) commented Jul 17, 2014

@rxin, I noticed some related issues.
In the following three situations, which are probably caused by disk faults, the executor doesn't stop, so tasks assigned to that executor always fail.

Should we exit the executor in those situations?

@rxin (Contributor) commented Jul 29, 2014

Sorry to come back to this after a while. Disk faults can be transient as well, right? I'm not sure we'd want to exit the executor simply because of one disk fault.

@sarutak (Member, Author) commented Jul 29, 2014

@rxin Thank you for your comment.
On second thought, it's not a good solution, and I noticed the root cause of this issue is that FetchFailedException is not thrown when a local fetch fails.
#1578 may be a better solution.
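
As a rough sketch of that idea (not the actual change in #1578; all names below are illustrative stand-ins rather than Spark's real classes): when reading a local shuffle block fails, the failure is reported as a fetch failure instead of a plain IOException, so the scheduler knows the map output is gone and re-runs the stage that produced it.

import java.io.IOException

// Illustrative stand-in for Spark's FetchFailedException (assumed shape, not the real class).
case class LocalFetchFailed(shuffleId: Int, mapId: Int, reduceId: Int, cause: Throwable)
  extends Exception(s"Failed to read local shuffle block $shuffleId/$mapId/$reduceId", cause)

object LocalShuffleRead {
  // Hypothetical helper: wrap the local read so an IOException (e.g. a missing bucket
  // file) becomes a fetch failure that triggers regeneration of the map output,
  // rather than an error that only fails the current reduce task over and over.
  def readLocalBlock(shuffleId: Int, mapId: Int, reduceId: Int)(read: => Array[Byte]): Array[Byte] = {
    try {
      read
    } catch {
      case e: IOException =>
        throw LocalFetchFailed(shuffleId, mapId, reduceId, e)
    }
  }
}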

@rxin (Contributor) commented Jul 29, 2014

Thanks - do you mind closing this one?

@sarutak (Member, Author) commented Jul 29, 2014

OK. Instead, please watch PR #1578.
It may be a solution for this issue.

@asfgit closed this in 2c35666 on Jul 30, 2014
@sarutak deleted the SPARK-1667 branch on April 11, 2015