[SPARK-4204][Core][WebUI] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly #3073

Conversation
Test build #22812 has started for PR 3073 at commit
Test build #22812 has finished for PR 3073 at commit
Test PASSed.
```diff
@@ -88,10 +88,11 @@ case class FetchFailed(
 case class ExceptionFailure(
     className: String,
     description: String,
-    stackTrace: Array[StackTraceElement],
+    stackTrace: String,
```
Since this is `@DeveloperApi` (and thus technically public), we can't change it in an incompatible way unless there's a really good reason, and it's even worse because it's a case class (so even providing a backwards-compatible constructor is insufficient). We could, however, do something of this form:
```scala
case class ExceptionFailure(
    className: String,
    description: String,
    stackTrace: Array[StackTraceElement],
    metrics: Option[TaskMetrics])
  extends TaskFailedReason {

  // Original behavior preserved for backwards compatibility.
  private var errorString = Utils.exceptionString(className, description, stackTrace)

  /** Overrides the stack trace message with a full (recursive) stack trace. */
  private[spark] def setFullStackTrace(fullStackTrace: String): this.type = {
    errorString = fullStackTrace
    this
  }

  override def toErrorString: String = errorString
}
```
Yes, the semantics we provide is that the HistoryServer or the Master can read event logs from previous Spark versions. We need to maintain backward compatibility for all classes used in these events to provide that guarantee.
@andrewor14 Yes, parsing JSON must be backward compatible. But here I believe @aarondav was talking about API compatibility.
Yes that's true. Doesn't this change here break both though?
Since it does not write "Stack Trace", I think it might fail to parse on old HistoryServers, unless that's read in only optionally.
@pwendell Would you mind confirming that it is OK to make a breaking change to these messages?
> Since it does not write "Stack Trace", I think it might fail to parse on old HistoryServers, unless that's read in only optionally.
Please take a look at #3073 (comment)
It will handle the old event logs.
I actually meant that the logs written by this guy will fail to be parsed by old history servers. That code looks like a new history server can parse old logs.
> I actually meant that the logs written by this guy will fail to be parsed by old history servers.

Sorry, I didn't realize that. I will update it to support old history servers.
The core changes look good to me, save for the one comment on compatibility. I would rather not review the UI and JsonProtocol changes in detail, though, as I am not so well acclimated to that code. One thing I noticed from the screenshots, though, was that the height of the stack trace box seems relatively small -- is it decided by the height of the cell prior to expansion, or does it just have a small fixed height? We should probably add a minimum height that's significantly larger to make the stack trace more readable. I think we already do something similar for the other stack traces in the UI. @andrewor14 Would you mind reviewing the UI/JSON code? |
Test build #22871 has started for PR 3073 at commit
Test build #22871 has finished for PR 3073 at commit
Test PASSed.
```scala
// backward compatibility
val oldStackTrace = stackTraceFromJson(json \ "Stack Trace")
Utils.exceptionString(className, description, oldStackTrace)
```
This code is used to parse the old JSON event logs.
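For context, a sketch of how reading the new field with a legacy fallback might look in `JsonProtocol`. The `"Full Stack Trace"` field name and the json4s `JString` match are assumptions for illustration; only `stackTraceFromJson`, `Utils.exceptionString`, and the `"Stack Trace"` key appear in the snippet above:

```scala
// Hypothetical sketch: prefer the new flat string field when present, and
// fall back to reconstructing the string from the old structured
// "Stack Trace" array, so event logs from older Spark versions still parse.
val fullStackTrace = (json \ "Full Stack Trace") match {
  case JString(s) => s
  case _ =>
    // backward compatibility with logs written by older Spark versions
    val oldStackTrace = stackTraceFromJson(json \ "Stack Trace")
    Utils.exceptionString(className, description, oldStackTrace)
}
```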
Test build #22933 has started for PR 3073 at commit
Test build #22933 has finished for PR 3073 at commit
Test PASSed.
```diff
@@ -88,10 +88,32 @@ case class FetchFailed(
 case class ExceptionFailure(
     className: String,
     description: String,
-    stackTrace: Array[StackTraceElement],
+    stackTrace: Array[StackTraceElement], // backwards compatibility
```
Hm, this comment doesn't make sense here.
Ah -- we do need @pwendell to sign off on the fact that we broke compatibility in the prior patches for ExecutorLostFailure and FetchFailed
OK, I will hold back on reviewing this until we verify that.
I think it's fine to add new fields to these case classes. It only breaks compatibility for pattern matching. It makes sense to change these to non-case classes in the future so that people don't try to match on them (if we are going to break most uses of matching in the future).
If adding new fields is fine, I would like to add a `fullStackTrace` field like this:

```scala
case class ExceptionFailure(
    className: String,
    description: String,
    stackTrace: Array[StackTraceElement],
    fullStackTrace: String,
    metrics: Option[TaskMetrics])
  extends TaskFailedReason {
```

instead of adding a `var fullStackTrace`. What do you think? @aarondav
@zsxwing To be clear, adding new fields in the constructor actually breaks compatibility for case classes, because you can do things like

```scala
failure match {
  case ExceptionFailure(className, description, stackTrace, metrics) =>
}
```

which would fail to compile when you add a new parameter.

However, Patrick's point was that this behavior makes extending these classes very difficult, so we should make a one-time breaking change where we get rid of the case-class-ness and just use regular classes, where you can add new fields arbitrarily. You can even remove fields if you provide a `def` which fakes them.
In this patch, let's go ahead and break compatibility by adding a new field as you said. Let's additionally file a JIRA to make these normal classes instead of case classes in a later patch, and, if possible, try to get both that and this into 1.2.0 to just embrace the compatibility issues.
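To illustrate the suggestion above, a regular (non-case) class version might look roughly like this. This is only a sketch: the final shape is whatever the follow-up JIRA decides, and `TaskFailedReason`/`TaskMetrics` are simply the types already used in the snippets in this thread:

```scala
// Sketch: as a plain class there is no compiler-generated unapply, so user
// code cannot pattern match on the constructor arity, which means new
// fields can be added later without breaking source compatibility.
class ExceptionFailure private[spark] (
    val className: String,
    val description: String,
    val fullStackTrace: String,
    val metrics: Option[TaskMetrics])
  extends TaskFailedReason {

  override def toErrorString: String = fullStackTrace
}
```

A field removed in a future version could similarly be kept alive as a derived `def` so that old call sites still compile.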
Finished my change. Also opened a JIRA for changing case classes: https://issues.apache.org/jira/browse/SPARK-4265
Test build #22974 has started for PR 3073 at commit
Test build #22974 has finished for PR 3073 at commit
Test PASSed.
Test FAILed.
Jenkins, retest this please.
Test build #22990 has started for PR 3073 at commit
Test build #22990 has finished for PR 3073 at commit
Test FAILed.
Jenkins, retest this please.
Test build #22993 has started for PR 3073 at commit
Test build #22993 has finished for PR 3073 at commit
Test PASSed.
Core changes LGTM -- @andrewor14 would you mind completing the review on the UI/listener-related parts?
Yeah, I'll look at this shortly.
```diff
-    message: String)
-  extends Exception(message) {
+    message: String,
+    cause: Throwable)
```
If you make this default to `cause: Throwable = null`, then you can get rid of the first constructor.
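In other words, something like the following sketch. The enclosing class name is not visible in this diff, so `ExampleFetchException` is a placeholder:

```scala
// Sketch: defaulting the cause to null means one constructor covers both
// the one-argument and two-argument call sites, so the separate
// message-only constructor can be dropped.
class ExampleFetchException(
    message: String,
    cause: Throwable = null)
  extends Exception(message, cause)
```

Callers can then write `new ExampleFetchException("msg")` or pass a cause explicitly.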
Left a few minor comments. JSON changes LGTM.
Test build #23034 has started for PR 3073 at commit
```diff
@@ -83,15 +83,48 @@ case class FetchFailed(
  * :: DeveloperApi ::
  * Task failed due to a runtime exception. This is the most common failure case and also captures
  * user program exceptions.
+ *
+ * `stackTrace` contains the stack trace of the exception itself. It still exists for backward
+ * compatibility. It's better that using `this(e: Throwable, metrics: Option[TaskMetrics])` to
```
"It's better to use", but I'll fix this when I merge it. (No action needed on your part)
Thank you.
Test build #23034 has finished for PR 3073 at commit
Test PASSed.
Alright, I'm merging this into master and 1.2. Thanks @zsxwing!
… inner exceptions and make the error information in Web UI more friendly

This PR fixed `Utils.exceptionString` to output the full exception information. However, the stack trace may become very huge, so I also updated the Web UI to collapse the error information by default (display the first line; clicking `+detail` will display the full info).

Here are the screenshots:

Stages: (screenshot)

Details for one stage: (screenshot)

The full information in the gray text field is:

```Java
org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
	at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
	at sun.nio.ch.IOUtil.read(IOUtil.java:166)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	... 1 more
```

/cc aarondav

Author: zsxwing <[email protected]>

Closes #3073 from zsxwing/SPARK-4204 and squashes the following commits:

176d1e3 [zsxwing] Add comments to explain the stack trace difference
ca509d3 [zsxwing] Add fullStackTrace to the constructor of ExceptionFailure
a07057b [zsxwing] Core style fix
dfb0032 [zsxwing] Backward compatibility for old history server
1e50f71 [zsxwing] Update as per review and increase the max height of the stack trace details
94f2566 [zsxwing] Change Utils.exceptionString to contain the inner exceptions and make the error information in Web UI more friendly

(cherry picked from commit 3abdb1b)
Signed-off-by: Andrew Or <[email protected]>