[SPARK-6980] [CORE] Akka timeout exceptions indicate which conf controls them (RPC Layer) #6205

Status: Closed. Wants to merge 41 commits.

Commits
97523e0  [SPARK-6980] Akka ask timeout description refactored to RPC layer (BryanCutler, May 15, 2015)
78a2c0a  [SPARK-6980] Using RpcTimeout.awaitResult for future in AppClient now (BryanCutler, May 16, 2015)
5b59a44  [SPARK-6980] Added some RpcTimeout unit tests (BryanCutler, May 16, 2015)
49f9f04  [SPARK-6980] Minor cleanup and scala style fix (BryanCutler, May 16, 2015)
23d2f26  [SPARK-6980] Fixed await result not being handled by RpcTimeout (BryanCutler, May 18, 2015)
a294569  [SPARK-6980] Added creation of RpcTimeout with Seq of property keys (BryanCutler, May 19, 2015)
f74064d  Retrieving properties from property list using iterator and while loo… (May 21, 2015)
0ee5642  Changing the loop condition to halt at the first match in the propert… (May 21, 2015)
4be3a8d  Modifying loop condition to find property match (May 24, 2015)
b7fb99f  Merge pull request #2 from hardmettle/configTimeoutUpdates_6980 (BryanCutler, May 24, 2015)
c07d05c  Merge branch 'master' into configTimeout-6980-tmp (BryanCutler, Jun 3, 2015)
235919b  [SPARK-6980] Resolved conflicts after master merge (BryanCutler, Jun 3, 2015)
2f94095  [SPARK-6980] Added addMessageIfTimeout for when a Future is completed… (BryanCutler, Jun 4, 2015)
1607a5f  [SPARK-6980] Changed addMessageIfTimeout to PartialFunction, cleanup … (BryanCutler, Jun 8, 2015)
4351c48  [SPARK-6980] Added UT for addMessageIfTimeout, cleaned up UTs (BryanCutler, Jun 10, 2015)
7774d56  [SPARK-6980] Cleaned up UT imports (BryanCutler, Jun 11, 2015)
995d196  [SPARK-6980] Cleaned up import ordering, comments, spacing from PR fe… (BryanCutler, Jun 11, 2015)
d3754d1  [SPARK-6980] Added akkaConf to prevent dead letter logging (BryanCutler, Jun 11, 2015)
08f5afc  [SPARK-6980] Added UT for constructing RpcTimeout with default value (BryanCutler, Jun 11, 2015)
1b9beab  [SPARK-6980] Cleaned up import ordering (BryanCutler, Jun 12, 2015)
2206b4d  [SPARK-6980] Added unit test for ask then immediat awaitReply (BryanCutler, Jun 12, 2015)
1517721  [SPARK-6980] RpcTimeout object scope should be private[spark] (BryanCutler, Jun 15, 2015)
1394de6  [SPARK-6980] Moved MessagePrefix to createRpcTimeoutException directly (BryanCutler, Jun 15, 2015)
c6cfd33  [SPARK-6980] Changed UT ask message timeout to explicitly intercept a… (BryanCutler, Jun 23, 2015)
b05d449  [SPARK-6980] Changed constructor to use val duration instead of gette… (BryanCutler, Jun 23, 2015)
fa6ed82  [SPARK-6980] Had to increase timeout on positive test case because a … (BryanCutler, Jun 23, 2015)
fadaf6f  [SPARK-6980] Put back in deprecated RpcUtils askTimeout and lookupTim… (BryanCutler, Jun 24, 2015)
218aa50  [SPARK-6980] Corrected issues from feedback (BryanCutler, Jun 24, 2015)
039afed  [SPARK-6980] Corrected import organization (BryanCutler, Jun 24, 2015)
be11c4e  Merge branch 'master' into configTimeout-6980 (BryanCutler, Jun 24, 2015)
7636189  [SPARK-6980] Fixed call to askWithReply in DAGScheduler to use RpcTim… (BryanCutler, Jun 26, 2015)
3a168c7  [SPARK-6980] Rewrote Akka RpcTimeout UTs in RpcEnvSuite (BryanCutler, Jun 26, 2015)
3d8b1ff  [SPARK-6980] Cleaned up imports in AkkaRpcEnvSuite (BryanCutler, Jun 26, 2015)
287059a  [SPARK-6980] Removed extra import in AkkaRpcEnvSuite (BryanCutler, Jun 26, 2015)
7f4d78e  [SPARK-6980] Fixed scala style checks (BryanCutler, Jun 26, 2015)
6a1c50d  [SPARK-6980] Minor cleanup of test case (BryanCutler, Jun 27, 2015)
4e89c75  [SPARK-6980] Missed one usage of deprecated RpcUtils.askTimeout in Ya… (BryanCutler, Jun 30, 2015)
dbd5f73  [SPARK-6980] Changed RpcUtils askRpcTimeout and lookupRpcTimeout scop… (BryanCutler, Jul 1, 2015)
7bb70f1  Merge branch 'master' into configTimeout-6980 (BryanCutler, Jul 1, 2015)
06afa53  [SPARK-6980] RpcTimeout class extends Serializable, was causing error… (BryanCutler, Jul 2, 2015)
46c8d48  [SPARK-6980] Changed RpcEnvSuite test to never reply instead of just … (BryanCutler, Jul 2, 2015)
Files changed

core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerWebUI.scala
@@ -38,7 +38,7 @@ class WorkerWebUI(
extends WebUI(worker.securityMgr, requestedPort, worker.conf, name = "WorkerUI")
with Logging {

- private[ui] val timeout = RpcUtils.askTimeout(worker.conf)
+ private[ui] val timeout = RpcUtils.askRpcTimeout(worker.conf)

initialize()

17 changes: 10 additions & 7 deletions core/src/main/scala/org/apache/spark/rpc/RpcEndpointRef.scala
@@ -17,8 +17,7 @@

package org.apache.spark.rpc

- import scala.concurrent.{Await, Future}
- import scala.concurrent.duration.FiniteDuration
+ import scala.concurrent.Future
import scala.reflect.ClassTag

import org.apache.spark.util.RpcUtils
@@ -32,7 +31,7 @@ private[spark] abstract class RpcEndpointRef(@transient conf: SparkConf)

private[this] val maxRetries = RpcUtils.numRetries(conf)
private[this] val retryWaitMs = RpcUtils.retryWaitMs(conf)
- private[this] val defaultAskTimeout = RpcUtils.askTimeout(conf)
+ private[this] val defaultAskTimeout = RpcUtils.askRpcTimeout(conf)

/**
* return the address for the [[RpcEndpointRef]]
@@ -52,7 +51,7 @@ private[spark] abstract class RpcEndpointRef(@transient conf: SparkConf)
*
* This method only sends the message once and never retries.
*/
- def ask[T: ClassTag](message: Any, timeout: FiniteDuration): Future[T]
+ def ask[T: ClassTag](message: Any, timeout: RpcTimeout): Future[T]

/**
* Send a message to the corresponding [[RpcEndpoint.receiveAndReply)]] and return a [[Future]] to
@@ -91,15 +90,15 @@ private[spark] abstract class RpcEndpointRef(@transient conf: SparkConf)
* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
- def askWithRetry[T: ClassTag](message: Any, timeout: FiniteDuration): T = {
+ def askWithRetry[T: ClassTag](message: Any, timeout: RpcTimeout): T = {
// TODO: Consider removing multiple attempts
var attempts = 0
var lastException: Exception = null
while (attempts < maxRetries) {
attempts += 1
try {
val future = ask[T](message, timeout)
- val result = Await.result(future, timeout)
+ val result = timeout.awaitResult(future)
if (result == null) {
throw new SparkException("Actor returned null")
}
@@ -110,10 +109,14 @@ private[spark] abstract class RpcEndpointRef(@transient conf: SparkConf)
lastException = e
logWarning(s"Error sending message [message = $message] in $attempts attempts", e)
}
- Thread.sleep(retryWaitMs)

+ if (attempts < maxRetries) {
+   Thread.sleep(retryWaitMs)
+ }
}

throw new SparkException(
s"Error sending message [message = $message]", lastException)
}

}
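
For illustration, here is the retry pattern above as a self-contained sketch: sleep only between attempts, and re-tag a timeout with the property that governs it. Everything in it (askWithRetrySketch, the hard-coded 2-second wait, the spark.rpc.askTimeout label) is a stand-in for this example, not the PR's code.

import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-alone sketch of the retry loop; not Spark's RpcEndpointRef.
def askWithRetrySketch[T](call: () => Future[T], maxRetries: Int, retryWaitMs: Long): T = {
  var attempts = 0
  var lastException: Exception = null
  while (attempts < maxRetries) {
    attempts += 1
    try {
      // Stand-in for timeout.awaitResult(future): block, then re-tag timeouts
      return Await.result(call(), 2.seconds)
    } catch {
      case te: TimeoutException =>
        // Mimics RpcTimeoutException: name the conf key that controls the timeout
        val wrapped = new TimeoutException(
          te.getMessage + ". This timeout is controlled by spark.rpc.askTimeout")
        wrapped.initCause(te)
        lastException = wrapped
      case e: Exception =>
        lastException = e
    }
    // The fix in this hunk: no pointless sleep after the final attempt
    if (attempts < maxRetries) {
      Thread.sleep(retryWaitMs)
    }
  }
  throw new RuntimeException(s"Error sending message in $attempts attempts", lastException)
}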
112 changes: 109 additions & 3 deletions core/src/main/scala/org/apache/spark/rpc/RpcEnv.scala
@@ -18,8 +18,10 @@
package org.apache.spark.rpc

import java.net.URI
+ import java.util.concurrent.TimeoutException

- import scala.concurrent.{Await, Future}
+ import scala.concurrent.{Awaitable, Await, Future}

Comment (Contributor):
nit: you don't need to import FiniteDuration explicitly since you're importing duration._. Also you are supposed to order direct class imports before package imports (not just alphabetically), so it should be:

import scala.concurrent.{Awaitable, Await, Future}
import scala.concurrent.duration._

The IntelliJ import organizer will get this wrong, but Aaron Davidson wrote a plugin which does it right -- there are instructions for using it here: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports

Reply (Member, author):
Thanks, using the plugin now

import scala.concurrent.duration._
import scala.language.postfixOps

import org.apache.spark.{SecurityManager, SparkConf}
@@ -66,7 +68,7 @@ private[spark] object RpcEnv {
*/
private[spark] abstract class RpcEnv(conf: SparkConf) {

- private[spark] val defaultLookupTimeout = RpcUtils.lookupTimeout(conf)
+ private[spark] val defaultLookupTimeout = RpcUtils.lookupRpcTimeout(conf)

/**
* Return RpcEndpointRef of the registered [[RpcEndpoint]]. Will be used to implement
@@ -94,7 +96,7 @@ private[spark] abstract class RpcEnv(conf: SparkConf) {
* Retrieve the [[RpcEndpointRef]] represented by `uri`. This is a blocking action.
*/
def setupEndpointRefByURI(uri: String): RpcEndpointRef = {
- Await.result(asyncSetupEndpointRefByURI(uri), defaultLookupTimeout)
+ defaultLookupTimeout.awaitResult(asyncSetupEndpointRefByURI(uri))
}

/**
@@ -184,3 +186,107 @@ private[spark] object RpcAddress {
RpcAddress(host, port)
}
}


/**
* An exception thrown if RpcTimeout modifies a [[TimeoutException]].
*/
private[rpc] class RpcTimeoutException(message: String, cause: TimeoutException)
extends TimeoutException(message) { initCause(cause) }


/**
* Associates a timeout with a description so that when a TimeoutException occurs, additional
* context about the timeout can be amended to the exception message.
* @param duration timeout duration in seconds
* @param timeoutProp the configuration property that controls this timeout
*/
private[spark] class RpcTimeout(val duration: FiniteDuration, val timeoutProp: String)
extends Serializable {

/** Amends the standard message of TimeoutException to include the description */
private def createRpcTimeoutException(te: TimeoutException): RpcTimeoutException = {
new RpcTimeoutException(te.getMessage() + ". This timeout is controlled by " + timeoutProp, te)
}

/**
* PartialFunction to match a TimeoutException and add the timeout description to the message
*
* @note This can be used in the recover callback of a Future to add to a TimeoutException
* Example:
* val timeout = new RpcTimeout(5 millis, "short timeout")
* Future(throw new TimeoutException).recover(timeout.addMessageIfTimeout)
*/
def addMessageIfTimeout[T]: PartialFunction[Throwable, T] = {
// The exception has already been converted to a RpcTimeoutException so just raise it
case rte: RpcTimeoutException => throw rte
// Any other TimeoutException gets converted to an RpcTimeoutException with a modified message
case te: TimeoutException => throw createRpcTimeoutException(te)
}

/**
* Wait for the completed result and return it. If the result is not available within this
* timeout, throw a [[RpcTimeoutException]] to indicate which configuration controls the timeout.
* @param awaitable the `Awaitable` to be awaited
* @throws RpcTimeoutException if after waiting for the specified time `awaitable`
* is still not ready
*/
def awaitResult[T](awaitable: Awaitable[T]): T = {
try {
Await.result(awaitable, duration)
} catch addMessageIfTimeout
}
}


private[spark] object RpcTimeout {

/**
* Lookup the timeout property in the configuration and create
* a RpcTimeout with the property key in the description.
* @param conf configuration properties containing the timeout
* @param timeoutProp property key for the timeout in seconds
* @throws NoSuchElementException if property is not set
*/
def apply(conf: SparkConf, timeoutProp: String): RpcTimeout = {
val timeout = { conf.getTimeAsSeconds(timeoutProp) seconds }
new RpcTimeout(timeout, timeoutProp)
}

/**
* Lookup the timeout property in the configuration and create
* a RpcTimeout with the property key in the description.
* Uses the given default value if property is not set
* @param conf configuration properties containing the timeout
* @param timeoutProp property key for the timeout in seconds
* @param defaultValue default timeout value in seconds if property not found
*/
def apply(conf: SparkConf, timeoutProp: String, defaultValue: String): RpcTimeout = {
val timeout = { conf.getTimeAsSeconds(timeoutProp, defaultValue) seconds }

Comment (Member, author):
Should this be conf.getTimeAsMs instead?

Reply (Contributor):
As mentioned above, I don't think so -- let's stick with the current behavior of using conf.getTimeAsSeconds for now.

new RpcTimeout(timeout, timeoutProp)
}

/**
* Lookup prioritized list of timeout properties in the configuration
* and create a RpcTimeout with the first set property key in the
* description.
* Uses the given default value if property is not set
* @param conf configuration properties containing the timeout
* @param timeoutPropList prioritized list of property keys for the timeout in seconds
* @param defaultValue default timeout value in seconds if no properties found
*/
def apply(conf: SparkConf, timeoutPropList: Seq[String], defaultValue: String): RpcTimeout = {

Comment (Member, author):
Creating RpcTimeout from a prioritized list of property keys, from @squito's previous request.

require(timeoutPropList.nonEmpty)

// Find the first set property or use the default value with the first property
val itr = timeoutPropList.iterator
var foundProp: Option[(String, String)] = None
while (itr.hasNext && foundProp.isEmpty){
val propKey = itr.next()
conf.getOption(propKey).foreach { prop => foundProp = Some(propKey, prop) }
}
val finalProp = foundProp.getOrElse(timeoutPropList.head, defaultValue)

Comment (Member):
The above lines can be replaced by a single line: val finalProp = timeoutPropList.flatMap(key => conf.getOption(key).map(value => key -> value)).headOption.getOrElse(timeoutPropList.head -> defaultValue).

Comment (Member):
Never mind. Looks like it's too long.

val timeout = { Utils.timeStringAsSeconds(finalProp._2) seconds }
new RpcTimeout(timeout, finalProp._1)
}
}
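
A REPL-style sketch of how the three factory variants behave. The conf keys match the ones Spark ends up wiring in through RpcUtils (spark.rpc.askTimeout, spark.rpc.lookupTimeout, spark.network.timeout), but the call site is hypothetical: RpcTimeout is private[spark], so code like this only compiles from inside the org.apache.spark package.

import scala.concurrent.Promise
import org.apache.spark.SparkConf
import org.apache.spark.rpc.RpcTimeout

val conf = new SparkConf().set("spark.rpc.askTimeout", "10s")

// Single key; throws NoSuchElementException if the property is not set
val t1 = RpcTimeout(conf, "spark.rpc.askTimeout")

// Single key, falling back to a default when the property is not set
val t2 = RpcTimeout(conf, "spark.rpc.lookupTimeout", "120s")

// Prioritized list: the first key that is set wins, and that key is the one
// a resulting RpcTimeoutException will name
val t3 = RpcTimeout(conf, Seq("spark.rpc.askTimeout", "spark.network.timeout"), "120s")

// A promise that never completes: awaitResult throws an RpcTimeoutException
// after 10 seconds whose message ends with
// "This timeout is controlled by spark.rpc.askTimeout"
t3.awaitResult(Promise[Int]().future)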
15 changes: 9 additions & 6 deletions core/src/main/scala/org/apache/spark/rpc/akka/AkkaRpcEnv.scala
@@ -20,7 +20,6 @@ package org.apache.spark.rpc.akka
import java.util.concurrent.ConcurrentHashMap

import scala.concurrent.Future
- import scala.concurrent.duration._
import scala.language.postfixOps
import scala.reflect.ClassTag
import scala.util.control.NonFatal
@@ -214,8 +213,11 @@ private[spark] class AkkaRpcEnv private[akka] (

override def asyncSetupEndpointRefByURI(uri: String): Future[RpcEndpointRef] = {
import actorSystem.dispatcher
- actorSystem.actorSelection(uri).resolveOne(defaultLookupTimeout).
-   map(new AkkaRpcEndpointRef(defaultAddress, _, conf))
+ actorSystem.actorSelection(uri).resolveOne(defaultLookupTimeout.duration).

Comment (Contributor):
Though this doesn't follow the same pattern of Await.result, can we catch the timeout here too? (I'm not 100% sure if it's possible...)

Comment:
I feel there are two ways:

  1. If the return type changes from Future[RpcEndpointRef] to RpcEndpointRef then only the application of Await.result is possible, else it will not be possible.
  2. Create one more overloaded function of awaitResult in RpcTimeout whose return type wraps a Future over RpcEndpointRef while returning the result.

In my opinion the 2nd one is the best solution, as the previous one will require modifications wherever the usages are.

Comment (Member, author):
The timeout is a little trickier for Futures. From what I understand, creating the future is non-blocking, so we can't just call Await.result. If the Future isn't fulfilled within the timeout, it's marked as failed, then sometime later if the future is acted upon, the TimeoutException is thrown.

I think we might be able to use the andThen method described here: http://doc.akka.io/docs/akka/snapshot/scala/futures.html. We would need to add a function to RpcTimeout that takes in a Future[T], then applies andThen to it which checks if the future completed with failure, then returns a Future[T]. The only problem I see here is that if the andThen amends a TimeoutException message, then awaitResult is called, the message could be amended twice.

Comment (Contributor):
What you are describing sounds reasonable. We certainly don't want to change the return type or await on the result here.

You could potentially fix the doubly-appended msg by throwing a subclass of TimeoutException, which then you could check for before you amend a msg. However, this is really begging the question -- if there is one timeout passed in here, and another is passed to an Await.result later, how is akka using them both? I suppose it will timeout on whichever one is shorter? I don't see anything very definitive in the docs, perhaps we should confirm on akka-user. That would inform how we should do the error handling.

(Honestly I think it will also be fine to not stress too much about this case, it may not be worth it.)

Comment (Member, author):
I'm going to do some research on this and see what our options are while still keeping the same return type.

+   map(new AkkaRpcEndpointRef(defaultAddress, _, conf)).
+   // this is just in case there is a timeout from creating the future in resolveOne, we want the
+   // exception to indicate the conf that determines the timeout
+   recover(defaultLookupTimeout.addMessageIfTimeout)

Comment (Member, author):
Here is the usage with addMessageIfTimeout as a PartialFunction.

}

override def uriOf(systemName: String, address: RpcAddress, endpointName: String): String = {
@@ -295,8 +297,8 @@ private[akka] class AkkaRpcEndpointRef(
actorRef ! AkkaMessage(message, false)
}

- override def ask[T: ClassTag](message: Any, timeout: FiniteDuration): Future[T] = {
-   actorRef.ask(AkkaMessage(message, true))(timeout).flatMap {
+ override def ask[T: ClassTag](message: Any, timeout: RpcTimeout): Future[T] = {
+   actorRef.ask(AkkaMessage(message, true))(timeout.duration).flatMap {

Comment (Contributor):
Same thing here about catching the timeout.

// The function will run in the calling thread, so it should be short and never block.
case msg @ AkkaMessage(message, reply) =>
if (reply) {
@@ -307,7 +309,8 @@
}
case AkkaFailure(e) =>
Future.failed(e)
- }(ThreadUtils.sameThread).mapTo[T]
+ }(ThreadUtils.sameThread).mapTo[T].
+   recover(timeout.addMessageIfTimeout)(ThreadUtils.sameThread)
}

override def toString: String = s"${getClass.getSimpleName}($actorRef)"
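The thread above settles on recover plus a TimeoutException subclass so the message cannot be amended twice. A dependency-free sketch of just that mechanism; DecoratedTimeoutException is a stand-in for RpcTimeoutException, and nothing here is Spark or Akka API.

import java.util.concurrent.TimeoutException
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// The subclass doubles as the "already decorated" marker.
class DecoratedTimeoutException(msg: String, cause: TimeoutException)
  extends TimeoutException(msg) { initCause(cause) }

def addMessageIfTimeout[T](prop: String): PartialFunction[Throwable, T] = {
  // Already decorated: rethrow unchanged so the suffix is added only once
  case d: DecoratedTimeoutException => throw d
  // A bare timeout: attach the conf key that controls it
  case te: TimeoutException =>
    throw new DecoratedTimeoutException(
      te.getMessage + ". This timeout is controlled by " + prop, te)
}

// Even if both the ask and a later awaitResult apply the handler, the
// message gains the suffix exactly once:
val f: Future[Int] = Future[Int](throw new TimeoutException("Futures timed out"))
  .recover(addMessageIfTimeout[Int]("spark.rpc.lookupTimeout"))
  .recover(addMessageIfTimeout[Int]("spark.rpc.lookupTimeout"))
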
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

@@ -35,6 +35,7 @@ import org.apache.spark.broadcast.Broadcast
import org.apache.spark.executor.TaskMetrics
import org.apache.spark.partial.{ApproximateActionListener, ApproximateEvaluator, PartialResult}
import org.apache.spark.rdd.RDD
+ import org.apache.spark.rpc.RpcTimeout
import org.apache.spark.storage._
import org.apache.spark.unsafe.memory.TaskMemoryManager
import org.apache.spark.util._
@@ -188,7 +189,7 @@ class DAGScheduler(
blockManagerId: BlockManagerId): Boolean = {
listenerBus.post(SparkListenerExecutorMetricsUpdate(execId, taskMetrics))
blockManagerMaster.driverEndpoint.askWithRetry[Boolean](
-   BlockManagerHeartbeat(blockManagerId), 600 seconds)
+   BlockManagerHeartbeat(blockManagerId), new RpcTimeout(600 seconds, "BlockManagerHeartbeat"))
}

// Called by TaskScheduler when an executor fails.
core/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala

@@ -46,7 +46,7 @@ private[spark] abstract class YarnSchedulerBackend(
private val yarnSchedulerEndpoint = rpcEnv.setupEndpoint(
YarnSchedulerBackend.ENDPOINT_NAME, new YarnSchedulerEndpoint(rpcEnv))

- private implicit val askTimeout = RpcUtils.askTimeout(sc.conf)
+ private implicit val askTimeout = RpcUtils.askRpcTimeout(sc.conf)

/**
* Request executors from the ApplicationMaster by specifying the total number desired.
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala

@@ -33,7 +33,7 @@ class BlockManagerMaster(
isDriver: Boolean)
extends Logging {

- val timeout = RpcUtils.askTimeout(conf)
+ val timeout = RpcUtils.askRpcTimeout(conf)

/** Remove a dead executor from the driver endpoint. This is only called on the driver side. */
def removeExecutor(execId: String) {
@@ -106,7 +106,7 @@ class BlockManagerMaster(
logWarning(s"Failed to remove RDD $rddId - ${e.getMessage}}", e)
}(ThreadUtils.sameThread)
if (blocking) {
-   Await.result(future, timeout)
+   timeout.awaitResult(future)
}
}

@@ -118,7 +118,7 @@
logWarning(s"Failed to remove shuffle $shuffleId - ${e.getMessage}}", e)
}(ThreadUtils.sameThread)
if (blocking) {
-   Await.result(future, timeout)
+   timeout.awaitResult(future)
}
}

@@ -132,7 +132,7 @@
s" with removeFromMaster = $removeFromMaster - ${e.getMessage}}", e)
}(ThreadUtils.sameThread)
if (blocking) {
-   Await.result(future, timeout)
+   timeout.awaitResult(future)
}
}

@@ -176,8 +176,8 @@ class BlockManagerMaster(
CanBuildFrom[Iterable[Future[Option[BlockStatus]]],
Option[BlockStatus],
Iterable[Option[BlockStatus]]]]
- val blockStatus = Await.result(
-   Future.sequence[Option[BlockStatus], Iterable](futures)(cbf, ThreadUtils.sameThread), timeout)
+ val blockStatus = timeout.awaitResult(
+   Future.sequence[Option[BlockStatus], Iterable](futures)(cbf, ThreadUtils.sameThread))
if (blockStatus == null) {
throw new SparkException("BlockManager returned null for BlockStatus query: " + blockId)
}
@@ -199,7 +199,7 @@
askSlaves: Boolean): Seq[BlockId] = {
val msg = GetMatchingBlockIds(filter, askSlaves)
val future = driverEndpoint.askWithRetry[Future[Seq[BlockId]]](msg)
- Await.result(future, timeout)
+ timeout.awaitResult(future)
}

/**
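The removal methods above all share one shape: fire the ask, attach a warning callback, and block only when the caller requested it. A minimal stand-alone sketch of that pattern, where removeViaRpc is a hypothetical stand-in for the driver-endpoint ask rather than a Spark API:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for driverEndpoint.askWithRetry; not a Spark API.
def removeViaRpc(rddId: Int): Future[Boolean] = Future { true }

def removeRdd(rddId: Int, blocking: Boolean): Unit = {
  val future = removeViaRpc(rddId)
  // Failures are logged whether or not the caller waits (stand-in for logWarning)
  future.failed.foreach { e =>
    println(s"Failed to remove RDD $rddId - ${e.getMessage}")
  }
  if (blocking) {
    // Stand-in for timeout.awaitResult(future); the real RpcTimeout rethrows
    // a TimeoutException whose message names the controlling conf key
    Await.result(future, 120.seconds)
  }
}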