
[SPARK-3015] Block on cleaning tasks to prevent Akka timeouts #1931


Closed
andrewor14 wants to merge 7 commits into apache:master from andrewor14:reference-blocking

Conversation

andrewor14
Contributor

More detail on the issue is described in SPARK-3015, but the TLDR is that if we send too many blocking Akka messages that are dependent on each other in quick succession, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means that if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many RemoveBroadcast messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism by batching cleanup tasks or processing them in a sliding window. In the longer run, we should clean up the whole BlockManager* message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.
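To make the behavior concrete, here is a minimal, self-contained sketch of the pattern this change enables. This is not the actual ContextCleaner code; BlockingCleanerSketch, removeBroadcast, the timeout, and the loop are hypothetical stand-ins used only to contrast firing all removals at once with blocking on each one before sending the next:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingCleanerSketch {
  // Hypothetical stand-in for the ask that eventually fans out a RemoveBroadcast
  // message to every block manager endpoint.
  def removeBroadcast(broadcastId: Long): Future[Unit] = Future {
    Thread.sleep(10) // pretend this is the round trip to the executors
  }

  def cleanBroadcasts(ids: Seq[Long], blocking: Boolean): Unit = {
    ids.foreach { id =>
      val f = removeBroadcast(id)
      // When blocking, wait for this removal to finish before issuing the next one,
      // so dependent messages are never all in flight at the same time.
      // When not blocking, the futures pile up and many RemoveBroadcast-style
      // messages can be outstanding at once (the SPARK-3015 symptom).
      if (blocking) {
        Await.result(f, 30.seconds)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    cleanBroadcasts((1L to 20L).toSeq, blocking = true)
    println("done cleaning")
  }
}

The trade-off is that blocking serializes the cleanup work, which is why the description above treats this as a stopgap rather than a long-term design.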

@tdas @pwendell @mengxr

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull

@SparkQA

SparkQA commented Aug 13, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull

The previous code used the length of the referenceBuffer, which is
the number of elements registered for clean-up, rather than the
number of elements registered AND de-referenced.

What we want is the length of the referenceQueue. However, Java
does not expose this, so we must access it through reflection.
Since this is potentially expensive, we need to limit the number
of times we access the queue length this way.
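For illustration, a sketch of the kind of reflective access described above, assuming an OpenJDK-style ReferenceQueue that keeps its pending count in a private queueLength field. The helper object and its name are hypothetical, and the lookup is wrapped defensively since other JVMs may not expose that field:

import java.lang.ref.ReferenceQueue

object ReferenceQueueLengthSketch {
  // Best-effort read of the number of references waiting in the queue.
  // Returns -1 if the field does not exist or cannot be read on this JVM.
  def length(queue: ReferenceQueue[_]): Long = {
    try {
      val field = classOf[ReferenceQueue[_]].getDeclaredField("queueLength")
      field.setAccessible(true)
      field.getLong(queue)
    } catch {
      case _: Exception => -1L
    }
  }
}

Because the reflective read can be costly, the commit limits how often it is performed, which is also what motivates the wrapper class discussed further down.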
@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull

  }
} catch {
  case e: Exception =>
    logDebug("Failed to access reference queue's length through reflection: " + e)
Contributor

Add a note on why this is logDebug and not logWarning/logError.

@tdas
Contributor

tdas commented Aug 14, 2014

It's a little ugly that the ContextCleaner class is being polluted with so many parameters and all the temporary queue length code. Wouldn't it be much cleaner if we made a custom ReferenceQueue with a length() method that does this reflection on itself to find the queue length? All the iteration counter, queue length checking, and error message printing code could then go inside that ReferenceQueue implementation, cleanly separated from the main context cleaner logic.

@andrewor14
Contributor Author

Yeah, sounds good. I guess we'll use a ReferenceQueueWithSize or something instead.
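A hypothetical ReferenceQueueWithSize along the lines tdas suggests might look roughly like this. It is illustrative only and is not what was ultimately merged (the queue-length logic was later dropped from the PR); the point is that the sampling counter, the reflective lookup, and the failure handling all live inside the queue class:

import java.lang.ref.ReferenceQueue

// Hypothetical wrapper, not the merged implementation: keeps the reflection,
// the sampling counter, and the failure handling out of ContextCleaner.
class ReferenceQueueWithSize[T] extends ReferenceQueue[T] {
  // Only perform the potentially expensive reflective read every N calls.
  private val checkInterval = 1000
  private var callCount = 0L
  private var cachedLength = -1L

  private def readLengthByReflection(): Long = {
    try {
      val field = classOf[ReferenceQueue[_]].getDeclaredField("queueLength")
      field.setAccessible(true)
      field.getLong(this)
    } catch {
      case _: Exception => -1L // a missing field just means we cannot report a length
    }
  }

  // Approximate number of pending references, refreshed periodically.
  def length(): Long = {
    callCount += 1
    if (cachedLength < 0 || callCount % checkInterval == 0) {
      cachedLength = readLengthByReflection()
    }
    cachedLength
  }
}

In the end the PR dropped the queue-length reporting entirely (see the next comment), so nothing like this class was merged.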

@andrewor14
Contributor Author

I have removed the logic of logging queue length as a warning. This significantly simplifies the PR and fulfills its original purpose as a bug fix. We can add back some notion of warning later on if there is interest.

@SparkQA

SparkQA commented Aug 16, 2014

QA tests have started for PR 1931 at commit d0f7195.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1931 at commit d0f7195.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

Great - I like this version better!

asfgit closed this in c9da466 Aug 16, 2014
asfgit pushed a commit that referenced this pull request Aug 16, 2014
More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is that if we send too many blocking Akka messages that are dependent on each other in quick succession, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means that if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism by batching cleanup tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.

tdas pwendell mengxr

Author: Andrew Or <[email protected]>

Closes #1931 from andrewor14/reference-blocking and squashes the following commits:

d0f7195 [Andrew Or] Merge branch 'master' of github.com:apache/spark into reference-blocking
ce9daf5 [Andrew Or] Remove logic for logging queue length
111192a [Andrew Or] Add missing space in log message (minor)
a183b83 [Andrew Or] Switch order of code blocks (minor)
9fd1fe6 [Andrew Or] Remove outdated log
104b366 [Andrew Or] Use the actual reference queue length
0b7e768 [Andrew Or] Block on cleaning tasks by default + log error on queue full
(cherry picked from commit c9da466)

Signed-off-by: Patrick Wendell <[email protected]>
 */
private val blockOnCleanupTasks = sc.conf.getBoolean(
  "spark.cleaner.referenceTracking.blocking", false)
Contributor

These changes will not solve the problem here. See BlockManagerMasterActor.scala#L165:

  private def removeShuffle(shuffleId: Int): Future[Seq[Boolean]] = {
    // Nothing to do in the BlockManagerMasterActor data structures
    import context.dispatcher
    val removeMsg = RemoveShuffle(shuffleId)
    Future.sequence(
      blockManagerInfo.values.map { bm =>
        // The akkaTimeout is already set on this ask
        bm.slaveActor.ask(removeMsg)(akkaTimeout).mapTo[Boolean]
      }.toSeq
    )
  }

andrewor14 deleted the reference-blocking branch August 27, 2014 18:14
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@igreenfield

@andrewor14
I see in the code:

/**
 * Whether the cleaning thread will block on cleanup tasks (other than shuffle, which
 * is controlled by the spark.cleaner.referenceTracking.blocking.shuffle parameter).
 * Due to SPARK-3015, this is set to true by default. This is intended to be only a temporary
 * workaround for the issue, which is ultimately caused by the way the BlockManager endpoints
 * issue inter-dependent blocking RPC messages to each other at high frequencies. This happens,
 * for instance, when the driver performs a GC and cleans up all broadcast blocks that are no
 * longer in scope.
 */

Is that still needed?
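For reference, these are the two configuration keys being discussed. A minimal sketch of setting them explicitly on a SparkConf (the values shown are illustrative, not the shipped defaults or a recommendation):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Whether the cleaning thread blocks on non-shuffle cleanup tasks (the SPARK-3015 workaround).
  .set("spark.cleaner.referenceTracking.blocking", "true")
  // Shuffle cleanup blocking is controlled separately by this key.
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")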
