
[SPARK-3015] Block on cleaning tasks to prevent Akka timeouts #1931


Closed
andrewor14 wants to merge 7 commits into apache:master from andrewor14:reference-blocking

Conversation

andrewor14
Contributor

More detail on the issue is described in SPARK-3015, but the TLDR is that if we send too many blocking Akka messages that are dependent on each other in quick succession, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means that if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many RemoveBroadcast messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism by batching cleanup tasks or processing them in a sliding window. In the longer run, we should clean up the whole BlockManager* message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.
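To make the behavior concrete, here is a minimal, self-contained sketch of the pattern this change enables. This is not the actual ContextCleaner code; BlockingCleanerSketch, removeBroadcast, the timeout, and the loop are hypothetical stand-ins used only to contrast firing all removals at once with blocking on each one before sending the next:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingCleanerSketch {
  // Hypothetical stand-in for the ask that eventually fans out a RemoveBroadcast
  // message to every block manager endpoint.
  def removeBroadcast(broadcastId: Long): Future[Unit] = Future {
    Thread.sleep(10) // pretend this is the round trip to the executors
  }

  def cleanBroadcasts(ids: Seq[Long], blocking: Boolean): Unit = {
    ids.foreach { id =>
      val f = removeBroadcast(id)
      // When blocking, wait for this removal to finish before issuing the next one,
      // so dependent messages are never all in flight at the same time.
      // When not blocking, the futures pile up and many RemoveBroadcast-style
      // messages can be outstanding at once (the SPARK-3015 symptom).
      if (blocking) {
        Await.result(f, 30.seconds)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    cleanBroadcasts((1L to 20L).toSeq, blocking = true)
    println("done cleaning")
  }
}

The trade-off is that blocking serializes the cleanup work, which is why the description above treats this as a stopgap rather than a long-term design.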

@tdas @pwendell @mengxr

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull

@SparkQA

SparkQA commented Aug 13, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull

The previous code used the length of the referenceBuffer, which is
the number of elements registered for clean-up, rather than the
number of elements registered AND de-referenced.

What we want is the length of the referenceQueue. However, Java
does not expose this, so we must access it through reflection.
Since this is potentially expensive, we need to limit the number
of times we access the queue length this way.
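For illustration, a sketch of the kind of reflective access described above, assuming an OpenJDK-style ReferenceQueue that keeps its pending count in a private queueLength field. The helper object and its name are hypothetical, and the lookup is wrapped defensively since other JVMs may not expose that field:

import java.lang.ref.ReferenceQueue

object ReferenceQueueLengthSketch {
  // Best-effort read of the number of references waiting in the queue.
  // Returns -1 if the field does not exist or cannot be read on this JVM.
  def length(queue: ReferenceQueue[_]): Long = {
    try {
      val field = classOf[ReferenceQueue[_]].getDeclaredField("queueLength")
      field.setAccessible(true)
      field.getLong(queue)
    } catch {
      case _: Exception => -1L
    }
  }
}

Because the reflective read can be costly, the commit limits how often it is performed, which is also what motivates the wrapper class discussed further down.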
@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1931. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1931:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull

  }
} catch {
  case e: Exception =>
    logDebug("Failed to access reference queue's length through reflection: " + e)
Contributor

Add a note on why this is logDebug and not logWarning/logError.

@tdas
Contributor

tdas commented Aug 14, 2014

It's a little ugly that the ContextCleaner class is being polluted with so many parameters and all the temporary queue length code. Wouldn't it be much cleaner if we made a custom ReferenceQueue with a length() method that does this reflection on itself to find the queue length? All the iteration counter, queue length checking, and error message printing code could then go inside that ReferenceQueue implementation, cleanly separated from the main context cleaner logic.

@andrewor14
Contributor Author

Yeah, sounds good. I guess we'll use a ReferenceQueueWithSize or something instead.
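A hypothetical ReferenceQueueWithSize along the lines tdas suggests might look roughly like this. It is illustrative only and is not what was ultimately merged (the queue-length logic was later dropped from the PR); the point is that the sampling counter, the reflective lookup, and the failure handling all live inside the queue class:

import java.lang.ref.ReferenceQueue

// Hypothetical wrapper, not the merged implementation: keeps the reflection,
// the sampling counter, and the failure handling out of ContextCleaner.
class ReferenceQueueWithSize[T] extends ReferenceQueue[T] {
  // Only perform the potentially expensive reflective read every N calls.
  private val checkInterval = 1000
  private var callCount = 0L
  private var cachedLength = -1L

  private def readLengthByReflection(): Long = {
    try {
      val field = classOf[ReferenceQueue[_]].getDeclaredField("queueLength")
      field.setAccessible(true)
      field.getLong(this)
    } catch {
      case _: Exception => -1L // a missing field just means we cannot report a length
    }
  }

  // Approximate number of pending references, refreshed periodically.
  def length(): Long = {
    callCount += 1
    if (cachedLength < 0 || callCount % checkInterval == 0) {
      cachedLength = readLengthByReflection()
    }
    cachedLength
  }
}

In the end the PR dropped the queue-length reporting entirely (see the next comment), so nothing like this class was merged.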

@andrewor14
Contributor Author

I have removed the logic of logging queue length as a warning. This significantly simplifies the PR and fulfills its original purpose as a bug fix. We can add back some notion of warning later on if there is interest.

@SparkQA

SparkQA commented Aug 16, 2014

QA tests have started for PR 1931 at commit d0f7195.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1931 at commit d0f7195.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

Great - I like this version better!

asfgit closed this in c9da466 Aug 16, 2014
asfgit pushed a commit that referenced this pull request Aug 16, 2014
More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is that if we send too many blocking Akka messages that are dependent on each other in quick succession, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means that if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism by batching cleanup tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.

tdas pwendell mengxr

Author: Andrew Or <[email protected]>

Closes #1931 from andrewor14/reference-blocking and squashes the following commits:

d0f7195 [Andrew Or] Merge branch 'master' of github.com:apache/spark into reference-blocking
ce9daf5 [Andrew Or] Remove logic for logging queue length
111192a [Andrew Or] Add missing space in log message (minor)
a183b83 [Andrew Or] Switch order of code blocks (minor)
9fd1fe6 [Andrew Or] Remove outdated log
104b366 [Andrew Or] Use the actual reference queue length
0b7e768 [Andrew Or] Block on cleaning tasks by default + log error on queue full
(cherry picked from commit c9da466)

Signed-off-by: Patrick Wendell <[email protected]>
 */
private val blockOnCleanupTasks = sc.conf.getBoolean(
  "spark.cleaner.referenceTracking.blocking", false)
Contributor

These changes will not solve the problem here. See BlockManagerMasterActor.scala#L165:

  private def removeShuffle(shuffleId: Int): Future[Seq[Boolean]] = {
    // Nothing to do in the BlockManagerMasterActor data structures
    import context.dispatcher
    val removeMsg = RemoveShuffle(shuffleId)
    Future.sequence(
      blockManagerInfo.values.map { bm =>
        // The akkaTimeout is already set on this ask
        bm.slaveActor.ask(removeMsg)(akkaTimeout).mapTo[Boolean]
      }.toSeq
    )
  }

andrewor14 deleted the reference-blocking branch August 27, 2014 18:14
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@igreenfield

@andrewor14
I see in the code:

/**
 * Whether the cleaning thread will block on cleanup tasks (other than shuffle, which
 * is controlled by the spark.cleaner.referenceTracking.blocking.shuffle parameter).
 * Due to SPARK-3015, this is set to true by default. This is intended to be only a temporary
 * workaround for the issue, which is ultimately caused by the way the BlockManager endpoints
 * issue inter-dependent blocking RPC messages to each other at high frequencies. This happens,
 * for instance, when the driver performs a GC and cleans up all broadcast blocks that are no
 * longer in scope.
 */

Is that still needed?
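For reference, these are the two configuration keys being discussed. A minimal sketch of setting them explicitly on a SparkConf (the values shown are illustrative, not the shipped defaults or a recommendation):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Whether the cleaning thread blocks on non-shuffle cleanup tasks (the SPARK-3015 workaround).
  .set("spark.cleaner.referenceTracking.blocking", "true")
  // Shuffle cleanup blocking is controlled separately by this key.
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "false")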
