Skip to content

[SPARK-4454] Properly synchronize accesses to DAGScheduler cacheLocs map #4660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

JoshRosen
Copy link
Contributor

This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its cacheLocs map.

This map is accessed by the getCacheLocs and clearCacheLocs() methods, which can be called by separate threads, since DAGScheduler's getPreferredLocs() method is called by SparkContext and indirectly calls getCacheLocs(). If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.

Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread. An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations. However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop). Since cacheLocs is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.

See #3345 for additional context.

@SparkQA
Copy link

SparkQA commented Feb 17, 2015

Test build #27651 has started for PR 4660 at commit 12d64ba.

  • This patch merges cleanly.

@JoshRosen JoshRosen changed the title [SPARK-4544] Properly synchronize accesses to DAGScheduler cacheLocs map [SPARK-4454] Properly synchronize accesses to DAGScheduler cacheLocs map Feb 17, 2015
@JoshRosen
Copy link
Contributor Author

/cc @pwendell @markhamstra

@SparkQA
Copy link

SparkQA commented Feb 17, 2015

Test build #27651 has finished for PR 4660 at commit 12d64ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27651/
Test PASSed.

@pwendell
Copy link
Contributor

This LGTM Josh - thanks for giving this such a thorough look.

asfgit pushed a commit that referenced this pull request Feb 18, 2015
This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its `cacheLocs` map.

This map is accessed by the `getCacheLocs` and `clearCacheLocs()` methods, which can be called by separate threads, since DAGScheduler's `getPreferredLocs()` method is called by SparkContext and indirectly calls `getCacheLocs()`.  If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.

Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread.  An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations.  However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop).  Since `cacheLocs` is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.

See #3345 for additional context.

Author: Josh Rosen <[email protected]>

Closes #4660 from JoshRosen/SPARK-4454 and squashes the following commits:

12d64ba [Josh Rosen] Properly synchronize accesses to DAGScheduler cacheLocs map.

(cherry picked from commit d46d624)
Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in d46d624 Feb 18, 2015
@JoshRosen
Copy link
Contributor Author

I pushed a a51fc7e hotfix to roll back a bit of code cleanup that I did which added a getOrElseUpdate call, since this might have a performance impact as this code is called O(num tasks) times.

@JoshRosen JoshRosen deleted the SPARK-4454 branch February 18, 2015 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants