[SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by execution #10170
This reduces coupling between failed tests.
…o be dropped. Previously, ensureFreeSpace() might end up not dropping blocks if the total storage memory pool usage was less than the maximum possible storage pool usage.
Test build #47251 has finished for PR 10170 at commit
I think that the test failures in Previously, it looks like After the fixes implemented here, we'll first claim as much free storage memory as possible, then subtract that from our memory goal and request the remaining memory via spilling. As a result, we are more prone to evict, which might be throwing off the original test case (it's a little tricky to say due to size estimation; I'll try to see if I can decouple that via mocking in order to make the test easier to reason about). One minor question of semantics: up until now (and still) it looks like @andrewor14, is the original idea behind |
if (numBytesToAcquire > memoryFree && maxNumBytesToFree > 0) {
val additionalMemoryRequired = numBytesToAcquire - memoryFree
memoryStore.evictBlocksToFreeSpace(
Some(blockId), Math.min(maxNumBytesToFree, additionalMemoryRequired), evictedBlocks)
can we use math.min
like we do in other places
actually, to improve readability a little:
val additionalMemoryRequired = ...
val numBytesToFree = math.min(maxNumBytesToFree, additionalMemoryRequired)
memoryStore.evictBlocksToFreeSpace(Some(blockId), numBytesToFree, evictedBlocks)
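To make the arithmetic in the snippet under review easier to check, here is a minimal standalone sketch of how the number of bytes to evict is derived. The object `EvictionAmountSketch` and the function `bytesToEvict` are hypothetical names used only for illustration; they are not part of the patch, and the real code calls `memoryStore.evictBlocksToFreeSpace` rather than returning a number.

```scala
object EvictionAmountSketch {
  // Hypothetical helper: returns how many bytes would be requested from eviction.
  // Mirrors the condition and the min() from the reviewed snippet.
  def bytesToEvict(numBytesToAcquire: Long, memoryFree: Long, maxNumBytesToFree: Long): Long =
    if (numBytesToAcquire > memoryFree && maxNumBytesToFree > 0) {
      // Only the shortfall beyond the free memory needs to come from eviction.
      val additionalMemoryRequired = numBytesToAcquire - memoryFree
      // Never evict more than the caller allows.
      math.min(maxNumBytesToFree, additionalMemoryRequired)
    } else {
      0L
    }
}
```

For instance, a 500-byte request with 250 bytes free asks eviction for only the 250-byte shortfall, and less than that if `maxNumBytesToFree` is smaller.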
@JoshRosen despite the number of comments I left I think this patch looks good. I did a close review to verify its correctness and that the two bugs were real issues. On the side I will investigate the test failures and hopefully get this merged soon.
This commit also adds a regression test for SPARK-12189 through the existing test "execution evicts storage" by adding an assert.
Rewrite some tests after changes in eviction
Test build #47378 has finished for PR 10170 at commit
Test build #47380 has finished for PR 10170 at commit
Note to self RE: unroll fraction (just to be precise): At a high level, in the StaticMemoryManager, a request for unroll memory will be able to obtain more memory if:
An unroll request can only evict blocks if
Test build #47382 has finished for PR 10170 at commit
Test build #47383 has finished for PR 10170 at commit
Before this commit, unrolling would evict too many blocks, resulting in test failures in BlockManagerSuite. The root cause is that we used `maxUnrollMemory` as a cap for the extra amount of memory to evict for unrolling, which is incorrect. Instead, we should use it as a cap for the total amount of unroll memory and calculate the amount of memory to evict from there. The goal of this commit is to preserve the old behavior (in 1.5) as much as possible. This can be seen from the fact that BlockManagerSuite now passes without any modifications.
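The difference between the two readings of `maxUnrollMemory` can be sketched with plain arithmetic. This is an illustrative model only, not the code from the commit: the object `UnrollEvictionSketch`, both function names, and the parameters `currentUnrollMemory` and `freeMemory` are assumptions introduced here to show why capping the extra eviction over-evicts, while capping the total unroll memory does not.

```scala
object UnrollEvictionSketch {
  // Assumed model of the old behavior: maxUnrollMemory caps only the *extra*
  // bytes to evict, ignoring memory that unrolling already holds or can use for free.
  def bytesToEvictOld(numBytes: Long, maxUnrollMemory: Long, freeMemory: Long): Long =
    math.max(0L, math.min(maxUnrollMemory, numBytes - freeMemory))

  // Assumed model of the fix: maxUnrollMemory caps the *total* unroll memory,
  // so the eviction budget is whatever remains after current usage and free memory.
  def bytesToEvictNew(
      numBytes: Long,
      maxUnrollMemory: Long,
      currentUnrollMemory: Long,
      freeMemory: Long): Long = {
    val maxNumBytesToFree = math.max(0L, maxUnrollMemory - currentUnrollMemory - freeMemory)
    // Keep the result within 0 <= x <= maxNumBytesToFree.
    math.max(0L, math.min(maxNumBytesToFree, numBytes - freeMemory))
  }
}
```

With a 500-byte request, a 200-byte unroll cap, 150 bytes already unrolled, and 30 bytes free, the first model would evict up to 200 bytes, whereas the second evicts at most 20.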
One way to fix the tests
LGTM, I'll merge this once tests pass.
Updated PR description to remove the following text from the end (which will be incorporated into separate PRs, most likely): TODOs
Test build #47433 has finished for PR 10170 at commit
Merging into master and 1.6!!
…execution

This patch fixes a bug in the eviction of storage memory by execution.

## The bug:

In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug:

```scala
val maxMemory = 1000L
val taskAttemptId = 0L
val (mm, ms) = makeThings(maxMemory)
// Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes
// of storage memory which are immune to eviction by execution memory pressure.

// Acquire enough storage memory to exceed the storage region size
assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks))
assertEvictBlocksToFreeSpaceNotCalled(ms)
assert(mm.executionMemoryUsed === 0L)
assert(mm.storageMemoryUsed === 750L)

// At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution
// should be able to reclaim up to 250 bytes of storage memory.
// Therefore, execution should now be able to require up to 500 bytes of memory:
assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L
assert(mm.storageMemoryUsed === 500L)
assert(mm.executionMemoryUsed === 500L)
assertEvictBlocksToFreeSpaceCalled(ms, 250L)
```

The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`.
However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used.

The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`.

## The solution:

The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes:

- Remove a layer of inheritance which made the memory manager code harder to understand (5384117).
- Move some bounds checks earlier in the call chain (13ba7ad).
- Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca0).
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c).
- Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction; (2dc842a).

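The numbers from the regression test can be plugged into a toy model of the faulty check to see why no eviction happened. This is a sketch under stated assumptions, not Spark code: the object `EnsureFreeSpaceSketch` and both functions are hypothetical, the "is there already enough room" comparison is modeled as less-than-or-equal, and the new path is modeled as evicting exactly what the caller asks for.

```scala
object EnsureFreeSpaceSketch {
  // State taken from the regression test in the description.
  val maxMemory = 1000L
  val storageMemoryUsed = 750L
  val executionMemoryUsed = 0L

  // As described above: with the UnifiedMemoryManager, the store's maximum
  // shrinks only by the execution memory that is actually in use.
  def maxStorageMemory: Long = maxMemory - executionMemoryUsed

  // Assumed model of the old ensureFreeSpace(): skip eviction whenever the
  // request appears to fit in the store's apparent free space.
  def bytesEvictedOld(space: Long): Long = {
    val apparentFree = maxStorageMemory - storageMemoryUsed // 1000 - 0 - 750 = 250
    if (space <= apparentFree) 0L else space - apparentFree
  }

  // Assumed model of evictBlocksToFreeSpace(): the caller has already decided
  // how much must be reclaimed, so there is no second "is there room?" check.
  def bytesEvictedNew(space: Long): Long = space
}
```

In this model the 250-byte reclaim request evicts nothing under the old check, which matches the test's observation that execution obtained only 250 of the 500 bytes it asked for.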
Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided.

I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](7c68ca0#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details.

Author: Josh Rosen <[email protected]>
Author: Andrew Or <[email protected]>

Closes #10170 from JoshRosen/SPARK-12165.

(cherry picked from commit aec5ea0)
Signed-off-by: Andrew Or <[email protected]>
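The double-counting described for SPARK-12189 can be reproduced with a toy pool. This is a hedged illustration, not the real `StorageMemoryPool`: the class `ToyStoragePool`, the methods `shrinkBuggy` and `shrinkFixed`, and the idea of passing evicted block sizes as a `Seq[Long]` are all assumptions made for the example.

```scala
// Toy stand-in for a storage pool that tracks used bytes.
final class ToyStoragePool {
  private var _memoryUsed = 0L
  def memoryUsed: Long = _memoryUsed

  def acquire(numBytes: Long): Unit = _memoryUsed += numBytes

  // Called once per evicted block, as the description says freeMemory is.
  def freeMemory(numBytes: Long): Unit = _memoryUsed -= numBytes

  // Assumed model of the bug: blocks are freed (first decrement), then the
  // reclaimed total is subtracted again (second decrement).
  def shrinkBuggy(evictedBlockSizes: Seq[Long]): Long = {
    evictedBlockSizes.foreach(freeMemory)
    val reclaimed = evictedBlockSizes.sum
    _memoryUsed -= reclaimed
    reclaimed
  }

  // Assumed model of the fix: rely on freeMemory alone for the accounting.
  def shrinkFixed(evictedBlockSizes: Seq[Long]): Long = {
    evictedBlockSizes.foreach(freeMemory)
    evictedBlockSizes.sum
  }
}
```

Starting from 300 bytes used and evicting blocks of 100 and 150 bytes, the buggy variant leaves the counter negative, while the fixed variant leaves the expected 50 bytes.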