SKIPME Merging Apache branch-1.6 #128

Merged
merged 28 commits into alteryx:csd-1.6 on Dec 11, 2015

Conversation

markhamstra

No description provided.

JoshRosen and others added 28 commits December 9, 2015 11:40
…execution

This patch fixes a bug in the eviction of storage memory by execution.

## The bug:

In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug:

```scala
    val maxMemory = 1000L
    val taskAttemptId = 0L
    val (mm, ms) = makeThings(maxMemory)
    // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes
    // of storage memory which are immune to eviction by execution memory pressure.

    // Acquire enough storage memory to exceed the storage region size
    assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks))
    assertEvictBlocksToFreeSpaceNotCalled(ms)
    assert(mm.executionMemoryUsed === 0L)
    assert(mm.storageMemoryUsed === 750L)

    // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution
    // should be able to reclaim up to 250 bytes of storage memory.
    // Therefore, execution should now be able to require up to 500 bytes of memory:
    assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L
    assert(mm.storageMemoryUsed === 500L)
    assert(mm.executionMemoryUsed === 500L)
    assertEvictBlocksToFreeSpaceCalled(ms, 250L)
```

The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space can already be satisfied from `maxStorageMemory - storageMemoryUsed`, which will be the case if there is any free execution memory, because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used.
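
To make this concrete, here is a simplified sketch of that check (illustrative names only, not the actual Spark source), annotated with the numbers from the regression test above: the 250 bytes that execution wants reclaimed look like they are already free, so no blocks are evicted and the storage pool is never actually shrunk.

```scala
// Simplified sketch (not the real ensureFreeSpace) of the pre-fix check, using the
// regression test's numbers. Under UnifiedMemoryManager, maxStorageMemory is
// maxMemory - onHeapExecutionMemoryPool.memoryUsed, so it still reports 1000 here.
def ensureFreeSpaceSketch(space: Long, maxStorageMemory: Long, storageMemoryUsed: Long): Boolean = {
  val freeMemory = maxStorageMemory - storageMemoryUsed  // 1000 - 750 = 250
  if (space <= freeMemory) {
    // 250 <= 250: treated as "already enough free space", so nothing is evicted and
    // shrinkPoolToReclaimSpace(250L) ends up reclaiming 0 bytes for execution.
    true
  } else {
    // otherwise: evict blocks until `space` bytes are free
    true
  }
}
```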

The control flow here is somewhat confusing (it grew messy and confusing over time as a result of the merging and refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`.

## The solution:

The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes:

- Remove a layer of inheritance which made the memory manager code harder to understand (5384117).
- Move some bounds checks earlier in the call chain (13ba7ad).
- Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca0).
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c).
- Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](https://github.com/apache/spark/blob/2dc842aea82c8895125d46a00aa43dfb0d121de9/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala#L88) how much memory needs to be reclaimed via eviction (2dc842a).
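
The net effect is a roughly linear flow like the following sketch (an illustration of the idea only, not a copy of the patched `StorageMemoryPool`): the pool first works out how much must come from eviction and then asks the store to evict exactly that amount.

```scala
// Hedged sketch of the post-fix, linear flow: free unused pool space first, then call
// an evictBlocksToFreeSpace-style callback for exactly the remaining shortfall.
def shrinkPoolToFreeSpaceSketch(
    spaceToFree: Long,
    memoryFree: Long,
    evictBlocksToFreeSpace: Long => Long): Long = {
  val freedWithoutEviction = math.min(spaceToFree, memoryFree)
  val remainingToFree = spaceToFree - freedWithoutEviction
  val freedByEviction =
    if (remainingToFree > 0) evictBlocksToFreeSpace(remainingToFree) else 0L
  freedWithoutEviction + freedByEviction
}
```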

Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](https://github.com/apache/spark/blob/80a824d36eec9d9a9f092ee1741453851218ec73/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala#L84) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided.

I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](apache@7c68ca0#diff-935c68a9803be144ed7bafdd2f756a0fL133) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details.

Author: Josh Rosen <[email protected]>
Author: Andrew Or <[email protected]>

Closes apache#10170 from JoshRosen/SPARK-12165.

(cherry picked from commit aec5ea0)
Signed-off-by: Andrew Or <[email protected]>
…ML, throws console error

Don't warn when a description isn't valid HTML, since it may legitimately be something like "SELECT ... where foo <= 1"

The tests for this code indicate that it's normal to treat strings like this, which don't contain HTML, as plain strings rather than markup. Hence logging every such instance as a warning is too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above.

CC tdas as author of this bit of code

Author: Sean Owen <[email protected]>

Closes apache#10159 from srowen/SPARK-11824.

(cherry picked from commit 1eb7c22)
Signed-off-by: Sean Owen <[email protected]>
…de_example

PR on behalf of somideshmukh, thanks!

Author: Xusen Yin <[email protected]>
Author: somideshmukh <[email protected]>

Closes apache#10219 from yinxusen/SPARK-11551.

(cherry picked from commit 051c6a0)
Signed-off-by: Xiangrui Meng <[email protected]>
…tion from 1.1

The "Migration from 1.1" section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#migrating-from-spark-11) uses \{{site.SPARK_VERSION}} as the version where the changes were introduced; it should be just 1.2.

Author: Andrew Ray <[email protected]>

Closes apache#10206 from aray/graphx-doc-1.1-migration.

(cherry picked from commit 7a8e587)
Signed-off-by: Joseph K. Bradley <[email protected]>
JoshRosen

Author: Andrew Or <[email protected]>

Closes apache#10229 from andrewor14/unroll-test-comments.

(cherry picked from commit 8770bd1)
Signed-off-by: Josh Rosen <[email protected]>
This PR adds documentation for `basePath`, which is a new parameter used by `HadoopFsRelation`.

The compiled doc is shown below.
![image](https://cloud.githubusercontent.com/assets/2072857/11673132/1ba01192-9dcb-11e5-98d9-ac0b4e92e98c.png)

JIRA: https://issues.apache.org/jira/browse/SPARK-11678
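
As a usage sketch (assuming an existing `sqlContext`; paths and the partition column are illustrative, not taken from the PR): `basePath` tells partition discovery where the table root is when only a subset of the partition directories is read.

```scala
// Hedged sketch: read a single partition directory but keep `country` as a partition
// column by pointing partition discovery at the table root via basePath.
val df = sqlContext.read
  .option("basePath", "/data/events")   // table root (illustrative path)
  .parquet("/data/events/country=US")   // one partition directory under the root
```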

Author: Yin Huai <[email protected]>

Closes apache#10211 from yhuai/basePathDoc.

(cherry picked from commit ac8cdf1)
Signed-off-by: Yin Huai <[email protected]>
… docker-client

This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build.

Author: Mark Grover <[email protected]>

Closes apache#9876 from markgrover/master_docker.

(cherry picked from commit 2166c2a)
Signed-off-by: Josh Rosen <[email protected]>
…thState and change tracking function signature

SPARK-12244:

Based on feedback from early users and personal experience attempting to explain it, the name trackStateByKey had two problems:
- "trackState" is a completely new term which gives no real intuition about what the operation is.
- The resultant data stream of objects returned by the function is referred to in the docs as the "emitted" data, for lack of a better term.

"mapWithState" makes sense because the API is like a mapping function, (Key, Value) => T, with State as an additional parameter. The resultant data stream is the "mapped data", so both problems are solved.

SPARK-12245:

From initial experience, not having the key in the function makes it hard to return mapped output, as the full information of the record is not available. Basically, the user is restricted to doing something like mapValues() instead of map(), so the key is added as a parameter.
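
A minimal sketch of what the renamed API looks like with the key included in the mapping function (a running word count with illustrative names; `wordDstream` is an assumed `DStream[(String, Int)]`, and the call itself is shown commented out):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// The mapping function now receives the key, the new value, and the State, and its
// return value becomes the "mapped data" emitted downstream.
val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
  val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // update the per-key running state
  (word, sum)         // emitted as the mapped stream
}

// val mappedStream = wordDstream.mapWithState(StateSpec.function(mappingFunc))
```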

Author: Tathagata Das <[email protected]>

Closes apache#10224 from tdas/rename.
…x and suffix parameters

The original code does not properly handle the case where the prefix is null but the suffix is not null: the suffix should be used but is not.

The fix uses StringBuilder to construct the proper file name.
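
A minimal sketch of the construction described above (a generic helper with illustrative names, not the actual Spark code), where either part may be null:

```scala
// Hedged sketch: build "<prefix><name><suffix>", skipping whichever parts are null.
// Previously the suffix was effectively dropped whenever the prefix was null.
def buildFileName(prefix: String, name: String, suffix: String): String = {
  val sb = new StringBuilder
  if (prefix != null) sb.append(prefix)
  sb.append(name)
  if (suffix != null) sb.append(suffix)
  sb.toString
}
```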

Author: bomeng <[email protected]>
Author: Bo Meng <[email protected]>

Closes apache#10185 from bomeng/SPARK-12136.

(cherry picked from commit e29704f)
Signed-off-by: Sean Owen <[email protected]>
Author: Reynold Xin <[email protected]>

Closes apache#10226 from rxin/df-transform.

(cherry picked from commit 76540b6)
Signed-off-by: Reynold Xin <[email protected]>
…etFile

SparkR supports ```read.parquet``` and deprecates ```parquetFile```. This change is similar to apache#10145 for ```jsonFile```.

Author: Yanbo Liang <[email protected]>

Closes apache#10191 from yanboliang/spark-12198.

(cherry picked from commit eeb5872)
Signed-off-by: Shivaram Venkataraman <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-11602

Made a pass on the API changes in 1.6. Opening the PR for efficient discussion.

Author: Yuhao Yang <[email protected]>

Closes apache#9939 from hhbyyh/auditScala.

(cherry picked from commit 9fba9c8)
Signed-off-by: Joseph K. Bradley <[email protected]>
…``select``` argument

Fix the ```subset``` function error when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <[email protected]>

Closes apache#10217 from yanboliang/spark-12234.

(cherry picked from commit d9d354e)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…tadata when visualizing SQL query plan

This PR backports PR apache#10004 to branch-1.6

It adds a private[sql] method `metadata` to SparkPlan, which can be used to describe detailed information about a physical plan during visualization. Specifically, this PR uses this method to provide details of PhysicalRDDs translated from a data source relation.
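
A rough sketch of the kind of hook described (simplified, with illustrative names; not the backported code itself):

```scala
// Hedged sketch: a metadata hook a physical plan node can override so that the SQL UI
// visualization can display key/value details about the node (e.g. scan inputs).
trait PlanNodeWithMetadata {
  /** Details shown when visualizing the query plan; empty by default. */
  def metadata: Map[String, String] = Map.empty
}

class ParquetScanSketch(inputPath: String) extends PlanNodeWithMetadata {
  override def metadata: Map[String, String] =
    Map("Format" -> "Parquet", "InputPaths" -> inputPath)
}
```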

Author: Cheng Lian <[email protected]>

Closes apache#10250 from liancheng/spark-12012.for-1.6.
…ails of its inputSchema

https://issues.apache.org/jira/browse/SPARK-12250

Author: Yin Huai <[email protected]>

Closes apache#10236 from yhuai/SPARK-12250.

(cherry picked from commit bc5f56a)
Signed-off-by: Yin Huai <[email protected]>
…rk.mllib and mllib in the documentation.

Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).

It also removes some files that I forgot to delete with apache#10207

Author: Timothy Hunter <[email protected]>

Closes apache#10234 from thunterdb/12212.

(cherry picked from commit 2ecbe02)
Signed-off-by: Joseph K. Bradley <[email protected]>
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.

- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (apache#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion (a minimal configuration sketch follows this list).
- Document these configurations on the configuration page.
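
As a minimal configuration sketch (not part of the patch itself; the size value is illustrative), the renamed settings are meant to be used together, and enabling off-heap memory without a positive size now fails fast:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  // Must be a positive size when off-heap memory is enabled; otherwise the new
  // validation check rejects the configuration instead of OOM-ing at runtime.
  .set("spark.memory.offHeap.size", "536870912")
```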

Author: Josh Rosen <[email protected]>

Closes apache#10237 from JoshRosen/SPARK-12251.

(cherry picked from commit 23a9e62)
Signed-off-by: Andrew Or <[email protected]>
**Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOMs and premature spills.

**Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen.

**Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B.
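
For illustration (a sketch of the arithmetic described above, with illustrative variable names, not the patch's code): the corrected cap counts the storage memory that execution could reclaim, not just the memory that is currently free.

```scala
// Numbers from the example above.
val maxMemory         = 1000L
val storageUsed       = 900L
val storageRegionSize = (maxMemory * 0.4).toLong  // spark.memory.storageFraction = 0.4
val numActiveTasks    = 2

// Execution may evict any storage above the storage region, so its maximum pool size is:
val maxExecutionPool = maxMemory - math.min(storageUsed, storageRegionSize)  // 600
val correctCap       = maxExecutionPool / numActiveTasks                     // 300

// The pre-fix cap only counted memory currently free to execution:
val buggyCap = (maxMemory - storageUsed) / numActiveTasks                    // 50
```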

This patch also guards against the race condition (SPARK-12253):
(1) Existing tasks collectively occupy all execution memory
(2) New task comes in and blocks while existing tasks spill
(3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory
(4) New task still cannot acquire memory and goes back to sleep

Author: Andrew Or <[email protected]>

Closes apache#10240 from andrewor14/fix-oom.

(cherry picked from commit 5030923)
Signed-off-by: Andrew Or <[email protected]>
… doc

With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as the Scala/Java API, so this change updates the description to make it more precise.

zsxwing tdas , please review, thanks a lot.

Author: jerryshao <[email protected]>

Closes apache#10246 from jerryshao/direct-kafka-doc-update.

(cherry picked from commit 24d3357)
Signed-off-by: Shixiong Zhu <[email protected]>
Check nullability and pass it into ScalaUDF.

Closes apache#10249

Author: Davies Liu <[email protected]>

Closes apache#10259 from davies/udf_null.

(cherry picked from commit b1b4ee7)
Signed-off-by: Yin Huai <[email protected]>
This is a follow-up PR for apache#10259

Author: Davies Liu <[email protected]>

Closes apache#10266 from davies/null_udf2.

(cherry picked from commit c119a34)
Signed-off-by: Davies Liu <[email protected]>
…iles

* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
jsonFile(sqlContext, "path1,path2")
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` deprecated and use ```read.json``` on the SparkR side.
* Replace all ```jsonFile``` calls with ```read.json``` in test_sparkSQL.R, but still keep the jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.

cc felixcheung sun-rui shivaram

Author: Yanbo Liang <[email protected]>

Closes apache#10145 from yanboliang/spark-12146.

(cherry picked from commit 0fb9825)
Signed-off-by: Shivaram Venkataraman <[email protected]>
markhamstra added a commit that referenced this pull request Dec 11, 2015
SKIPME Merging Apache branch-1.6
markhamstra merged commit 3877619 into alteryx:csd-1.6 on Dec 11, 2015