[SPARK-7251][Spark Core] Perform sequential scan when iterating over entries in BytesToBytesMap #5836
Conversation
Can one of the admins verify this patch?
```diff
@@ -89,7 +89,9 @@ public static void putDouble(Object object, long offset, double value) {
   }
 
   public static long allocateMemory(long size) {
-    return _UNSAFE.allocateMemory(size);
+    long address = _UNSAFE.allocateMemory(size);
+    _UNSAFE.setMemory(address, size, (byte) 0);
```
I don't think that we should zero out our allocated memory by default:

- In many cases, it is unnecessary and carries a performance penalty.
- `setMemory` isn't available in all versions of Java 6; this is one of the reasons why I chose not to expose it in the `UNSAFE` facade.

To address your comment from JIRA: we should handle this corner case. One approach might be to store a negative value for the record length. We should add a test case to BytesToBytesMapSuite that tries storing empty keys and values.
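One way to picture that suggestion (hypothetical constant and helper names; the encoding that actually shipped landed later in #6159) is to reserve a negative stored length as a sentinel, leaving zero free to mean a legitimate empty key or value:

```java
// Hypothetical sketch, not the actual BytesToBytesMap code: a negative stored
// length marks a sentinel, so 0 can still denote a valid empty key or value.
final class RecordLengthSketch {
  static final int END_OF_PAGE_LENGTH = -1;

  static boolean isEndOfPage(int storedKeyLength) {
    return storedKeyLength == END_OF_PAGE_LENGTH;
  }

  static boolean isEmptyKey(int storedKeyLength) {
    // Zero-length keys are legal, which is exactly why the sentinel must be negative.
    return storedKeyLength == 0;
  }
}
```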
@JoshRosen Please test this PR.
Jenkins, this is ok to test.
Merged build triggered.
Merged build started.
Test build #31726 has started for PR 5836 at commit
Test build #31726 has finished for PR 5836 at commit
Merged build finished. Test FAILed.
Test FAILed.
@JoshRosen The build failed with `oro#oro;2.0.8!oro.jar origin location must be absolute`. Please retest.
We're still investigating this Jenkins flakiness issue (it might have something to do with certain build machines' ivy caches). In the meantime, let's retest this. I'll loop back later to do a review pass on this.
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Test build #31795 has started for PR 5836 at commit
Test build #31795 has finished for PR 5836 at commit
Merged build finished. Test FAILed.
Test FAILed.
@JoshRosen Please retest this.
Merged build triggered.
Merged build started.
Test build #31815 has started for PR 5836 at commit
Merged build finished. Test FAILed.
Test FAILed.
@JoshRosen Please retest this. It failed because of a SparkSubmitSuite timeout.
Jenkins, retest this please.
(Sorry, we've been battling some test flakiness this week; I think that this test is either fixed or ignored now.)
Merged build triggered.
Merged build started.
Test build #31885 has started for PR 5836 at commit
Test build #31885 has finished for PR 5836 at commit
Merged build finished. Test PASSed.
Test PASSed.
@JoshRosen Please merge this to master.
```java
nextPos = bitset.nextSetBit(nextPos + 1);
return loc.with(pos, 0, true);
if (currentPage == null) {
  currentPage = dataPages.get(pageCur++);
```
We should never use inline increment operations like this. When I see this, I have to take time to remember whether `get()` is called with the value before the increment or after. I don't think that it makes sense to trade off clarity for brevity like this; please move the increment to a separate line.
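For illustration, the rewrite being asked for might look like this (a sketch reusing the names from the excerpt above, not the merged code):

```java
// Move the increment to its own statement so there is no doubt that
// get() receives the index value from before the increment.
if (currentPage == null) {
  currentPage = dataPages.get(pageCur);
  pageCur++;
}
```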
I decided to try to benchmark this using the following harness: https://gist.github.com/JoshRosen/286b26494ab27e657051

When I ran this benchmark, I immediately ran into a bug:

```
[error] Exception in thread "main" java.lang.NullPointerException
[error] at org.apache.spark.unsafe.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:194)
[error] at org.apache.spark.unsafe.map.BytesToBytesMap$1.next(BytesToBytesMap.java:195)
[error] at org.apache.spark.unsafe.map.BytesToBytesMap$1.next(BytesToBytesMap.java:174)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$.runBenchmark(BytesToBytesMapIterationBenchmark.scala:36)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2$$anonfun$apply$mcVI$sp$1$$anonfun$apply$1.apply$mcVI$sp(BytesToBytesMapIterationBenchmark.scala:64)
[error] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2$$anonfun$apply$mcVI$sp$1.apply(BytesToBytesMapIterationBenchmark.scala:63)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2$$anonfun$apply$mcVI$sp$1.apply(BytesToBytesMapIterationBenchmark.scala:58)
[error] at scala.collection.immutable.List.foreach(List.scala:318)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2.apply$mcVI$sp(BytesToBytesMapIterationBenchmark.scala:58)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2.apply(BytesToBytesMapIterationBenchmark.scala:57)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$$anonfun$main$2.apply(BytesToBytesMapIterationBenchmark.scala:57)
[error] at scala.collection.Iterator$class.foreach(Iterator.scala:727)
[error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
[error] at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
[error] at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark$.main(BytesToBytesMapIterationBenchmark.scala:57)
[error] at org.apache.spark.sql.BytesToBytesMapIterationBenchmark.main(BytesToBytesMapIterationBenchmark.scala)
```

I noticed many issues in this patch, which implies that we need better unit tests for this code (the external stress-test harnesses manage to exercise it pretty well, but we should integrate those into our CI pipeline).

We also need a better description of the changes; this PR shouldn't have an empty description. The code also needs comments to make it clear that we first iterate through the data pages, then through the records within a page, rolling over to the next page or stopping when we encounter the special negative length. We also need comments on some of the various …
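For context, a stand-in for that kind of iteration harness (hypothetical; the real one is in the gist linked above) simply times a full pass over the map's entry iterator:

```java
import java.util.Iterator;

final class IterationTimerSketch {
  /** Returns the average nanoseconds spent per element in one full scan. */
  static <T> double nanosPerElement(Iterator<T> entries) {
    long count = 0;
    long start = System.nanoTime();
    while (entries.hasNext()) {
      entries.next(); // touch every entry, as a scan-and-copy workload would
      count++;
    }
    long elapsed = System.nanoTime() - start;
    return count == 0 ? 0.0 : (double) elapsed / count;
  }
}
```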
It's fine if you want to continue to work on this patch, but I'd also be happy to just take this over myself, since I don't think it will take me more than 30 minutes tops to implement this with good comments + tests.
I've opened #6159, which cleans up this patch and brings it up to 100% line coverage.
This patch modifies `BytesToBytesMap.iterator()` to iterate through records in the order that they appear in the data pages rather than iterating through the hashtable pointer arrays. This results in fewer random memory accesses, significantly improving performance for scan-and-copy operations. This is possible because our data pages are laid out as sequences of `[keyLength][data][valueLength][data]` entries. In order to mark the end of a partially-filled data page, we write `-1` as a special end-of-page length (BytesToBytesMap supports empty/zero-length keys and values, which is why we had to use a negative length). This patch incorporates / closes #5836.

Author: Josh Rosen <[email protected]>

Closes #6159 from JoshRosen/SPARK-7251 and squashes the following commits:

05bd90a [Josh Rosen] Compare capacity, not size, to MAX_CAPACITY
2a20d71 [Josh Rosen] Fix maximum BytesToBytesMap capacity
bc4854b [Josh Rosen] Guard against overflow when growing BytesToBytesMap
f5feadf [Josh Rosen] Add test for iterating over an empty map
273b842 [Josh Rosen] [SPARK-7251] Perform sequential scan when iterating over entries in BytesToBytesMap

(cherry picked from commit f2faa7a)
Signed-off-by: Josh Rosen <[email protected]>
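To make that layout concrete, here is a minimal sketch of the sequential page scan the commit message describes, using `ByteBuffer` pages as a stand-in for the real off-heap memory pages (all names here are hypothetical):

```java
import java.nio.ByteBuffer;
import java.util.List;

final class PageScanSketch {
  // Special length marking the end of a partially-filled page; it must be
  // negative because zero-length keys and values are legal.
  static final int END_OF_PAGE_LENGTH = -1;

  /** Counts records by scanning pages laid out as [keyLength][data][valueLength][data]. */
  static int countRecords(List<ByteBuffer> dataPages) {
    int numRecords = 0;
    for (ByteBuffer page : dataPages) {        // iterate through data pages in order...
      page.rewind();
      while (page.remaining() >= 4) {          // 4 = size of an int length prefix
        int keyLength = page.getInt();         // ...then through records within a page
        if (keyLength == END_OF_PAGE_LENGTH) {
          break;                               // end-of-page marker: roll over to next page
        }
        page.position(page.position() + keyLength);    // skip the key bytes
        int valueLength = page.getInt();
        page.position(page.position() + valueLength);  // skip the value bytes
        numRecords++;
      }
    }
    return numRecords;
  }
}
```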