[Spark] RDD take() method: overestimate too much #2648

yingjieMiao · 2014-10-03T22:37:09Z

The comment at line 1083 says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

(1.5 * num * partsScanned / buf.size).toInt is a guess of the total number of partitions needed. In every iteration we should therefore only scan the increment, (1.5 * num * partsScanned / buf.size).toInt - partsScanned.
The existing implementation instead grows partsScanned exponentially (roughly: x_{n+1} >= (1.5 + 1) x_n), because the whole estimate is added on top of the partitions already scanned.

This could be a performance problem, unless it is the intended behavior.
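
To make the growth argument concrete, here is a minimal sketch comparing the two update rules. The object and method names are hypothetical and only mirror the expressions quoted above; this is not the actual RDD.take() code.

```scala
object TakeEstimate {
  // Interpolated guess of the TOTAL number of partitions needed, overestimated by 50%.
  def estimateTotal(num: Int, partsScanned: Int, bufSize: Int): Int =
    (1.5 * num * partsScanned / bufSize).toInt

  // Rule described as "existing": the whole estimate is scanned on top of partsScanned.
  def nextOld(num: Int, partsScanned: Int, bufSize: Int): Int =
    partsScanned + estimateTotal(num, partsScanned, bufSize)

  // Proposed rule: scan only the increment, i.e. the estimate minus what was already scanned.
  def nextProposed(num: Int, partsScanned: Int, bufSize: Int): Int =
    partsScanned + (estimateTotal(num, partsScanned, bufSize) - partsScanned)

  def main(args: Array[String]): Unit = {
    // Assumed scenario: 99 of the 100 wanted rows were found in the first 10 partitions.
    println(nextOld(num = 100, partsScanned = 10, bufSize = 99))      // 25, about 2.5x
    println(nextProposed(num = 100, partsScanned = 10, bufSize = 99)) // 15, about 1.5x
  }
}
```

Because buf.size < num while the loop is still running, the estimate is always more than 1.5 * partsScanned, which is where the (1.5 + 1) factor for the existing rule comes from.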

AmplabJenkins · 2014-10-03T22:42:10Z

Can one of the admins verify this patch?

ash211 · 2014-10-05T22:41:13Z

This seems right to me, yingjie. Let's see if the tests pass.

rxin · 2014-10-06T08:00:53Z

Jenkins, test this please.

rxin · 2014-10-06T08:04:20Z

Changes LGTM.

SparkQA · 2014-10-06T08:04:34Z

QA tests have started for PR 2648 at commit a2aa36b.

This patch merges cleanly.

SparkQA · 2014-10-06T08:06:49Z

QA tests have started for PR 2648 at commit a2aa36b.

This patch merges cleanly.

SparkQA · 2014-10-06T10:04:34Z

Tests timed out for PR 2648 at commit a2aa36b after a configured wait of 120m.

AmplabJenkins · 2014-10-06T10:04:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21320/
Test FAILed.

SparkQA · 2014-10-06T10:06:49Z

Tests timed out for PR 2648 at commit a2aa36b after a configured wait of 120m.

rxin · 2014-10-06T17:46:36Z

It seems like this leads to some infinite loop and tests are timing out because of that.

yingjieMiao · 2014-10-06T18:15:58Z

Hmm... (1.5 * num * partsScanned / buf.size).toInt >= partsScanned + 1 holds whenever partsScanned >= 2, but it can fail when partsScanned == 1, which is exactly the state after the first iteration. In that case the increment is 0 and the loop makes no progress. Will fix that.
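
A small sketch of that edge case, with hypothetical names and assumed input values (num = 100, 80 rows already collected). It only evaluates the increment expression from the comment above:

```scala
object IncrementEdgeCase {
  // Increment as proposed in the first revision of the patch (before the fix).
  def increment(num: Int, partsScanned: Int, bufSize: Int): Int =
    (1.5 * num * partsScanned / bufSize).toInt - partsScanned

  def main(args: Array[String]): Unit = {
    // After one iteration: 1.5 * 100 * 1 / 80 = 1.875, truncated to 1, minus 1 scanned.
    println(increment(num = 100, partsScanned = 1, bufSize = 80)) // 0
    // With two partitions scanned: 3.75 truncated to 3, minus 2 scanned.
    println(increment(num = 100, partsScanned = 2, bufSize = 80)) // 1
  }
}
```

An increment of 0 means no new partitions are scanned, so partsScanned never changes, which matches the infinite loop seen in the timed-out test runs.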

yingjieMiao · 2014-10-06T18:26:13Z

@rxin
updated. Please see the comments.

aarondav · 2014-10-06T19:13:20Z

Could you make this change in rdd.py as well? The code should be kept equivalent.

yingjieMiao · 2014-10-06T19:40:31Z

@aarondav

updated rdd.py and AsyncRDDActions.scala

rxin · 2014-10-06T20:04:42Z

Jenkins, retest this please.

SparkQA · 2014-10-06T20:09:33Z

QA tests have started for PR 2648 at commit 1d2c410.

This patch merges cleanly.

SparkQA · 2014-10-06T20:10:32Z

QA tests have finished for PR 2648 at commit 1d2c410.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-06T20:10:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21339/
Test FAILed.

yingjieMiao · 2014-10-06T20:20:58Z

Oops, looks like the Scalastyle check failed: message=File line length exceeds 100 characters line=1087. Will fix that.

yingjieMiao · 2014-10-06T20:37:49Z

@rxin
previous build failed due to code style. Should be fixed. Thank you!

davies · 2014-10-07T18:19:02Z

core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala

@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
          if (results.size == 0) {
            numPartsToTry = totalParts - 1


Could you also change this to partsScanned * 4?

Sure, I can. The comment says: "If we didn't find any rows after the first iteration, just try all partitions next". I had little context about these decisions, but I agree that we should keep the logic equivalent in these methods.

take() was fixed in yingjieMiao@ba5bcad, but AsyncRDDActions was missed in that patch. Thanks for bringing this up.

yingjieMiao · 2014-10-07T19:44:03Z

@davies addressed your comments.

davies · 2014-10-07T19:50:59Z

core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala

          if (results.size == 0) {
-            numPartsToTry = totalParts - 1
+            numPartsToTry = totalParts * 4


totalParts should be partsScanned
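
For context, the loop under review can be sketched locally as follows. This is only an illustration: LocalTake is a hypothetical helper, a Seq of Seqs stands in for the RDD's partitions, and the clamping of the interpolated branch follows the suggestions in this thread rather than the exact merged code.

```scala
import scala.collection.mutable.ArrayBuffer

object LocalTake {
  // Hypothetical stand-in for take(): each inner Seq plays the role of one partition.
  // Returns the collected rows and the number of scan rounds that were needed.
  def take[T](partitions: Seq[Seq[T]], num: Int): (Seq[T], Int) = {
    val buf = ArrayBuffer.empty[T]
    val totalParts = partitions.length
    var partsScanned = 0
    var rounds = 0
    while (buf.size < num && partsScanned < totalParts) {
      // First round: try a single partition.
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.isEmpty) {
          // No rows found so far: grow by a factor of 4 instead of trying everything.
          numPartsToTry = partsScanned * 4
        } else {
          // Interpolate, scan only the increment, keep it >= 1 and capped at 4x.
          val increment = (1.5 * num * partsScanned / buf.size).toInt - partsScanned
          numPartsToTry = math.min(math.max(increment, 1), partsScanned * 4)
        }
      }
      val end = math.min(partsScanned + numPartsToTry, totalParts)
      partitions.slice(partsScanned, end).foreach { p =>
        buf ++= p.take(num - buf.size)
      }
      partsScanned = end
      rounds += 1
    }
    (buf.toSeq, rounds)
  }
}
```

With ten empty partitions followed by Seq(1, 2, 3), LocalTake.take(parts, 2) scans 1, then 4, then the remaining 6 partitions, so it finishes in three rounds.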

davies · 2014-10-07T21:00:05Z

LGTM, thanks!

Jenkins, retest this please.

yingjieMiao · 2014-10-09T17:59:27Z

retest? @davies

SparkQA · 2014-10-09T18:05:55Z

QA tests have started for PR 2648 at commit 4391d3b.

This patch merges cleanly.

SparkQA · 2014-10-09T18:06:59Z

QA tests have finished for PR 2648 at commit 4391d3b.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2014-10-09T18:23:29Z

core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala

-            numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+            // the left side of max is >=1 whenever partsScanned >= 2
+            numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+            numPartsToTry = numPartsToTry min (partsScanned * 4) 


Infix Methods
Don't use infix notation for methods that aren't operators. For example, instead of list map func, use list.map(func), or instead of string contains "foo", use string.contains("foo"). This is to improve familiarity to developers coming from other languages.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

thank you for the pointer. Will fix!
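
As an illustration of the style point, the two lines from the diff can be written with ordinary method calls. This is a sketch with hypothetical names; math.max and math.min are one possible non-infix spelling, not necessarily the one used in the final commit.

```scala
object InfixStyle {
  // Infix form, as in the reviewed diff.
  def infix(num: Int, partsScanned: Int, resultsSize: Int): Int = {
    val n = ((1.5 * num * partsScanned / resultsSize).toInt - partsScanned) max 1
    n min (partsScanned * 4)
  }

  // Same computation using plain method calls, as the style guide asks for.
  def methodCall(num: Int, partsScanned: Int, resultsSize: Int): Int = {
    val n = math.max((1.5 * num * partsScanned / resultsSize).toInt - partsScanned, 1)
    math.min(n, partsScanned * 4)
  }
}
```

Both forms compute the same value; the change is purely stylistic.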

yingjieMiao · 2014-10-09T18:36:36Z

@davies style fixed. thanks!

davies · 2014-10-09T18:55:33Z

It failed the Python style check:

PEP 8 checks failed.
./python/pyspark/rdd.py:1077:21: E265 block comment should start with '# '
./python/pyspark/rdd.py:1078:101: E501 line too long (101 > 100 characters)

yingjieMiao · 2014-10-09T19:45:32Z

@davies fixed. thank you.

AmplabJenkins · 2014-10-09T20:32:43Z

Can one of the admins verify this patch?

SparkQA · 2014-10-09T21:36:10Z

QA tests have started for PR 2648 at commit a8e74bb.

This patch merges cleanly.

SparkQA · 2014-10-09T21:37:13Z

QA tests have finished for PR 2648 at commit a8e74bb.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

pwendell · 2014-10-09T23:56:46Z

Jenkins, test this please.

SparkQA · 2014-10-10T00:00:13Z

QA tests have started for PR 2648 at commit d758218.

This patch merges cleanly.

SparkQA · 2014-10-10T01:08:32Z

QA tests have finished for PR 2648 at commit d758218.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-10T01:08:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21558/
Test PASSed.

yingjieMiao · 2014-10-13T15:53:34Z

@davies OK to merge?

davies · 2014-10-13T17:21:39Z

@yingjieMiao it looks good to me, waiting for other people.

rxin · 2014-10-13T20:11:40Z

Merging in master. Thanks!

rxin · 2014-10-13T20:12:32Z

I just realized we didn't have a jira for this. Let's make sure we create a jira ticket tracking updates. Thanks.

RDD take method: overestimate too much

a2aa36b

handle the edge case after 1 iteration

d31ff7e

also change in rdd.py and AsyncRDD

1d2c410

style fix

c4483dc

davies reviewed Oct 7, 2014

cap numPartsToTry

692f4e6

davies reviewed Oct 7, 2014

typo fix.

4391d3b

davies reviewed Oct 9, 2014

infix operator style fix

4b6e777

python style fix

a8e74bb

scala style fix

d758218

asfgit closed this in 49bbdcb Oct 13, 2014
