
Conversation

yingjieMiao
Contributor

The comment at line 1083 says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

(1.5 * num * partsScanned / buf.size).toInt is an estimate of the total number of partitions needed. On each iteration, the increment should therefore be (1.5 * num * partsScanned / buf.size).toInt - partsScanned.
The existing implementation grows partsScanned roughly exponentially (x_{n+1} >= (1.5 + 1) x_n), because the whole estimate is added on top of what has already been scanned.

This could be a performance problem (unless it is the intended behavior).
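
To make the difference concrete, here is a minimal standalone sketch (not the Spark source) comparing the two update rules. The data layout and numbers (100 rows requested, a nearly empty first partition followed by dense ones) are hypothetical, chosen only to make the growth visible; the cap at the total partition count is omitted for brevity:

    object TakeGrowthSketch {
      val num = 100  // rows requested
      // Hypothetical skew: the first partition holds 1 row, every later one holds 1000.
      def rowsIn(partition: Int): Int = if (partition == 0) 1 else 1000

      // Returns how many partitions end up scanned before `num` rows are collected.
      def simulate(capped: Boolean): Int = {
        var partsScanned = 1          // the first partition has been scanned
        var bufSize = rowsIn(0)       // rows collected so far
        while (bufSize < num) {
          val estimatedTotal = (1.5 * num * partsScanned / bufSize).toInt
          val numPartsToTry =
            if (capped) {
              // proposed: scan only the estimated shortfall, at least 1 partition,
              // at most 4x what has been scanned already
              math.min(math.max(estimatedTotal - partsScanned, 1), partsScanned * 4)
            } else {
              // existing: the whole "total needed" estimate is added as an increment
              estimatedTotal
            }
          // every partition after the first is dense, so each new one adds rowsIn(1) rows
          bufSize += numPartsToTry * rowsIn(1)
          partsScanned += numPartsToTry
        }
        partsScanned
      }

      def main(args: Array[String]): Unit = {
        println(s"existing rule scans ${simulate(capped = false)} partitions")  // 151
        println(s"proposed rule scans ${simulate(capped = true)} partitions")   // 5
      }
    }

With these numbers the existing rule jumps from 1 scanned partition to 151 in a single step, while the incremental, capped rule stops after 5.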

@AmplabJenkins

Can one of the admins verify this patch?

@ash211
Contributor

ash211 commented Oct 5, 2014

This seems right to me, yingjie. Let's see if the tests work.

@rxin
Contributor

rxin commented Oct 6, 2014

Jenkins, test this please.

@rxin
Contributor

rxin commented Oct 6, 2014

Changes LGTM.

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2648 at commit a2aa36b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 6, 2014

Tests timed out for PR 2648 at commit a2aa36b after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21320/

@rxin
Contributor

rxin commented Oct 6, 2014

It seems like this leads to an infinite loop, and the tests are timing out because of that.

@yingjieMiao
Contributor Author

Hmm... (1.5 * num * partsScanned / buf.size).toInt >= partsScanned + 1 holds whenever partsScanned >= 2, but it can fail when partsScanned == 1, i.e., right after the first iteration; the increment then drops to zero and the loop never advances. Will fix that.
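
A small sketch with hypothetical numbers shows how the estimate can collapse to partsScanned right after the first iteration:

    // Hypothetical: 10 rows requested, and the first scanned partition already yielded 8.
    val num = 10
    val bufSize = 8
    val partsScanned = 1
    val estimate = (1.5 * num * partsScanned / bufSize).toInt  // (1.875).toInt == 1
    val increment = estimate - partsScanned                    // 0: partsScanned never advances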

@yingjieMiao
Contributor Author

@rxin
updated. Please see the comments.

@aarondav
Contributor

aarondav commented Oct 6, 2014

Could you make this change in rdd.py as well? The code should be kept equivalent.

@yingjieMiao
Contributor Author

@aarondav

Updated rdd.py and AsyncRDDActions.scala.

@rxin
Contributor

rxin commented Oct 6, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2648 at commit 1d2c410.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2648 at commit 1d2c410.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21339/

@yingjieMiao
Contributor Author

Oops, it looks like the Scalastyle check failed: "File line length exceeds 100 characters" at line 1087. Will fix that.

@yingjieMiao
Contributor Author

@rxin
The previous build failed due to code style; it should be fixed now. Thank you!

@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Logging {
      if (results.size == 0) {
        numPartsToTry = totalParts - 1

Contributor

Could you also change this to partsScanned * 4?

Contributor Author

Sure, I can. The comment says: "If we didn't find any rows after the first iteration, just try all partitions next." I had little context on that decision, but I agree that we should keep the logic equivalent across these methods.

Contributor

take() was fixed in yingjieMiao@ba5bcad, but AsyncRDDActions was missed in that patch. Thanks for bringing this up.
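
For context, a rough sketch (not the exact patch) of the rule the AsyncRDDActions branch ends up following once it mirrors the take() fix; the helper name and signature here are illustrative only:

    // Next number of partitions to scan, given how many were scanned and how many
    // rows have been collected so far (sketch only).
    def nextPartsToTry(partsScanned: Int, resultsSize: Int, num: Int): Int = {
      if (partsScanned == 0) {
        1
      } else if (resultsSize == 0) {
        // no rows found yet: quadruple the scan instead of jumping to all partitions
        partsScanned * 4
      } else {
        // interpolate the total needed, scan only the shortfall, cap growth at 4x
        val estimatedTotal = (1.5 * num * partsScanned / resultsSize).toInt
        math.min(math.max(estimatedTotal - partsScanned, 1), partsScanned * 4)
      }
    }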

@yingjieMiao
Contributor Author

@davies I've addressed your comments.

      if (results.size == 0) {
-       numPartsToTry = totalParts - 1
+       numPartsToTry = totalParts * 4

Contributor

totalParts should be partsScanned

@davies
Contributor

davies commented Oct 7, 2014

LGTM, thanks!

Jenkins, retest this please.

@yingjieMiao
Contributor Author

retest? @davies

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2648 at commit 4391d3b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2648 at commit 4391d3b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-       numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+       // the left side of max is >=1 whenever partsScanned >= 2
+       numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+       numPartsToTry = numPartsToTry min (partsScanned * 4)

Contributor

Infix Methods
Don't use infix notation for methods that aren't operators. For example, instead of list map func, use list.map(func), or instead of string contains "foo", use string.contains("foo"). This is to improve familiarity to developers coming from other languages.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
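
Applied to the two lines in the diff above, the dot-notation form would read roughly as follows (a fragment, not the final patch; names come from the diff):

    // Same clamping as above, written with dot notation per the style guide.
    numPartsToTry = math.max((1.5 * num * partsScanned / results.size).toInt - partsScanned, 1)
    numPartsToTry = math.min(numPartsToTry, partsScanned * 4)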

Contributor Author

Thank you for the pointer. Will fix!

@yingjieMiao
Contributor Author

@davies Style fixed, thanks!

@davies
Contributor

davies commented Oct 9, 2014

It failed the Python style check:

PEP 8 checks failed.
./python/pyspark/rdd.py:1077:21: E265 block comment should start with '# '
./python/pyspark/rdd.py:1078:101: E501 line too long (101 > 100 characters)

@yingjieMiao
Contributor Author

@davies Fixed, thank you.

@AmplabJenkins

Can one of the admins verify this patch?

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2648 at commit a8e74bb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2648 at commit a8e74bb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

pwendell commented Oct 9, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have started for PR 2648 at commit d758218.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2648 at commit d758218.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21558/

@yingjieMiao
Contributor Author

@davies OK to merge?

@davies
Contributor

davies commented Oct 13, 2014

@yingjieMiao It looks good to me; waiting for other reviewers.

@rxin
Contributor

rxin commented Oct 13, 2014

Merging in master. Thanks!

@rxin
Contributor

rxin commented Oct 13, 2014

I just realized we didn't have a JIRA for this. Let's make sure we create a JIRA ticket to track this change. Thanks.

asfgit closed this in 49bbdcb on Oct 13, 2014