[SPARK-6040][SQL] Fix the percent bug in tablesample #4789

Closed
wants to merge 1 commit into from

Conversation

watermen
Contributor

A HiveQL expression such as `select count(1) from src tablesample(1 percent);` should select from a 1% sample, but the current version of Spark samples 100% instead.

@AmplabJenkins

Can one of the admins verify this patch?

@yhuai
Contributor

yhuai commented Feb 26, 2015

test this please

@SparkQA

SparkQA commented Feb 26, 2015

Test build #28014 has started for PR 4789 at commit 92cbc4a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 26, 2015

Test build #28014 has finished for PR 4789 at commit 92cbc4a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28014/

@scwf
Contributor

scwf commented Feb 27, 2015

test this please

@watermen
Contributor Author

watermen commented Mar 2, 2015

@yhuai Can you trigger the test for me?

@chenghao-intel
Contributor

retest this please

@yhuai
Contributor

yhuai commented Mar 2, 2015

ok to test

@SparkQA

SparkQA commented Mar 2, 2015

Test build #28153 has started for PR 4789 at commit 01abcce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 2, 2015

Test build #28153 has finished for PR 4789 at commit 01abcce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28153/

@@ -850,7 +851,14 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
case Token("TOK_TABLESPLITSAMPLE",
Token("TOK_PERCENT", Nil) ::
Token(fraction, Nil) :: Nil) =>
Sample(fraction.toDouble, withReplacement = false, (math.random * 1000).toInt, relation)
// RDD's sample function is on interval [0, 1], but HiveQL's tablesample is on
// interval [0, 100]. So we need to judge the interval individually.
Contributor

How about "The range of fraction accepted by Sample is [0, 1]. Because Hive's block sampling function takes X PERCENT as the input and the range of X is [0, 100], we need to adjust the fraction."?
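
The adjustment described in that comment can be sketched as follows. This is only an illustration under stated assumptions, not the code merged in this PR: the object name `TableSamplePercent` and the method `percentToFraction` are hypothetical, and rejecting out-of-range values with `require` is one possible policy among several.

```scala
// Hypothetical helper for illustration; not taken from the Spark code base.
object TableSamplePercent {

  /** Converts a HiveQL TABLESAMPLE percentage in [0, 100] to a fraction in [0, 1]. */
  def percentToFraction(percent: String): Double = {
    val p = percent.toDouble
    // Assumption: values outside [0, 100] are rejected rather than clamped.
    require(p >= 0.0 && p <= 100.0,
      s"Sampling percentage $p must be in the interval [0, 100]")
    p / 100.0
  }
}
```

With this sketch, `TableSamplePercent.percentToFraction("100")` yields `1.0`, while an input such as `"150"` fails with an `IllegalArgumentException` instead of being passed on as an invalid fraction.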

Contributor Author

@yhuai Thanks for your help; I have made the change.

@SparkQA

SparkQA commented Mar 2, 2015

Test build #28160 has started for PR 4789 at commit 2453ebe.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 2, 2015

Test build #28160 has finished for PR 4789 at commit 2453ebe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28160/

@@ -467,6 +467,7 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter {

test("sampling") {
sql("SELECT * FROM src TABLESAMPLE(0.1 PERCENT) s")
sql("SELECT * FROM src TABLESAMPLE(100 PERCENT) s")
Contributor

I'm going to go ahead and merge this since it changes semantics and we are close to the release where we remove the alpha tag, but it would be great if you could add a test that actually checks to make sure sampling is happening and we are getting something close to the expected number of results.
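
One way such a check could look is sketched below. It uses plain Scala collections as a stand-in for a table, so it does not exercise Spark or `TABLESAMPLE` itself; the names `SamplingCheck`, `bernoulliSample`, and `closeToExpected` are hypothetical, and the five-standard-deviation tolerance is an assumed choice.

```scala
import scala.util.Random

// Hypothetical stand-in for illustration; a real test would run against Spark SQL.
object SamplingCheck {

  /** Keeps each row independently with probability `fraction` (Bernoulli sampling). */
  def bernoulliSample[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  /** True if `observed` is within `sigmas` binomial standard deviations of the mean. */
  def closeToExpected(observed: Int, total: Int, fraction: Double,
      sigmas: Double = 5.0): Boolean = {
    val expected = total * fraction
    val stdDev = math.sqrt(total * fraction * (1.0 - fraction))
    math.abs(observed - expected) <= sigmas * stdDev
  }
}
```

A check of this shape would flag the original bug: if a 1 percent sample of 100000 rows returned all 100000 rows, `SamplingCheck.closeToExpected(100000, 100000, 0.01)` would be `false`.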

asfgit pushed a commit that referenced this pull request Mar 2, 2015
A HiveQL expression such as `select count(1) from src tablesample(1 percent);` should select from a 1% sample, but the current version of Spark samples 100% instead.

Author: q00251598 <[email protected]>

Closes #4789 from watermen/SPARK-6040 and squashes the following commits:

2453ebe [q00251598] check and adjust the fraction.

(cherry picked from commit 582e5a2)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 582e5a2 Mar 2, 2015

7 participants