chenghao-intel
Contributor

Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported and documented by Hive:
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting jobs from Hive to Spark SQL.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29778 has started for PR 5383 at commit 94ce8d0.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29778 has finished for PR 5383 at commit 94ce8d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LazyIterator(func: () => TraversableOnce[Row]) extends Iterator[Row]
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29778/

Before:

    child.execute().mapPartitions(iter => iter.flatMap(row => boundGenerator.eval(row)))

After:

    child.execute().mapPartitions(iter =>
      iter.flatMap(row => boundGenerator.eval(row)) ++
        LazyIterator(() => boundGenerator.terminate()))
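The LazyIterator used here can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code (the PR's version takes a `() => TraversableOnce[Row]`; it is simplified to `Seq` here). Because `func` is not invoked until the iterator is first consumed, appending it after the eval output delays terminate() until all input rows in the partition have been processed.

```scala
// Minimal sketch: `func` is not invoked until the iterator is first
// consumed, so appending a LazyIterator after the eval output defers
// the generator's terminate() call until the input rows are exhausted.
class LazyIterator[A](func: () => Seq[A]) extends Iterator[A] {
  private lazy val delegate = func().iterator
  override def hasNext: Boolean = delegate.hasNext
  override def next(): A = delegate.next()
}
```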
Member

It looks like you are calling terminate on each partition. Is that the same as what Hive does? In Hive, it seems that close is called after all rows are processed.

Member

How about the code below to simplify that?

child.execute().mapPartitions(iter =>
    iter.flatMap(row => boundGenerator.eval(row)))
   .mapPartitions(_ ++ boundGenerator.terminate())

Contributor Author

@maropu thanks for the suggestion, I'd like to try that locally. BTW, can you confirm that this PR works for you, as we discussed offline? Just to double-check @viirya's concern.
Thanks.

Member

OK, I'll try it and let you know the result.

Member

Yeah, it works correctly, thanks.

@maropu
Member

maropu commented Apr 9, 2015

I found one issue: the current implementation of HiveGenericUdtf always calls terminate(), but in some cases it never calls initialize(), because of the lazy initialization below.

 protected lazy val outputInspector = function.initialize(inputInspectors.toArray)
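To illustrate the hazard, here is a hypothetical sketch with illustrative names (not Spark's actual classes): if nothing forces the lazy val, initialize() never runs, so an unconditional terminate() call would operate on an uninitialized function. Forcing the lazy val before terminating avoids that.

```scala
// Illustrative stub of a UDTF that must be initialized before terminate().
class StubUdtf {
  var initialized = false
  def initialize(): Unit = { initialized = true }
  def terminate(): Seq[String] = {
    require(initialized, "terminate() called before initialize()")
    Seq("row-from-close")
  }
}

// Illustrative wrapper: initialization happens lazily, so terminate()
// forces the lazy val first, guaranteeing initialize() has run.
class StubWrapper(function: StubUdtf) {
  protected lazy val outputInspector: Unit = function.initialize()
  def terminate(): Seq[String] = {
    outputInspector // force lazy initialization before terminating
    function.terminate()
  }
}
```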

@chenghao-intel
Contributor Author

@maropu thanks for the comments, I've updated the code.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29954 has started for PR 5383 at commit c45faf0.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29954 has finished for PR 5383 at commit c45faf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class UDTFCollector extends Collector with Serializable
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29954/

@chenghao-intel
Contributor Author

@liancheng @yhuai

}.mapPartitions { iter =>
val nullOriginalRow = Row(Seq.fill(generator.output.size)(Literal(null)): _*)
val joinedRow = new JoinedRow
iter ++ boundGenerator.terminate().map(joinedRow(nullOriginalRow, _))
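The padding step above can be sketched in isolation (a hypothetical simplification: rows are modeled as Seq[Any] pairs instead of Spark's JoinedRow, and the function name is illustrative). Rows emitted by terminate() have no corresponding input row, so the input side is filled with nulls before joining.

```scala
// Illustrative sketch of null-padding for the join = true case:
// terminate() output has no input row to join with, so the input side
// is a row of nulls of the child schema's width.
def padTerminateRows(
    joined: Iterator[(Seq[Any], Seq[Any])], // (input row, generated row)
    terminateOutput: Seq[Seq[Any]],
    inputWidth: Int): Iterator[(Seq[Any], Seq[Any])] = {
  val nullRow = Seq.fill[Any](inputWidth)(null)
  joined ++ terminateOutput.iterator.map(g => (nullRow, g))
}
```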
Contributor

Can we merge this .mapPartitions into the above one?

@liancheng
Contributor

Please also add a test case covering the case where Generator.join is true.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30195 has started for PR 5383 at commit 63c88cc.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30195 has finished for PR 5383 at commit 63c88cc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class UDTFCollector extends Collector with Serializable
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30195/

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30202 has started for PR 5383 at commit d719983.

@SparkQA

SparkQA commented Apr 14, 2015

Test build #30202 has finished for PR 5383 at commit d719983.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30202/

@chenghao-intel
Contributor Author

Thank you @liancheng, I've updated the code and it passes the unit tests.

createQueryTest("Test UDTF.close in Lateral Views",
"""
| SELECT key, cc
| FROM src LATERAL VIEW udtf_count2(value) dd AS cc
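For context, a udtf_count2-style UDTF emits the total row count twice from close(), exercising exactly the close()-generates-rows path this PR supports. Below is a plain-Scala stand-in for that behavior (illustrative only; the real test helper would extend Hive's GenericUDTF and forward rows through a Collector).

```scala
// Illustrative stand-in for a udtf_count2-style UDTF: process() only
// counts rows, and close() emits the final count twice.
class Count2Udtf {
  private var count = 0L
  private val output = scala.collection.mutable.ArrayBuffer.empty[Long]
  def process(row: Any): Unit = count += 1
  def close(): Unit = { output += count; output += count }
  def collected: Seq[Long] = output.toSeq
}
```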
Contributor

Please remove spaces at the beginning of these two lines.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30620 has finished for PR 5383 at commit e1635b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30620/

@@ -21,6 +21,16 @@ import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions._

// for lazy computing, be sure the generator.terminate() called in the very last
// TODO reusing the CompletionIterator?
Contributor

Use ScalaDoc style for class comments.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30845 has started for PR 5383 at commit 8953be3.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30845 has finished for PR 5383 at commit 8953be3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30845/

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30890 has started for PR 5383 at commit 1799ba5.

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30890 has finished for PR 5383 at commit 1799ba5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30890/

@chenghao-intel
Contributor Author

cc @liancheng @marmbrus

@chenghao-intel
Contributor Author

@liancheng @marmbrus Any more comments?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 13, 2015

Test build #32593 has started for PR 5383 at commit 98b4e4b.

@SparkQA

SparkQA commented May 13, 2015

Test build #32593 has finished for PR 5383 at commit 98b4e4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32593/

@liancheng
Contributor

Thanks for working on this! Merging to master and branch-1.4.

asfgit pushed a commit that referenced this pull request May 13, 2015
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported and documented by Hive:
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting jobs from Hive to Spark SQL.

Author: Cheng Hao <[email protected]>

Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close

(cherry picked from commit 0da254f)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 0da254f May 13, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
@chenghao-intel chenghao-intel deleted the udtf_close branch July 2, 2015 08:44