chenghao-intel
Contributor

Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported and documented by Hive:
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting jobs from Hive to Spark SQL.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29778 has started for PR 5383 at commit 94ce8d0.

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29778 has finished for PR 5383 at commit 94ce8d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LazyIterator(func: () => TraversableOnce[Row]) extends Iterator[Row]
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29778/

Before:

    child.execute().mapPartitions(iter => iter.flatMap(row => boundGenerator.eval(row)))

After:

    child.execute().mapPartitions(iter =>
      iter.flatMap(row => boundGenerator.eval(row)) ++
        LazyIterator(() => boundGenerator.terminate()))
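The LazyIterator used here can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code (the PR's version takes a `() => TraversableOnce[Row]`; it is simplified to `Seq` here). Because `func` is not invoked until the iterator is first consumed, appending it after the eval output delays terminate() until all input rows in the partition have been processed.

```scala
// Minimal sketch: `func` is not invoked until the iterator is first
// consumed, so appending a LazyIterator after the eval output defers
// the generator's terminate() call until the input rows are exhausted.
class LazyIterator[A](func: () => Seq[A]) extends Iterator[A] {
  private lazy val delegate = func().iterator
  override def hasNext: Boolean = delegate.hasNext
  override def next(): A = delegate.next()
}
```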
Member

It looks like you are calling terminate on each partition. Is that the same as what Hive does? In Hive, it seems that close is called after all rows are processed.

Member

How about the code below to simplify that?

child.execute().mapPartitions(iter =>
    iter.flatMap(row => boundGenerator.eval(row)))
   .mapPartitions(_ ++ boundGenerator.terminate())

Contributor Author

@maropu thanks for the suggestion, I'd like to try that locally. BTW, can you confirm that this PR works for you, as we discussed offline? Just to double-check @viirya's concern.
Thanks.

Member

OK, I'll try it and let you know the result.

Member

Yeah, it works correctly, thanks.

@maropu
Member

maropu commented Apr 9, 2015

I found one issue: the current implementation of HiveGenericUdtf always calls terminate(), but in some cases it never calls initialize(), because of the lazy initialization below.

 protected lazy val outputInspector = function.initialize(inputInspectors.toArray)
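To illustrate the hazard, here is a hypothetical sketch with illustrative names (not Spark's actual classes): if nothing forces the lazy val, initialize() never runs, so an unconditional terminate() call would operate on an uninitialized function. Forcing the lazy val before terminating avoids that.

```scala
// Illustrative stub of a UDTF that must be initialized before terminate().
class StubUdtf {
  var initialized = false
  def initialize(): Unit = { initialized = true }
  def terminate(): Seq[String] = {
    require(initialized, "terminate() called before initialize()")
    Seq("row-from-close")
  }
}

// Illustrative wrapper: initialization happens lazily, so terminate()
// forces the lazy val first, guaranteeing initialize() has run.
class StubWrapper(function: StubUdtf) {
  protected lazy val outputInspector: Unit = function.initialize()
  def terminate(): Seq[String] = {
    outputInspector // force lazy initialization before terminating
    function.terminate()
  }
}
```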

@chenghao-intel
Contributor Author

@maropu thanks for the comments, I've updated the code.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29954 has started for PR 5383 at commit c45faf0.

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29954 has finished for PR 5383 at commit c45faf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class UDTFCollector extends Collector with Serializable
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29954/

@chenghao-intel
Contributor Author

@liancheng @yhuai

}.mapPartitions { iter =>
val nullOriginalRow = Row(Seq.fill(generator.output.size)(Literal(null)): _*)
val joinedRow = new JoinedRow
iter ++ boundGenerator.terminate().map(joinedRow(nullOriginalRow, _))
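The padding step above can be sketched in isolation (a hypothetical simplification: rows are modeled as Seq[Any] pairs instead of Spark's JoinedRow, and the function name is illustrative). Rows emitted by terminate() have no corresponding input row, so the input side is filled with nulls before joining.

```scala
// Illustrative sketch of null-padding for the join = true case:
// terminate() output has no input row to join with, so the input side
// is a row of nulls of the child schema's width.
def padTerminateRows(
    joined: Iterator[(Seq[Any], Seq[Any])], // (input row, generated row)
    terminateOutput: Seq[Seq[Any]],
    inputWidth: Int): Iterator[(Seq[Any], Seq[Any])] = {
  val nullRow = Seq.fill[Any](inputWidth)(null)
  joined ++ terminateOutput.iterator.map(g => (nullRow, g))
}
```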
Contributor

Can we merge this .mapPartitions into the above one?

@liancheng
Contributor

Please also add a test case covering the case where Generator.join is true.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30195 has started for PR 5383 at commit 63c88cc.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30195 has finished for PR 5383 at commit 63c88cc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class UDTFCollector extends Collector with Serializable
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30195/

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30202 has started for PR 5383 at commit d719983.

@SparkQA

SparkQA commented Apr 14, 2015

Test build #30202 has finished for PR 5383 at commit d719983.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30202/

@chenghao-intel
Contributor Author

Thank you @liancheng, I've updated the code and it passes the unit tests.

createQueryTest("Test UDTF.close in Lateral Views",
"""
| SELECT key, cc
| FROM src LATERAL VIEW udtf_count2(value) dd AS cc
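For context, a udtf_count2-style UDTF emits the total row count twice from close(), exercising exactly the close()-generates-rows path this PR supports. Below is a plain-Scala stand-in for that behavior (illustrative only; the real test helper would extend Hive's GenericUDTF and forward rows through a Collector).

```scala
// Illustrative stand-in for a udtf_count2-style UDTF: process() only
// counts rows, and close() emits the final count twice.
class Count2Udtf {
  private var count = 0L
  private val output = scala.collection.mutable.ArrayBuffer.empty[Long]
  def process(row: Any): Unit = count += 1
  def close(): Unit = { output += count; output += count }
  def collected: Seq[Long] = output.toSeq
}
```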
Contributor

Please remove spaces at the beginning of these two lines.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30620 has finished for PR 5383 at commit e1635b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30620/

@@ -21,6 +21,16 @@ import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions._

// for lazy computing, be sure the generator.terminate() called in the very last
// TODO reusing the CompletionIterator?
Contributor

Use ScalaDoc style for class comments.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30845 has started for PR 5383 at commit 8953be3.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30845 has finished for PR 5383 at commit 8953be3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30845/

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30890 has started for PR 5383 at commit 1799ba5.

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30890 has finished for PR 5383 at commit 1799ba5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30890/

@chenghao-intel
Contributor Author

cc @liancheng @marmbrus

@chenghao-intel
Contributor Author

@liancheng @marmbrus Any more comments?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 13, 2015

Test build #32593 has started for PR 5383 at commit 98b4e4b.

@SparkQA

SparkQA commented May 13, 2015

Test build #32593 has finished for PR 5383 at commit 98b4e4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32593/

@liancheng
Contributor

Thanks for working on this! Merging to master and branch-1.4.

asfgit pushed a commit that referenced this pull request May 13, 2015
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported and documented by Hive:
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting jobs from Hive to Spark SQL.

Author: Cheng Hao <[email protected]>

Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close

(cherry picked from commit 0da254f)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 0da254f May 13, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

Author: Cheng Hao <[email protected]>

Closes apache#5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close
@chenghao-intel chenghao-intel deleted the udtf_close branch July 2, 2015 08:44