[SPARK-4233] [SQL] UDAF Interface Refactoring #5542

chenghao-intel · 2015-04-16T18:46:27Z

This PR keeps both the old and new versions of the UDAF interface and switches between them with

SET spark.sql.aggregate2=true/false;

The new interface is

trait AggregateFunction2 {
  self: Product =>

  // Specify the BoundReference for Aggregate Buffer
  def initialize(buffers: Seq[BoundReference]): Unit

  // Initialize (reinitialize) the aggregation buffer
  def reset(buf: MutableRow): Unit

  // Read the children's values from the input row and
  // merge them into the given aggregation buffer.
  // `seen` is the set of values seen so far, which is
  // useful for distinct aggregates; it will probably be
  // null for non-distinct aggregates.
  def update(input: Row, buf: MutableRow, seen: JSet[Any]): Unit

  // Merge two aggregation buffers and write the result back to the latter one
  def merge(value: Row, buf: MutableRow): Unit

  // Semantically we probably don't need this, but it is required when
  // integrating with Hive UDAFs (GenericUDAF)
  @deprecated
  def terminatePartial(buf: MutableRow): Unit = {}

  // Output the final result by feeding the aggregation buffer
  def terminate(buffer: Row): Any
}
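
The lifecycle implied by the trait above (reset, then update per input row, then merge partial buffers, then terminate) can be sketched without Spark. This is only an illustration under stated assumptions, not the PR's implementation: `Array[Any]` stands in for `Row`/`MutableRow`, a Scala mutable set stands in for `JSet`, `initialize` and `terminatePartial` are left out, and the names `SimpleAggregate`, `SumSketch` and `AggregateSketch` are hypothetical. How the PR handles distinct values that appear in more than one partition is not shown in this thread, so the sketch only applies `seen` within a single partition.

```scala
import scala.collection.mutable

// Hypothetical, simplified mirror of AggregateFunction2.
// Array[Any] replaces Row / MutableRow; this is NOT the Spark API.
trait SimpleAggregate {
  def reset(buf: Array[Any]): Unit
  def update(input: Long, buf: Array[Any], seen: mutable.Set[Any]): Unit
  def merge(value: Array[Any], buf: Array[Any]): Unit
  def terminate(buf: Array[Any]): Any
}

// A sum over Long inputs, using a one-slot buffer.
object SumSketch extends SimpleAggregate {
  def reset(buf: Array[Any]): Unit = {
    buf(0) = 0L
  }

  def update(input: Long, buf: Array[Any], seen: mutable.Set[Any]): Unit = {
    // seen == null means non-distinct; otherwise only the first
    // occurrence of a value contributes (Set.add returns false on repeats).
    if (seen == null || seen.add(input)) {
      buf(0) = buf(0).asInstanceOf[Long] + input
    }
  }

  def merge(value: Array[Any], buf: Array[Any]): Unit = {
    // Write the combined result back into the second buffer.
    buf(0) = buf(0).asInstanceOf[Long] + value(0).asInstanceOf[Long]
  }

  def terminate(buf: Array[Any]): Any = buf(0)
}

object AggregateSketch {
  // Aggregate each partition into its own buffer, then merge the partials.
  def run(partitions: Seq[Seq[Long]], distinct: Boolean): Any = {
    val partials = partitions.map { rows =>
      val buf = new Array[Any](1)
      SumSketch.reset(buf)
      val seen: mutable.Set[Any] =
        if (distinct) mutable.HashSet.empty[Any] else null
      rows.foreach(row => SumSketch.update(row, buf, seen))
      buf
    }
    val finalBuf = new Array[Any](1)
    SumSketch.reset(finalBuf)
    partials.foreach(partial => SumSketch.merge(partial, finalBuf))
    SumSketch.terminate(finalBuf)
  }
}
```

For example, `AggregateSketch.run(Seq(Seq(1L, 2L, 2L), Seq(3L)), distinct = false)` sums every row across both partitions, while `AggregateSketch.run(Seq(Seq(1L, 2L, 2L)), distinct = true)` skips the repeated `2L` inside the single partition.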

SparkQA · 2015-04-16T18:48:33Z

Test build #30431 has started for PR 5542 at commit e9017ed.

SparkQA · 2015-04-16T18:50:05Z

Test build #30431 has finished for PR 5542 at commit e9017ed.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression)
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-16T18:50:06Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30431/
Test FAILed.

SparkQA · 2015-04-17T01:48:24Z

Test build #30451 has started for PR 5542 at commit e213e5e.

SparkQA · 2015-04-17T03:40:37Z

Test build #30451 has finished for PR 5542 at commit e213e5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(child: Expression) extends UnaryAggregateExpression
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression) extends UnaryAggregateExpression
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed class BufferAndKey(leftLen: Int, rightLen: Int)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-17T03:40:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30451/
Test PASSed.

SparkQA · 2015-04-21T04:23:38Z

Test build #30629 has started for PR 5542 at commit 4aa56c2.

SparkQA · 2015-04-21T04:25:20Z

Test build #30629 has finished for PR 5542 at commit 4aa56c2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(child: Expression) extends UnaryAggregateExpression
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression) extends UnaryAggregateExpression
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed class BufferAndKey(leftLen: Int, rightLen: Int)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-21T04:25:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30629/
Test FAILed.

SparkQA · 2015-04-21T05:33:34Z

Test build #30637 has started for PR 5542 at commit b45f487.

SparkQA · 2015-04-21T05:51:37Z

Test build #30637 has finished for PR 5542 at commit b45f487.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(child: Expression) extends UnaryAggregateExpression
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression) extends UnaryAggregateExpression
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed class BufferAndKey(leftLen: Int, rightLen: Int)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-21T05:51:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30637/
Test FAILed.

SparkQA · 2015-04-21T06:28:40Z

Test build #30644 has started for PR 5542 at commit 9806266.

SparkQA · 2015-04-21T07:59:06Z

Test build #30644 has finished for PR 5542 at commit 9806266.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(child: Expression) extends UnaryAggregateExpression
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression) extends UnaryAggregateExpression
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed class BufferAndKey(leftLen: Int, rightLen: Int)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-21T07:59:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30644/
Test FAILed.

rxin · 2015-04-23T08:47:02Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -562,3 +563,13 @@ class SQLQuerySuite extends QueryTest {
      .queryExecution.analyzed
  }
 }
+
+class SQLQuerySuite2 extends SQLQuerySuite with BeforeAndAfter {


You should give this a more explicit name, maybe "SQLQueryNewUDAFSuite".

SparkQA · 2015-04-24T02:07:50Z

Test build #30901 has started for PR 5542 at commit 71f1bd5.

SparkQA · 2015-04-24T03:31:14Z

Test build #30901 has finished for PR 5542 at commit 71f1bd5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

AmplabJenkins · 2015-04-24T03:31:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30901/
Test FAILed.

SparkQA · 2015-04-24T07:33:42Z

Test build #30921 has started for PR 5542 at commit 6b594f0.

SparkQA · 2015-04-24T09:34:20Z

Test build #30921 has finished for PR 5542 at commit 6b594f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait AggregateFunction2
- trait AggregateExpression2 extends Expression with AggregateFunction2
- abstract class UnaryAggregateExpression extends UnaryExpression with AggregateExpression2
- case class Min(child: Expression) extends UnaryAggregateExpression
- case class Average(child: Expression, distinct: Boolean = false)
- case class Max(child: Expression) extends UnaryAggregateExpression
- case class Count(child: Expression)
- case class CountDistinct(children: Seq[Expression])
- case class Sum(child: Expression, distinct: Boolean = false)
- case class First(child: Expression, distinct: Boolean = false)
- case class Last(child: Expression, distinct: Boolean = false)
- class AggregateExpressionSubsitution
- class HashAggregation2(aggrSubsitution: AggregateExpressionSubsitution) extends Strategy
- sealed class BufferSeens(var buffer: MutableRow, var seens: Array[JSet[Any]] = null)
- sealed class BufferAndKey(leftLen: Int, rightLen: Int)
- sealed trait Aggregate
- sealed trait PostShuffle extends Aggregate
- case class AggregatePreShuffle(
- case class AggregatePostShuffle(
- case class DistinctAggregate(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-24T09:34:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30921/
Test PASSed.

SparkQA · 2015-04-27T18:19:16Z

Test build #31007 has started for PR 5542 at commit 6b594f0.

tiffanyTown · 2015-05-14T03:02:32Z

I found an issue when running the query below with SET spark.sql.aggregate2=true after applying this patch.
Error message:
15/05/07 17:11:14 WARN TaskSetManager: Lost task 15.0 in stage 101.0 (TID 2056, qac8-node2): java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at scala.math.Numeric$LongIsIntegral$.toInt(Numeric.scala:117)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$5.apply(Cast.scala:274)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$5.apply(Cast.scala:274)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:435)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:101)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:83)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:83)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
query.sql file:
INSERT INTO TABLE ${hiveconf:TEMP_TABLE}
SELECT
cid,
100.0 * COUNT(distinct (CASE WHEN r_date IS NOT NULL THEN oid ELSE 0L END)) / COUNT(distinct oid) AS r_order_ratio,
SUM(CASE WHEN r_date IS NOT NULL THEN 1 ELSE 0 END) / COUNT(item) * 100 AS r_item_ratio,
CASE WHEN SUM(s_amount)=0.0 THEN 0.0 ELSE (SUM(CASE WHEN r_date IS NOT NULL THEN r_amount ELSE 0.0 END) / SUM(s_amount) * 100) END AS r_amount_ratio,
COUNT(distinct (CASE WHEN r_date IS NOT NULL THEN r_date ELSE 0L END)) AS r_freq
FROM (
SELECT
r.sr_returned_date_sk AS r_date,
s.ss_item_sk AS item,
s.ss_ticket_number AS oid,
s.ss_net_paid AS s_amount,
CASE WHEN r.sr_return_amt IS NULL THEN 0.0 ELSE r.sr_return_amt END AS r_amount,
(CASE WHEN s.ss_customer_sk IS NULL THEN r.sr_customer_sk ELSE s.ss_customer_sk END) AS cid
FROM store_sales s
LEFT OUTER JOIN store_returns r ON (
r.sr_item_sk = s.ss_item_sk
AND r.sr_ticket_number = s.ss_ticket_number
AND s.ss_sold_date_sk IS NOT NULL
)
) q20_sales_returns

WHERE cid IS NOT NULL
GROUP BY cid
;
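
The top two frames of the stack trace above (`unboxToLong` called from the `Numeric` instance for `Long`) can be reproduced outside Spark. The sketch below is only an isolated illustration of that failure mode, assuming a value that is treated as a `Long` but holds a boxed `Double` at runtime. It is not a diagnosis of where the patch introduces the mismatch, and the name `CastMismatchSketch` is hypothetical.

```scala
import scala.util.Try

// Hypothetical, Spark-free illustration; nothing here comes from the patch.
object CastMismatchSketch {
  // Mirrors the shape of the failing frames: treat an untyped value as a
  // Long, then convert it with the Numeric[Long] instance.
  def toIntAssumingLong(value: Any): Try[Int] = {
    val numeric = implicitly[Numeric[Long]]
    // asInstanceOf[Long] unboxes the value and throws a
    // ClassCastException when the runtime object is a java.lang.Double.
    Try(numeric.toInt(value.asInstanceOf[Long]))
  }
}
```

Calling `CastMismatchSketch.toIntAssumingLong(42L)` succeeds, while `CastMismatchSketch.toIntAssumingLong(1.5d)` yields a `Failure` wrapping a `ClassCastException`, the same exception type reported in the trace.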

AmplabJenkins · 2015-06-08T13:22:11Z

Merged build triggered.

AmplabJenkins · 2015-06-08T13:22:22Z

Merged build started.

SparkQA · 2015-06-08T13:23:31Z

Test build #34435 has started for PR 5542 at commit 68dd625.

SparkQA · 2015-06-08T13:37:42Z

Test build #34435 has finished for PR 5542 at commit 68dd625.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-06-08T13:37:46Z

Merged build finished. Test FAILed.

adamv · 2015-06-08T15:49:08Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java

+      return mr;
+  }
+
+    @Override


Indentation on this line looks off

rxin · 2015-07-07T05:55:50Z

Can we close this ticket first? I think @yhuai will revisit this with you soon.

chenghao-intel · 2015-07-07T07:03:11Z

Yes, thanks for the reminder. Closing it.

chenghao-intel mentioned this pull request Apr 21, 2015

[SPARK-4233] [SQL] WIP:Simplify the UDAF API (Interface) #3247

Closed

3 tasks

chenghao-intel force-pushed the udaf_refactor branch from e213e5e to 4aa56c2 Compare April 21, 2015 04:18

chenghao-intel force-pushed the udaf_refactor branch from 9806266 to 71f1bd5 Compare April 24, 2015 02:03

chenghao-intel force-pushed the udaf_refactor branch from 71f1bd5 to 6b594f0 Compare April 24, 2015 07:29

chenghao-intel changed the title ~~[SPARK-4233] [SQL] [WIP] UDAF Interface Refactoring~~ [SPARK-4233] [SQL] UDAF Interface Refactoring Apr 28, 2015

yhuai mentioned this pull request May 1, 2015

[SPARK-1442][SQL] Window Function Support for Spark SQL #5604

Closed

chenghao-intel added 20 commits June 7, 2015 23:48

migrate to support both version of UDAF

440b689

Update the unit test to comment out the not support ones

7fb0662

update the interface name

f118ffc

change the update method from Any to Row

f0b9ec0

move the distinct into the udaf

bee0f95

simpify the aggregate expression by uing the Projection

760164e

revert the uncessary changes

0849ca3

Add Unit test

472a440

Add some doc

241aee1

style issues

de96a13

more style issues

483b381

fix bug in the for unit test

58b1481

use BufferAndKey class manully maitain the MutableRow

39a6243

fix bug of with BufferAndKeys

5b01518

Add golden files

feac4d0

enable more unit test

ec7deaa

disable the codegen for aggregate2 in unit test

393b0d1

Add more unit test

021431f

rebase to the latest master

8ad5fc5

rebase again

68dd625

chenghao-intel force-pushed the udaf_refactor branch from f0f907f to 68dd625 Compare June 8, 2015 13:19

chenghao-intel closed this Jul 7, 2015
