[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes #18421
Conversation
ok to test
add to whitelist
cc @wzhfy
Test build #78663 has finished for PR 18421 at commit
Please update the PR title.
val table = visitTableIdentifier(ctx.tableIdentifier)
if (ctx.identifierSeq() == null) {
  AnalyzeTableCommand(table, noscan, partitionSpec)
AnalyzeTableCommand(table, noscan = ctx.identifier != null, partitionSpec)
  true
} else {
  false
}
if (ctx.identifier != null &&
ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx)
}
} else {
  None
}
val partitionSpec = Option(ctx.partitionSpec).map(visitNonOptionalPartitionSpec)
import org.apache.spark.sql.internal.SessionState
 /**
- * Analyzes the given table to generate statistics, which will be used in query optimizations.
+ * Analyzes the given table or partition to generate statistics, which will be used in
+ * query optimizations.
Could you please add a description of partitionSpec?
If certain partition specs are specified, then statistics are gathered for only those partitions.
- AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
+ AnalyzeTableCommand(TableIdentifier("t"), noscan = true,
+   partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"))))
  intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
This should be legal based on Hive's documentation, shouldn't it?
I see. Thanks for pointing that out. Currently, this PR supports only exact partition specs; it doesn't support the partial partition specs described in the document above. My preference would be to keep this PR simple by supporting only exact specs, and to add support for partial specs in a follow-up PR. What do you think?
I am fine with this. Please update your PR description and the class descriptions of visitAnalyze and AnalyzeTableCommand.
If we first support only exact partition specs, then in the near future we will have to change (replace) the syntax and the related implementation. In other words, we wouldn't be doing this incrementally, so I'd prefer to support the right syntax in this single PR. Would supporting that syntax require a lot of work?
@wzhfy, I expect the syntax change to be quite small and incremental. Currently, it is necessary to specify all partition columns along with their values. The change would be to allow only a subset of partition columns and to allow partition columns without values.
PARTITION (partcol1=val1,...) -> PARTITION (partcol1[=val1],...)
That said, I want to try to allow partial partition specs in this PR. Let me spend some time on it and report back my findings.
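For concreteness, a minimal sketch of the three forms the broadened syntax would accept, assuming a table t partitioned by (ds, hr) (illustrative only, not taken from this PR's tests):

spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS") // exact spec
spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09') COMPUTE STATISTICS")        // subset of columns
spark.sql("ANALYZE TABLE t PARTITION (ds, hr=11) COMPUTE STATISTICS")              // column without a value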
Force-pushed from f9327b0 to 090be78
@@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
 * {{{
 * ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
 * }}}
 * Example SQL for analyzing a single partition :
 * {{{
 * ANALYZE TABLE table PARTITION (key=value,..) COMPUTE STATISTICS [NOSCAN];
The existing syntax description is not very clear. Could you improve it?
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS [NOSCAN]
In addition, since we have a restriction, please do not call visitNonOptionalPartitionSpec. Instead, we can capture the unset partition column and issue a more user-friendly exception message.
It looks like visitNonOptionalPartitionSpec already detects unset partition columns and throws an exception: Found an empty partition key '$key'. Is this sufficient, or do you have something else in mind?
/**
 * Create a partition specification map without optional values.
 */
protected def visitNonOptionalPartitionSpec(
    ctx: PartitionSpecContext): Map[String, String] = withOrigin(ctx) {
  visitPartitionSpec(ctx).map {
    case (key, None) => throw new ParseException(s"Found an empty partition key '$key'.", ctx)
    case (key, Some(value)) => key -> value
  }
}
@@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
 * {{{
 * ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
Also here.
Test build #78750 has finished for PR 18421 at commit
Test build #78749 has finished for PR 18421 at commit
Test build #78809 has finished for PR 18421 at commit
Force-pushed from 2d1e817 to 70b35ce
@wzhfy, @gatorsmile, I updated this PR to support partial partition specs where values are defined only for a subset of partition columns. For example, if a table has 2 partition columns, ds and hr, both PARTITION (ds='2010-01-01', hr=10) and PARTITION (hr=10) specs are valid. I did not add support for partition specs where a partition column is mentioned without a value, e.g. PARTITION (ds, hr=10) is still not allowed. I didn't implement this because I don't understand why this syntax is useful: my understanding is that (ds, hr=10) is equivalent to (hr=10). Do you think it is important to implement this syntax? If so, would you help me understand why? Is this because we want to support HQL at the same level as Hive, to allow a seamless transition from Hive to Spark?
Test build #78923 has finished for PR 18421 at commit
@wzhfy, @gatorsmile, I updated the PR to add full support for partial partition specs. This version supports a subset of partition columns with or without values specified. I'm going away for the long weekend tomorrow and will be back next Wednesday (July 5). I may not be able to respond to comments or provide updates to this PR until then.
Test build #78927 has finished for PR 18421 at commit
Test build #78929 has finished for PR 18421 at commit
Test build #78930 has finished for PR 18421 at commit
if (filteredSpec.isEmpty) {
  None
} else {
  Some(filteredSpec.mapValues(v => v.get))
mapValues(_.get)
val partitionSpec =
  if (ctx.partitionSpec != null) {
    val filteredSpec = visitPartitionSpec(ctx.partitionSpec).filter(x => x._2.isDefined)
_._2.isDefined
  */
 case class AnalyzeTableCommand(
     tableIdent: TableIdentifier,
-    noscan: Boolean = true) extends RunnableCommand {
+    noscan: Boolean = true,
+    partitionSpec: Option[TablePartitionSpec] = None) extends RunnableCommand {
How about adding a new AnalyzePartitionCommand? We can put all partition-level logic there (including partition-level column stats in the future). I think that would make the logic clearer.
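A minimal sketch of what that could look like; the class name comes from this comment, while the constructor shape and body are assumptions rather than this PR's actual implementation:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.command.RunnableCommand

case class AnalyzePartitionCommand(
    tableIdent: TableIdentifier,
    partitionSpec: Map[String, Option[String]],  // hypothetical shape
    noscan: Boolean = true) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Resolve the partitions matching partitionSpec, compute sizeInBytes
    // (and rowCount unless noscan is set), then persist the new stats
    // through the session catalog.
    Seq.empty[Row]
  }
}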
val newRowCount = rowCounts.get(p.spec)

def updateStats(newStats: CatalogStatistics): Unit = {
  sessionState.catalog.alterPartitions(tableMeta.identifier,
Can we collect all partitions with new stats and alter them together? Currently, alterPartitions is called once per partition.
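A sketch of the batched update, assuming partitions is the sequence from the surrounding diff, CatalogTablePartition carries the stats field this PR introduces, and computeNewStats is a hypothetical stand-in for the per-partition logic:

// Build the full list of partitions with updated stats first ...
val updatedPartitions = partitions.flatMap { p =>
  computeNewStats(p).map(newStats => p.copy(stats = Some(newStats)))
}
// ... then make a single metastore call instead of one per partition.
if (updatedPartitions.nonEmpty) {
  sessionState.catalog.alterPartitions(tableMeta.identifier, updatedPartitions)
}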
  calculateRowCountsPerPartition(sparkSession, tableMeta)
}

partitions.foreach(p => {
partitions.foreach { p =>
  }
}

test("analyze single partition noscan") {
We can combine the test cases for noscan and non-noscan: first analyze with noscan and check the stats, then analyze without noscan and check again. Currently there is a lot of redundant code.
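A rough sketch of the combined flow, with the table and partition names assumed:

// Analyze with NOSCAN first: only sizeInBytes should be set.
sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN")
// assert: stats.sizeInBytes defined, stats.rowCount empty

// Then analyze without NOSCAN: rowCount should now be set as well.
sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS")
// assert: stats.sizeInBytes and stats.rowCount both defined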
@@ -1028,25 +994,115 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
  currentFullPath
}

private def statsToHiveProperties(
How about statsToProperties and statsFromProperties? The current names seem related to Hive stats, which is a little ambiguous.
    sessionState: SessionState,
    catalogTable: CatalogTable,
    partition: CatalogTablePartition): Long = {
  calculateLocationSize(sessionState, catalogTable.identifier, partition.storage.locationUri)
Shall we remove this method? It's only one line and can be replaced with a direct call to calculateLocationSize.
  }
}

test("analyze a set of partitions") {
Please also check the following three cases:
ANALYZE TABLE tab PARTITION(ds='2010-01-01', hr) COMPUTE STATISTICS; -- analyze two partitions
ANALYZE TABLE tab PARTITION(ds, hr) COMPUTE STATISTICS; -- analyze four partitions
ANALYZE TABLE tab PARTITION(ds, hr='10') COMPUTE STATISTICS; -- Is this allowed in Hive?
@wzhfy, given that these cases are covered in SparkSqlParserSuite, is it still necessary to cover them again in StatisticsSuite? Partition columns without values are removed at the parsing stage, so that AnalyzePartitionCommand always receives partition columns along with values.
re: (ds, hr='10') - https://cwiki.apache.org/confluence/display/Hive/StatsDev suggests that it is allowed; this PR supports this syntax.
@mbasmanova IIUC, the logic is wrong here. For example, when analyzing partition (ds, hr), we should not remove the columns in the parser. Currently we parse it to AnalyzeTableCommand, which collects table-level stats, but what we need to do is collect partition-level stats for all partitions. Check Hive's behavior here.
sql(
  s"""
     |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
     |SELECT * FROM src
How about using SELECT '1', 'A' in these tests? We don't need to make the table this big.
Test build #79232 has finished for PR 18421 at commit
@wzhfy, thank you for the review. I created the AnalyzePartitionCommand class, modified the logic to update all partitions in a single call to the Metastore, and combined the noscan and non-noscan test cases. I believe I addressed all comments except for the requests to make queryStats and assertStats common functions: after merging the test cases, each of these functions is used in only one test, so there is no duplication anymore. Would you take another look?
Test build #79248 has finished for PR 18421 at commit
Test build #79246 has finished for PR 18421 at commit
Force-pushed from 8c0b61b to 43f2112
…ed other review comments
… form a prefix of the partition columns defined in table schema
Force-pushed from b1e5079 to 87594d6
private def getPartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = {
  val partitionColumnNames = table.partitionColumnNames.toSet
  val partitionSpecWithCase =
Instead of changing the case, could you call sparkSession.sessionState.conf.resolver? There are many examples in the code base. Thanks!
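A sketch of that pattern, inferred from similar call sites rather than taken from this PR:

// Match user-supplied partition column names against the table schema using
// the session resolver, which honors spark.sql.caseSensitive.
val resolver = sparkSession.sessionState.conf.resolver
val normalizedSpec = partitionSpec.map { case (key, value) =>
  val colName = table.partitionColumnNames
    .find(resolver(_, key))
    .getOrElse(throw new org.apache.spark.sql.AnalysisException(
      s"$key is not a valid partition column in table ${table.identifier}"))
  colName -> value
}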
Test build #80806 has finished for PR 18421 at commit
…ndle conf.caseSensitiveAnalysis
Test build #80816 has finished for PR 18421 at commit
// Report an error if partition columns in partition specification do not form
// a prefix of the list of partition columns defined in the table schema
val isSpecified =
-> isNotSpecified
private def getPartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = {
  val normalizedPartitionSpec =
    PartitioningUtils.normalizePartitionSpec(partitionSpec, table.partitionColumnNames,
      table.identifier.quotedString, conf.resolver);
Nit: remove ;
LGTM pending Jenkins
Test build #80846 has finished for PR 18421 at commit
Thanks!!! Merging to master.
@gatorsmile, so excited to see this change merged. Thank you for your careful, detailed reviews.
@mbasmanova Great work! I was really busy in the past two months so I didn't have time to look at this. |
@wzhfy Welcome back! |
What changes were proposed in this pull request?
Added support for the ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command, which calculates the total number of rows and the size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to the partition objects, using the same property names as table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.
When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or are listed without their values, the command collects statistics for all partitions that match the specified subset of partition column values.
For example, table t has 4 partitions with the following specs:
Partition 1: (ds='2008-04-08', hr=11)
Partition 2: (ds='2008-04-08', hr=12)
Partition 3: (ds='2008-04-09', hr=11)
Partition 4: (ds='2008-04-09', hr=12)
'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3.
'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4.
'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions.
When the optional parameter NOSCAN is specified, the command doesn't count the number of rows and gathers only the size in bytes.
The statistics gathered by the ANALYZE TABLE command can be fetched using the DESC EXTENDED [db_name.]tablename PARTITION command.
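For example (a minimal sketch; the exact layout of the DESC output varies by Spark version):

spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS")
spark.sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(truncate = false)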
How was this patch tested?
Added tests.