
[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes #18421


Closed



@mbasmanova mbasmanova commented Jun 26, 2017

What changes were proposed in this pull request?

Added support for the ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate the total number of rows and size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to partition objects. The property names are the same as those used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.

When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or are listed without values, the command collects statistics for all partitions that match the specified subset of partition column values.

For example, suppose table t has four partitions with the following specs:

  • Partition1: (ds='2008-04-08', hr=11)
  • Partition2: (ds='2008-04-08', hr=12)
  • Partition3: (ds='2008-04-09', hr=11)
  • Partition4: (ds='2008-04-09', hr=12)

ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) will collect statistics only for Partition3.

ANALYZE TABLE t PARTITION (ds='2008-04-09') will collect statistics for Partition3 and Partition4.

ANALYZE TABLE t PARTITION (ds, hr) will collect statistics for all four partitions.

When the optional parameter NOSCAN is specified, the command doesn't count the number of rows and only gathers the size in bytes.

The statistics gathered by the ANALYZE TABLE command can be fetched using the DESC EXTENDED [db_name.]tablename PARTITION command.
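
For illustration, a minimal usage sketch in Scala, assuming a Hive-enabled SparkSession named spark and the four-partition table t from the example above:

// Collect rowCount and sizeInBytes for a single partition.
spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS")

// With NOSCAN, only spark.sql.statistics.totalSize is gathered; the row count is skipped.
spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09') COMPUTE STATISTICS NOSCAN")

// Fetch the stored statistics for one partition.
spark.sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(truncate = false)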

How was this patch tested?

Added tests.

@gatorsmile
Member

ok to test

@gatorsmile
Member

add to whitelist

@gatorsmile
Member

cc @wzhfy


SparkQA commented Jun 27, 2017

Test build #78663 has finished for PR 18421 at commit e18f144.

  • This patch fails PySpark pip packaging tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Please update the PR title.

[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes


val table = visitTableIdentifier(ctx.tableIdentifier)
if (ctx.identifierSeq() == null) {
  AnalyzeTableCommand(table, noscan, partitionSpec)
Member

AnalyzeTableCommand(table, noscan = ctx.identifier != null, partitionSpec)

  true
} else {
  false
}
Member

    if (ctx.identifier != null &&
        ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
      throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`", ctx)
    }

} else {
  None
}
Member

val partitionSpec = Option(ctx.partitionSpec).map(visitNonOptionalPartitionSpec)

import org.apache.spark.sql.internal.SessionState


/**
* Analyzes the given table to generate statistics, which will be used in query optimizations.
* Analyzes the given table or partition to generate statistics, which will be used in
* query optimizations.
Member

Could you please add a description of partitionSpec?

If certain partition specs are specified, then statistics are gathered for only those partitions.

AnalyzeTableCommand(TableIdentifier("t"), noscan = true))
AnalyzeTableCommand(TableIdentifier("t"), noscan = true,
partitionSpec = Some(Map("ds" -> "2008-04-09", "hr" -> "11"))))
intercept("ANALYZE TABLE t PARTITION(ds, hr) COMPUTE STATISTICS",
Member

Shouldn't this be legal, based on the Hive documentation?

https://cwiki.apache.org/confluence/display/Hive/StatsDev

Contributor Author

I see. Thanks for pointing that out. Currently, this PR supports only an exact partition spec. It doesn't support the partial partition specs described in the document above. My preference would be to keep this PR simple by supporting only the exact spec, and to add support for partial specs in a follow-up PR. What do you think?

Member

@gatorsmile gatorsmile Jun 28, 2017

I am fine with this. Please update your PR description and the class descriptions of visitAnalyze and AnalyzeTableCommand.

Contributor

If we first support only the exact partition spec, then in the near future we will have to change (replace) the syntax and the related implementation. In other words, we would not be doing this incrementally, so I'd prefer to support the right syntax in this single PR. Would supporting that syntax require a lot of work?

Contributor Author

@wzhfy, I expect the syntax change to be quite small and incremental. Currently, it is necessary to specify all partition columns along with their values. The change will be to allow only a subset of partition columns and to allow partition columns without values.

PARTITION (partcol1=val1,...) -> PARTITION (partcol1[=val1],...)

That said, I want to try to allow partial partition specs in this PR. Let me spend some time on it and report back my findings.

@mbasmanova mbasmanova changed the title [SPARK-21213][SQL] Support collecting partition-level statistics: row… [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes Jun 27, 2017
@mbasmanova mbasmanova force-pushed the mbasmanova-analyze-partition branch from f9327b0 to 090be78 Compare June 28, 2017 04:25
@@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
* {{{
* ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
* }}}
* Example SQL for analyzing a single partition :
* {{{
* ANALYZE TABLE table PARTITION (key=value,..) COMPUTE STATISTICS [NOSCAN];
Member

@gatorsmile gatorsmile Jun 28, 2017

The existing syntax description is not very clear. Could you improve it?

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] 
COMPUTE STATISTICS [NOSCAN]

In addition, since we have a restriction, please do not call visitNonOptionalPartitionSpec. Instead, we can capture the unset partition column and issue a more user-friendly exception message.

Contributor Author

@mbasmanova mbasmanova Jun 28, 2017

I'm seeing that visitNonOptionalPartitionSpec detects unset partition columns and throws an exception: Found an empty partition key '$key'. Is this sufficient or do you have something else in mind?

/**
 * Create a partition specification map without optional values.
 */
protected def visitNonOptionalPartitionSpec(
    ctx: PartitionSpecContext): Map[String, String] = withOrigin(ctx) {
  visitPartitionSpec(ctx).map {
    case (key, None) => throw new ParseException(s"Found an empty partition key '$key'.", ctx)
    case (key, Some(value)) => key -> value
  }
}

@@ -95,25 +95,32 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) {
* {{{
* ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
Member

Also here.


SparkQA commented Jun 28, 2017

Test build #78750 has finished for PR 18421 at commit 090be78.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 28, 2017

Test build #78749 has finished for PR 18421 at commit f9327b0.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 28, 2017

Test build #78809 has finished for PR 18421 at commit 2d1e817.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mbasmanova mbasmanova force-pushed the mbasmanova-analyze-partition branch from 2d1e817 to 70b35ce Compare June 29, 2017 18:52
@mbasmanova
Contributor Author

@wzhfy, @gatorsmile, I updated this PR to support partial partition specs, where values are defined for only a subset of partition columns. For example, if a table has two partition columns, ds and hr, both PARTITION (ds='2010-01-01', hr=10) and PARTITION (hr=10) are valid specs. I did not add support for partition specs where a partition column is mentioned without any value, e.g. PARTITION (ds, hr=10) is still not allowed. I didn't implement this because I don't understand why this syntax is useful: my understanding is that (ds, hr=10) is equivalent to (hr=10). Do you think it is important to implement this syntax? If so, would you help me understand why? Is this because we want to support HQL at the same level as Hive, to allow a seamless transition from Hive to Spark?


SparkQA commented Jun 29, 2017

Test build #78923 has finished for PR 18421 at commit 70b35ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mbasmanova
Contributor Author

@wzhfy, @gatorsmile, I updated the PR to add full support for partial partition specs. This version supports a subset of partition columns with or without values specified. I'm going away for the long weekend tomorrow and will be back next Wednesday (July 5). I may not be able to respond to comments or update this PR until then.


SparkQA commented Jun 29, 2017

Test build #78927 has finished for PR 18421 at commit aecb5a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 30, 2017

Test build #78929 has finished for PR 18421 at commit b0804c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 30, 2017

Test build #78930 has finished for PR 18421 at commit b527f18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (filteredSpec.isEmpty) {
  None
} else {
  Some(filteredSpec.mapValues(v => v.get))
Contributor

mapValues(_.get)


val partitionSpec =
  if (ctx.partitionSpec != null) {
    val filteredSpec = visitPartitionSpec(ctx.partitionSpec).filter(x => x._2.isDefined)
Contributor

_._2.isDefined

 */
case class AnalyzeTableCommand(
    tableIdent: TableIdentifier,
    noscan: Boolean = true) extends RunnableCommand {
    noscan: Boolean = true,
    partitionSpec: Option[TablePartitionSpec] = None) extends RunnableCommand {
Contributor

How about adding a new AnalyzePartitionCommand? We can put all partition-level logic there (including partition-level column stats in the future). I think that would make the logic clearer.
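
The merged PR did end up adding such a class (see the SparkQA output further down). For context, a minimal sketch of what it could look like; the partitionSpec field type and the run body are assumptions, not the merged implementation:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.command.RunnableCommand

// Illustrative skeleton only; field names mirror AnalyzeTableCommand above.
case class AnalyzePartitionCommand(
    tableIdent: TableIdentifier,
    partitionSpec: Map[String, Option[String]],
    noscan: Boolean = true) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    // 1. Resolve the table and validate partitionSpec against its partition columns.
    // 2. List the matching partitions and compute sizeInBytes (and rowCount unless noscan).
    // 3. Persist the new statistics to the metastore.
    Seq.empty[Row]
  }
}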

val newRowCount = rowCounts.get(p.spec)

def updateStats(newStats: CatalogStatistics): Unit = {
  sessionState.catalog.alterPartitions(tableMeta.identifier,
Contributor

Can we collect all partitions with new stats and alter them together? Currently alterPartitions is called once per partition.
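
A sketch of the batched approach, assuming CatalogTablePartition carries an optional stats field and using a hypothetical computeNewStats helper:

// Build all updated partition objects first, then issue a single alterPartitions call.
val updatedPartitions = partitions.flatMap { p =>
  // computeNewStats: CatalogTablePartition => Option[CatalogStatistics] (hypothetical)
  computeNewStats(p).map(newStats => p.copy(stats = Some(newStats)))
}
if (updatedPartitions.nonEmpty) {
  sessionState.catalog.alterPartitions(tableMeta.identifier, updatedPartitions)
}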

calculateRowCountsPerPartition(sparkSession, tableMeta)
}

partitions.foreach(p => {
Contributor

partitions.foreach { p =>

}
}

test("analyze single partition noscan") {
Contributor

We can combine the test cases for noscan and non-noscan by first analyzing with noscan and checking, then analyzing without noscan and checking again. Currently there is a lot of redundant code. A sketch of the combined shape follows below.
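
Here, queryStats is a hypothetical helper returning the stored statistics for a partition; the actual helpers in the test suite may differ:

test("analyze a single partition") {
  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN")
  // NOSCAN populates only sizeInBytes, so no row count is expected yet.
  assert(queryStats(Map("ds" -> "2010-01-01")).rowCount.isEmpty)

  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS")
  // The full scan adds the row count on top of the size.
  assert(queryStats(Map("ds" -> "2010-01-01")).rowCount.isDefined)
}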

@@ -1028,25 +994,115 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
currentFullPath
}

private def statsToHiveProperties(
Contributor

How about statsToProperties and statsFromProperties? The current names seem related to Hive stats, which is a little ambiguous.

    sessionState: SessionState,
    catalogTable: CatalogTable,
    partition: CatalogTablePartition): Long = {
  calculateLocationSize(sessionState, catalogTable.identifier, partition.storage.locationUri)
Contributor

Shall we remove this method? It's only one line, and callers can use calculateLocationSize directly.

}
}

test("analyze a set of partitions") {
Contributor

Please also check the following three cases:

ANALYZE TABLE tab PARTITION(ds='2010-01-01', hr) COMPUTE STATISTICS; -- analyze two partitions
ANALYZE TABLE tab PARTITION(ds, hr) COMPUTE STATISTICS; -- analyze four partitions
ANALYZE TABLE tab PARTITION(ds, hr='10') COMPUTE STATISTICS; -- Is this allowed in Hive?

Contributor Author

@wzhfy, given that these cases are covered in SparkSqlParserSuite, is it still necessary to cover them again in StatisticsSuite? Partition columns without values are removed at the parsing stage, so AnalyzePartitionCommand always receives partition columns along with values.

re: (ds, hr='10') - https://cwiki.apache.org/confluence/display/Hive/StatsDev suggests that it is allowed; this PR supports this syntax.

Contributor

@mbasmanova IIUC, the logic is wrong here. For example, when analyzing partition (ds, hr), we should not remove the columns in the parser. Currently we parse it to AnalyzeTableCommand, which collects table-level stats, but what we need to do is collect partition-level stats for all partitions.
Check Hive's behavior here.

sql(
s"""
|INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
|SELECT * FROM src
Contributor

How about using SELECT '1', 'A' in these tests? We don't need to make the table this big.


SparkQA commented Jul 5, 2017

Test build #79232 has finished for PR 18421 at commit 0150fea.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mbasmanova
Contributor Author

@wzhfy, thank you for the review. I created the AnalyzePartitionCommand class, modified the logic to update all partitions in a single call to the Metastore, and combined the noscan and non-noscan test cases. I believe I addressed all comments except for the requests to make queryStats and assertStats common functions: after merging the test cases, each of these functions is used in only one test, hence there is no copy-paste anymore. Would you take another look?


SparkQA commented Jul 6, 2017

Test build #79248 has finished for PR 18421 at commit 8c0b61b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 6, 2017

Test build #79246 has finished for PR 18421 at commit 2d8732e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AnalyzePartitionCommand(

@mbasmanova mbasmanova force-pushed the mbasmanova-analyze-partition branch from 8c0b61b to 43f2112 Compare July 6, 2017 13:41
@mbasmanova mbasmanova force-pushed the mbasmanova-analyze-partition branch from b1e5079 to 87594d6 Compare August 17, 2017 20:14

private def getPartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = {
  val partitionColumnNames = table.partitionColumnNames.toSet
  val partitionSpecWithCase =
Member

Instead of changing the case, could you call sparkSession.sessionState.conf.resolver? There are many examples in the code base. Thanks!
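
For reference, resolver is a (String, String) => Boolean that respects spark.sql.caseSensitive. A sketch of using it to match a user-supplied partition column against the table's partition columns (the final code below delegates this to PartitioningUtils.normalizePartitionSpec):

import org.apache.spark.sql.AnalysisException

// Match each user-supplied key against the table's partition columns
// via the session resolver instead of manually changing the case.
val resolver = sparkSession.sessionState.conf.resolver
val normalizedSpec = partitionSpec.map { case (key, value) =>
  val colName = table.partitionColumnNames
    .find(resolver(_, key))
    .getOrElse(throw new AnalysisException(
      s"$key is not a partition column of table ${table.identifier}"))
  colName -> value
}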


SparkQA commented Aug 17, 2017

Test build #80806 has finished for PR 18421 at commit 87594d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 18, 2017

Test build #80816 has finished for PR 18421 at commit 3353afa.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Report an error if partition columns in partition specification do not form
// a prefix of the list of partition columns defined in the table schema
val isSpecified =
Member

-> isNotSpecified

private def getPartitionSpec(table: CatalogTable): Option[TablePartitionSpec] = {
val normalizedPartitionSpec =
PartitioningUtils.normalizePartitionSpec(partitionSpec, table.partitionColumnNames,
table.identifier.quotedString, conf.resolver);
Member

Nit: remove ;

@gatorsmile
Member

LGTM pending Jenkins


SparkQA commented Aug 18, 2017

Test build #80846 has finished for PR 18421 at commit 8ffb140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks!!! Merging to master.

@asfgit asfgit closed this in 23ea898 Aug 18, 2017
@mbasmanova
Contributor Author

@gatorsmile, so excited to see this change merged. Thank you for your careful, detailed reviews.

@wzhfy
Contributor

wzhfy commented Aug 28, 2017

@mbasmanova Great work! I was really busy over the past two months, so I didn't have time to look at this.
Thanks to @gatorsmile for reviewing and merging this PR!

@gatorsmile
Member

@wzhfy Welcome back!
