[SPARK-23034][SQL] Override nodeName for all *ScanExec operators #20226
Conversation
It looks useful, @tejasapatil. Given that this is a one-line addition, why don't you handle the others?
BTW, can we have a title prefix?
@dongjoon-hyun: For Spark native tables, the table scan node is abstracted out as a …
Then, specifically, what happens in this PR for Parquet/ORC Hive tables which are converted to data source tables with …?
@@ -62,6 +62,8 @@ case class HiveTableScanExec(

  override def conf: SQLConf = sparkSession.sessionState.conf

+ override def nodeName: String = s"${super.nodeName}-${relation.tableMeta.qualifiedName}"
s"${super.nodeName}(${relation.tableMeta.qualifiedName})"
looks clearer to me, but up to you.
I like this format. I added a space in between for better readability.
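For illustration, here is a tiny standalone sketch of the three formats discussed above; the parent node name and table name below are made-up values, not the actual fields:

object NodeNameFormatDemo extends App {
  val superNodeName = "HiveTableScan" // hypothetical stand-in for super.nodeName
  val qualifiedName = "db.my_table"   // hypothetical stand-in for relation.tableMeta.qualifiedName

  println(s"$superNodeName-$qualifiedName")   // HiveTableScan-db.my_table (initial version)
  println(s"$superNodeName($qualifiedName)")  // HiveTableScan(db.my_table) (suggested)
  println(s"$superNodeName ($qualifiedName)") // HiveTableScan (db.my_table) (with the added space)
}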
(updated the screenshot in the PR description)
[Screenshot: HiveTableScan node in the UI]
@dongjoon-hyun: I tried it out over master and, since the table scan goes via codegen, it won't show the table name. Will update the PR description with this finding. Let's move this discussion to the JIRA and see what people have to say about the concern I speculated about.
Test build #85941 has finished for PR 20226 at commit …
Jenkins retest this please.
Thank you, @tejasapatil. I see.
Test build #85943 has finished for PR 20226 at commit …
@dongjoon-hyun: I have updated the PR description.
+1, LGTM.
Test build #85945 has finished for PR 20226 at commit …
LGTM too.
@@ -62,6 +62,8 @@ case class HiveTableScanExec(

  override def conf: SQLConf = sparkSession.sessionState.conf

+ override def nodeName: String = s"${super.nodeName} (${relation.tableMeta.qualifiedName})"
Our DataSourceScanExec is using unquotedString in nodeName. We need to make these LeafNode implementations consistent. Could you check all the other LeafExecNodes?
DataSourceScanExec is using unquotedString, and so does this PR. The rest of the LeafExecNode implementations do not override nodeName and/or depend on the base class (i.e. TreeNode).
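As a minimal sketch of that fallback (simplified and assumed from memory of catalyst's TreeNode at the time; the real code may differ in detail), the default nodeName is just the class's simple name with a trailing "Exec" stripped:

// Simplified stand-in for catalyst's TreeNode default; not the real Spark trait.
trait TreeNodeSketch {
  def nodeName: String = getClass.getSimpleName.replaceAll("Exec$", "")
}

// Hypothetical operator that does not override nodeName.
case class SomeTableScanExec() extends TreeNodeSketch

object DefaultNodeNameDemo extends App {
  println(SomeTableScanExec().nodeName) // SomeTableScan
}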
How about s"Scan HiveTable ${relation.tableMeta.qualifiedName}"? Just to be more consistent.
DataSourceV2ScanExec faces the same issue. How about InMemoryTableScanExec?
@gatorsmile: I have updated the PR after going through all the *ScanExec implementations.
Changes introduced in this PR:
Scan impl | overridden nodeName
---|---
DataSourceV2ScanExec | Scan DataSourceV2 [output_attribute1, output_attribute2, ..]
ExternalRDDScanExec | Scan ExternalRDD [output_attribute1, output_attribute2, ..]
FileSourceScanExec | Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation.location)}
HiveTableScanExec | Scan HiveTable relation.tableMeta.qualifiedName
InMemoryTableScanExec | Scan In-memory relation.tableName
LocalTableScanExec | Scan LocalTable [output_attribute1, output_attribute2, ..]
RDDScanExec | Scan RDD name [output_attribute1, output_attribute2, ..]
RowDataSourceScanExec | Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation)}
Things not affected:
- DataSourceScanExec: already uses Scan relation tableIdentifier.unquotedString
- RDDScanExec: forces clients to specify the nodeName
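A standalone, hedged sketch of the naming pattern in the table above, including the getOrElse fallback used by the file-based scans; TableIdentifier and the field names here are simplified stand-ins, not Spark's actual classes:

// Simplified stand-in types; not the actual Spark classes.
case class TableIdentifier(database: String, table: String) {
  def unquotedString: String = s"$database.$table"
}

case class FileSourceScanSketch(tableIdentifier: Option[TableIdentifier], location: String) {
  // "Scan FileSource <table or location>", as listed in the table above.
  val nodeName: String =
    s"Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(location)}"
}

object ScanNodeNameDemo extends App {
  println(FileSourceScanSketch(Some(TableIdentifier("db", "events")), "/tmp/x").nodeName)
  // Scan FileSource db.events
  println(FileSourceScanSketch(None, "hdfs://warehouse/events").nodeName)
  // Scan FileSource hdfs://warehouse/events
}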
Force-pushed from 6271804 to 9bcd905.
Test build #86022 has finished for PR 20226 at commit …
Changed the title from "HiveTableScan node in UI" to "nodeName for all *ScanExec operators".
Test build #86044 has finished for PR 20226 at commit …
Overall, the fixes look good to me. We just need to resolve the test cases. Thanks for improving it!
The test failure does look legit to me. I have not been able to repro it on my laptop. IntelliJ doesn't treat it as a test case. The command line does recognize it as a test case but hits a runtime failure with a jar mismatch. I am using this to run the test: …
Is there special setup needed to run these tests?
Try this? It worked for me before: …
f16b73b
to
65ec7a2
Compare
Jenkins retest this please.
@@ -45,7 +46,12 @@ trait CodegenSupport extends SparkPlan {
  case _: SortMergeJoinExec => "smj"
  case _: RDDScanExec => "rdd"
  case _: DataSourceScanExec => "scan"
  case _ => nodeName.toLowerCase(Locale.ROOT)
This caused one of the tests to fail, as the nodeName generated was not a single word (like before) but something like scan in-memory my_table, which does not compile with codegen. The change done here retains only the alphanumeric characters of the nodeName while generating variablePrefix.
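A minimal sketch of that sanitization; the node name below is one of the multi-word names introduced by this PR, and the exact Spark code may differ:

import java.util.Locale

object VariablePrefixDemo extends App {
  // Multi-word node names like this one broke codegen, because the prefix
  // is embedded in generated Java variable names.
  val nodeName = "Scan In-memory my_table"

  // Keep only alphanumeric characters, as described above.
  val variablePrefix = nodeName.toLowerCase(Locale.ROOT).filter(_.isLetterOrDigit)

  println(variablePrefix) // scaninmemorymytable
}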
Test build #86300 has finished for PR 20226 at commit …
Test build #86299 has finished for PR 20226 at commit …
Force-pushed from 65ec7a2 to 0c0aa94.
@@ -30,6 +30,8 @@ case class LocalTableScanExec(
    output: Seq[Attribute],
    @transient rows: Seq[InternalRow]) extends LeafExecNode {

+ override val nodeName: String = s"Scan LocalTable ${output.map(_.name).mkString("[", ",", "]")}"
It sounds like we have duplicate info about output in stringArgs.
I believe you are referring to the duplication at:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (line 466 in 3f958a9):
def simpleString: String = s"$nodeName $argString".trim
I am changing this to just have Scan LocalTable.
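To make the duplication concrete, a small standalone sketch (the field values are stand-ins) of what simpleString would print when nodeName also embeds the output:

object SimpleStringDuplicationDemo extends App {
  val output = Seq("id", "value") // stand-in for output.map(_.name)

  // nodeName as originally added in this PR, embedding the output...
  val nodeName = s"Scan LocalTable ${output.mkString("[", ",", "]")}"
  // ...while argString (derived from stringArgs) already shows the output too.
  val argString = output.mkString("[", ",", "]")

  // TreeNode.simpleString concatenates the two, so the output appears twice:
  println(s"$nodeName $argString".trim)
  // Scan LocalTable [id,value] [id,value]
}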
Test build #86353 has finished for PR 20226 at commit …
Test build #86407 has finished for PR 20226 at commit …
@@ -233,7 +233,7 @@ struct<plan:string>
 -- !query 28 output
 == Physical Plan ==
 *Project [null AS (CAST(concat(a, CAST(1 AS STRING)) AS DOUBLE) + CAST(2 AS DOUBLE))#x]
-+- Scan OneRowRelation[]
++- Scan Scan RDD OneRowRelation [][]
?
Test build #86850 has finished for PR 20226 at commit …
It sounds like we still need to fix a test in PySpark. Thanks!
@@ -86,6 +86,9 @@ case class RowDataSourceScanExec(

  def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput)

+ override val nodeName: String =
DataSourceScanExec.nodeName is defined as s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}", so do we really need to override it here?
My intent was to be able to distinguish between RowDataSourceScan and FileSourceScan. Removing those overrides.
By default …
Can we just change the UI code to put …?
@@ -103,6 +103,8 @@ case class ExternalRDDScanExec[T](
  override lazy val metrics = Map(
    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))

+ override val nodeName: String = s"Scan ExternalRDD ${output.map(_.name).mkString("[", ",", "]")}"
I don't think including the output in the node name is a good idea.
My intention here was to be able to distinguish between ExternalRDDScanExec nodes. If we remove the output part from nodeName, then these nodes would all be named Scan ExternalRDD, which is generic.
    override val outputPartitioning: Partitioning = UnknownPartitioning(0),
    override val outputOrdering: Seq[SortOrder] = Nil) extends LeafExecNode {

+ override val nodeName: String = s"Scan RDD $name ${output.map(_.name).mkString("[", ",", "]")}"
ditto
Removed output. The name in there would help in identifying the nodes uniquely.
After going through the changes here, I think we only need to update 2 nodes to include the table name in …
Test build #94786 has finished for PR 20226 at commit …
@maropu Could you take this over?
Sure, will do, too.
What changes were proposed in this pull request?
For queries which scan multiple tables, it would be convenient if the DAG shown in the Spark UI also showed which table is being scanned; this will make debugging easier. For this JIRA, I am scoping the change to Hive table scans only. Table scans which happen via codegen (e.g. convertMetastore and Spark native tables) will not be affected by this PR.
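As a hypothetical illustration (the db.orders and db.users tables below are invented), a join over two Hive tables yields two leaf scan nodes; without this change both render identically as HiveTableScan in the UI DAG, while with it each scan carries its table's qualified name (e.g. Scan HiveTable db.orders, per the table earlier in this conversation):

import org.apache.spark.sql.SparkSession

object TableNameInDagDemo extends App {
  // Assumes Hive support is available and that the invented tables
  // db.orders and db.users exist; purely illustrative.
  val spark = SparkSession.builder().appName("demo").enableHiveSupport().getOrCreate()

  spark.sql(
    """SELECT o.id, u.name
      |FROM db.orders o
      |JOIN db.users u ON o.user_id = u.id
    """.stripMargin
  ).show()
}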
How was this patch tested?