
[SPARK-23034][SQL] Override nodeName for all *ScanExec operators #20226


Closed
wants to merge 8 commits into from

Conversation

tejasapatil
Contributor

@tejasapatil tejasapatil commented Jan 11, 2018

What changes were proposed in this pull request?

For queries that scan multiple tables, it would be convenient if the DAG shown in the Spark UI also showed which table is being scanned. This will make debugging easier. For this JIRA, I am scoping this to Hive table scans only. Table scans that happen via codegen (e.g. with convertMetastore enabled, and Spark native tables) are not affected by this PR.

How was this patch tested?

screen shot 2018-01-10 at 3 52 58 pm

screen shot 2018-01-10 at 5 12 39 pm

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 11, 2018

It looks useful, @tejasapatil . Given that this is a one-line addition, why don't you handle the others?

For this JIRA, I am scoping those for hive table scans only.

BTW, can we have a title prefix [SQL] instead of [Hive]?

@tejasapatil
Contributor Author

@dongjoon-hyun : For Spark native tables, the table scan node is abstracted out as a WholeStageCodegen node in the DAG. A codegen node might be doing more things besides the table scan, so it is debatable how people would react if we started calling it tablescan-TableA.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 11, 2018

Then, specifically, what happens in this PR for a Parquet/ORC Hive table that is converted to a data source table via convertMetastoreParquet/Orc? Both of those parameters are true by default now.

@@ -62,6 +62,8 @@ case class HiveTableScanExec(

override def conf: SQLConf = sparkSession.sessionState.conf

override def nodeName: String = s"${super.nodeName}-${relation.tableMeta.qualifiedName}"
Contributor

s"${super.nodeName}(${relation.tableMeta.qualifiedName})" looks clearer to me, but up to you.

Contributor Author

@tejasapatil tejasapatil Jan 11, 2018


I like this format. I added a space in between for better readability.

Contributor Author

(updated the screenshot in the PR description)

@tejasapatil tejasapatil changed the title [SPARK-23034][Hive][UI] Display tablename for HiveTableScan node in UI [SPARK-23034][SQL][UI] Display tablename for HiveTableScan node in UI Jan 11, 2018
@tejasapatil
Contributor Author

@dongjoon-hyun : I tried it out on master and, since the table scan goes via codegen, it won't show the table name. I will update the PR description with this finding. Let's move this discussion to the JIRA and see what people have to say about the concern I raised.

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85941 has finished for PR 20226 at commit b972d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

Jenkins retest this please.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 11, 2018

Thank you, @tejasapatil . I see.
Although this is not applicable to ORC/Parquet Hive tables, the PR looks useful to me.
Could you put the convertMetastore condition and limitation in the PR description? Otherwise, people may be confused about it.

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85943 has finished for PR 20226 at commit 6271804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@dongjoon-hyun : I have updated the PR description.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85945 has finished for PR 20226 at commit 6271804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

LGTM too

@@ -62,6 +62,8 @@ case class HiveTableScanExec(

override def conf: SQLConf = sparkSession.sessionState.conf

override def nodeName: String = s"${super.nodeName} (${relation.tableMeta.qualifiedName})"
Member

Our DataSourceScanExec is using unquotedString in nodeName. We need to make these LeafNode consistent. Could you check all the other LeafExecNode?

Contributor Author

@tejasapatil tejasapatil Jan 11, 2018

DataSourceScanExec is using unquotedString, and so does this PR. The rest of the LeafExecNode implementations either do not override nodeName or depend on the base class (i.e. TreeNode).

Member

@gatorsmile gatorsmile Jan 12, 2018

How about

s"Scan HiveTable ${relation.tableMeta.qualifiedName}"

Just to be more consistent

Member

DataSourceV2ScanExec faces the same issue. How about InMemoryTableScanExec?

Contributor Author

@gatorsmile : I have updated the PR after going through all the *ScanExec implementations.

Changes introduced in this PR:

| Scan impl | overridden nodeName |
| --- | --- |
| DataSourceV2ScanExec | Scan DataSourceV2 [output_attribute1, output_attribute2, ..] |
| ExternalRDDScanExec | Scan ExternalRDD [output_attribute1, output_attribute2, ..] |
| FileSourceScanExec | Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation.location)} |
| HiveTableScanExec | Scan HiveTable relation.tableMeta.qualifiedName |
| InMemoryTableScanExec | Scan In-memory relation.tableName |
| LocalTableScanExec | Scan LocalTable [output_attribute1, output_attribute2, ..] |
| RDDScanExec | Scan RDD name [output_attribute1, output_attribute2, ..] |
| RowDataSourceScanExec | Scan FileSource ${tableIdentifier.map(_.unquotedString).getOrElse(relation)} |

Things not affected:

  • DataSourceScanExec : already uses Scan relation tableIdentifier.unquotedString
  • RDDScanExec forces clients to specify the nodeName
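The pattern applied across these operators can be sketched as follows. This is a simplified, hypothetical stand-in for Spark's actual classes: `LeafExecNode` here is a bare trait, and the `qualifiedName` parameter stands in for `relation.tableMeta.qualifiedName`.

```scala
// Minimal sketch (not the actual Spark classes) of the nodeName override
// pattern: each *ScanExec prefixes "Scan <kind>" and appends an
// identifying detail such as the qualified table name.
trait LeafExecNode {
  def nodeName: String = getClass.getSimpleName.stripSuffix("Exec")
}

// `qualifiedName` is a hypothetical stand-in for relation.tableMeta.qualifiedName.
case class HiveTableScanExec(qualifiedName: String) extends LeafExecNode {
  override def nodeName: String = s"Scan HiveTable $qualifiedName"
}

object NodeNameDemo {
  def main(args: Array[String]): Unit =
    println(HiveTableScanExec("db1.tableA").nodeName) // Scan HiveTable db1.tableA
}
```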

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86022 has finished for PR 20226 at commit 9bcd905.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil changed the title [SPARK-23034][SQL][UI] Display tablename for HiveTableScan node in UI [SPARK-23034][SQL] Override nodeName for all *ScanExec operators Jan 12, 2018
@SparkQA

SparkQA commented Jan 12, 2018

Test build #86044 has finished for PR 20226 at commit f16b73b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Overall, the fixes look good to me. We just need to resolve the failing test cases.

Thanks for improving it!

@tejasapatil
Contributor Author

The test failure does look legit to me, but I have not been able to repro it on my laptop. IntelliJ doesn't treat it as a test case. The command line does recognize it as a test case but hits a runtime failure with a jar mismatch. I am using this to run the test:

build/mvn -Phive-thriftserver -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite test

Is there special setup needed to run these tests?

@gatorsmile
Member

build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly "hive-thriftserver/test-only org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite"

Try this? It worked for me before.

@tejasapatil
Contributor Author

Jenkins retest this please.

@@ -45,7 +46,12 @@ trait CodegenSupport extends SparkPlan {
case _: SortMergeJoinExec => "smj"
case _: RDDScanExec => "rdd"
case _: DataSourceScanExec => "scan"
case _ => nodeName.toLowerCase(Locale.ROOT)
Contributor Author

@tejasapatil tejasapatil Jan 18, 2018

This caused one of the tests to fail, as the generated nodeName was no longer a single word (like before) but something like scan in-memory my_table..., which does not compile with codegen. The change was to retain only the alphanumeric characters of the nodeName while generating variablePrefix.
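The sanitization described above can be sketched like this (an assumed implementation; the exact filtering used in the PR may differ):

```scala
import java.util.Locale

object VariablePrefixSketch {
  // Keep only alphanumeric characters so a multi-word node name like
  // "Scan In-memory my_table" still yields a valid identifier fragment
  // for the generated Java code.
  def variablePrefix(nodeName: String): String =
    nodeName.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "")

  def main(args: Array[String]): Unit =
    println(variablePrefix("Scan In-memory my_table")) // scaninmemorymytable
}
```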

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86300 has finished for PR 20226 at commit 65ec7a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86299 has finished for PR 20226 at commit 65ec7a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -30,6 +30,8 @@ case class LocalTableScanExec(
output: Seq[Attribute],
@transient rows: Seq[InternalRow]) extends LeafExecNode {

override val nodeName: String = s"Scan LocalTable ${output.map(_.name).mkString("[", ",", "]")}"
Member

It sounds like we have duplicate info about output in stringArgs.

Contributor Author

@tejasapatil tejasapatil Jan 20, 2018

I believe you are referring to the duplication at :

def simpleString: String = s"$nodeName $argString".trim

I am changing this line to just have Scan LocalTable.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #86353 has finished for PR 20226 at commit 0c0aa94.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86407 has finished for PR 20226 at commit bf90ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -233,7 +233,7 @@ struct<plan:string>
-- !query 28 output
== Physical Plan ==
*Project [null AS (CAST(concat(a, CAST(1 AS STRING)) AS DOUBLE) + CAST(2 AS DOUBLE))#x]
+- Scan OneRowRelation[]
+- Scan Scan RDD OneRowRelation [][]
Member

?

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86850 has finished for PR 20226 at commit 1facc05.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

It sounds like we still need to fix a test in PySpark. Thanks!

@@ -86,6 +86,9 @@ case class RowDataSourceScanExec(

def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput)

override val nodeName: String =
Contributor

DataSourceScanExec.nodeName is defined as s"Scan $relation ${tableIdentifier.map(_.unquotedString).getOrElse("")}"; do we really need to override it here?

Contributor Author

@tejasapatil tejasapatil Feb 6, 2018

My intent was to be able to distinguish between RowDataSourceScan and FileSourceScan. Removing those overrides.

@cloud-fan
Contributor

cloud-fan commented Feb 5, 2018

By default, simpleString is defined as s"$nodeName $argString".trim. If we override nodeName in some nodes, we should also override argString; otherwise we may have duplicated information in simpleString, which is used by explain.

Can we just change the UI code to put plan.simpleString in the plan graph?
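The duplication being discussed can be illustrated with a minimal sketch (simplified, hypothetical classes, not Spark's actual TreeNode hierarchy):

```scala
// Sketch of why overriding nodeName without also overriding argString
// duplicates information in simpleString.
abstract class Node {
  def nodeName: String = getClass.getSimpleName
  def argString: String
  def simpleString: String = s"$nodeName $argString".trim
}

case class LocalScanSketch(output: Seq[String]) extends Node {
  // nodeName now embeds the output attributes ...
  override def nodeName: String =
    s"Scan LocalTable ${output.mkString("[", ",", "]")}"
  // ... and argString repeats them, so simpleString shows them twice.
  def argString: String = output.mkString("[", ",", "]")
}

object SimpleStringDemo {
  def main(args: Array[String]): Unit =
    println(LocalScanSketch(Seq("a", "b")).simpleString) // Scan LocalTable [a,b] [a,b]
}
```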

@cloud-fan
Contributor

After more thoughts, I feel it's reasonable to include table information in the node name.

The UI displays nodeName in the plan graph, and displays simpleString in a pop-up window when users hover over the plan graph. Since table information is pretty important, it makes sense to display it in the plan graph instead of the pop-up window.

Data Source table scan does follow this rule
(screenshots of the Data Source scan node in the UI)

+1 on this PR to fix the hive table scan, or any other scan nodes that don't follow this rule.

@@ -103,6 +103,8 @@ case class ExternalRDDScanExec[T](
override lazy val metrics = Map(
"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))

override val nodeName: String = s"Scan ExternalRDD ${output.map(_.name).mkString("[", ",", "]")}"
Contributor

I don't think including the output in the node name is a good idea.

Contributor Author

My intention here was to be able to distinguish between ExternalRDDScanExec nodes. If we remove the output part from the nodeName, then these nodes would all be named Scan ExternalRDD, which is generic.

override val outputPartitioning: Partitioning = UnknownPartitioning(0),
override val outputOrdering: Seq[SortOrder] = Nil) extends LeafExecNode {

override val nodeName: String = s"Scan RDD $name ${output.map(_.name).mkString("[", ",", "]")}"
Contributor

ditto

Contributor Author

I removed the output. The name in there would help in identifying the nodes uniquely.

@cloud-fan
Contributor

After going through the changes here, I think we only need to update two nodes to include the table name in nodeName: the Hive table scan and the in-memory table scan.

@SparkQA

SparkQA commented Aug 15, 2018

Test build #94786 has finished for PR 20226 at commit 1facc05.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@maropu Could you take this over?

@maropu
Member

maropu commented Aug 18, 2018

Sure, will do.

8 participants