[SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api #3960

Closed · wants to merge 17 commits

Conversation

@yhuai (Contributor) commented Jan 8, 2015

With the changes in this PR, users can persist the metadata of tables created through the data source API in the Hive metastore, using DDL statements.
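
For illustration, a sketch of the kind of DDL this enables (the table name and path here are made up; the USING/OPTIONS form follows the data source API):

  // With a HiveContext, the table's metadata is persisted in the
  // metastore, so the table is still visible in later sessions.
  hiveContext.sql(
    """CREATE TABLE persistedJson
      |USING org.apache.spark.sql.json
      |OPTIONS (path '/data/sample.json')
    """.stripMargin)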

@SparkQA commented Jan 8, 2015

Test build #25277 has started for PR 3960 at commit 49bf1ac.

  • This patch merges cleanly.

@SparkQA commented Jan 9, 2015

Test build #25277 has finished for PR 3960 at commit 49bf1ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider
    • case class TableIdent(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25277/
Test FAILed.

@SparkQA commented Jan 9, 2015

Test build #25282 has started for PR 3960 at commit 172db80.

  • This patch merges cleanly.

@SparkQA commented Jan 9, 2015

Test build #25282 has finished for PR 3960 at commit 172db80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider
    • case class TableIdent(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25282/
Test PASSed.

…hSchema2

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
	sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala
	sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala
	sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
	sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

@SparkQA commented Jan 11, 2015

Test build #25372 has started for PR 3960 at commit feb88aa.

  • This patch merges cleanly.

@SparkQA commented Jan 11, 2015

Test build #25372 has finished for PR 3960 at commit feb88aa.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • case class TableIdent(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25372/
Test FAILed.

@yhuai yhuai changed the title [WIP][SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api [SPARK-4912][SQL] Persistent tables for the Spark SQL data sources api Jan 12, 2015
@SparkQA commented Jan 12, 2015

Test build #25387 has started for PR 3960 at commit 7fc4b56.

  • This patch merges cleanly.

@SparkQA commented Jan 12, 2015

Test build #25387 has finished for PR 3960 at commit 7fc4b56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class TableIdent(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25387/
Test PASSed.

@@ -50,8 +52,75 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
/** Connection to hive metastore. Usages should lock on `this`. */
protected[hive] val client = Hive.get(hive.hiveconf)

// TODO: Use this everywhere instead of tuples or databaseName, tableName,.
/** A fully qualified identifier for a table (i.e., database.tableName) */
case class TableIdent(database: String, name: String) {
Contributor

how about QualifiedTable?

@SparkQA commented Jan 12, 2015

Test build #25429 has finished for PR 3960 at commit 5315dfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class QualifiedTableName(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25429/
Test PASSed.

@SparkQA commented Jan 12, 2015

Test build #25435 has started for PR 3960 at commit 4456e98.

  • This patch merges cleanly.


checkAnswer(
  sql("SELECT * FROM jsonTable"),
  jsonFile("src/test/resources/sample.json").collect().toSeq)
Contributor

Maven and sbt use different working directories when running tests, so I am not sure that specifying the path this way will work. You probably need to use getClass.getResource to get the absolute path.
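
A minimal sketch of that suggestion (the resourcePath helper is hypothetical; the resource name is the one used in the test above):

  // Hypothetical helper: resolve a classpath resource to an absolute
  // path, so the test works under both Maven and sbt even though their
  // working directories differ.
  def resourcePath(name: String): String =
    getClass.getClassLoader.getResource(name).getPath

  checkAnswer(
    sql("SELECT * FROM jsonTable"),
    jsonFile(resourcePath("sample.json")).collect().toSeq)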

@SparkQA commented Jan 13, 2015

Test build #25438 has started for PR 3960 at commit c07cbc6.

  • This patch merges cleanly.

@SparkQA commented Jan 13, 2015

Test build #25435 has finished for PR 3960 at commit 4456e98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25435/
Test PASSed.

@SparkQA commented Jan 13, 2015

Test build #25438 has finished for PR 3960 at commit c07cbc6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25438/
Test FAILed.

@@ -310,4 +311,17 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
case _ => Nil
}
}

object DDLStrategy extends Strategy {
Contributor

Can we avoid making this a separate strategy?

Contributor (Author)

It was originally in CommandStrategy. I was trying to find a good place for these rules, but I did not find a suitable existing Strategy. Any suggestions?

Contributor (Author)

@scwf Actually, I think it is better to put all rules for the data source API in the same place.

Contributor

@yhuai, I mean that since CreateTableUsing and CreateTempTableUsing are commands, we'd better make them follow this strategy:

  object BasicOperators extends Strategy {
    def numPartitions = self.numPartitions

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      case r: RunnableCommand => ExecutedCommand(r) :: Nil

I will give this a try.

Contributor (Author)

Actually, I am not sure we should put them in BasicOperators. We cannot just create a RunnableCommand in ddl.scala, because SQLContext does not allow persistent tables and we need to throw that error in SparkStrategies. Also, I feel the code is clearer when everything related to the data source API is kept together.
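
For concreteness, a minimal sketch of what such a dedicated strategy could look like. This is a sketch only: it assumes the scope of SparkStrategies.scala (Strategy, ExecutedCommand, LogicalPlan, SparkPlan in scope), and the constructor fields of CreateTableUsing and CreateTempTableUsing are assumptions for illustration, not necessarily the ones in this PR:

  object DDLStrategy extends Strategy {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
      // Temporary tables only need the in-memory catalog, so any
      // SQLContext can execute them.
      case CreateTableUsing(tableName, provider, temporary, options) if temporary =>
        ExecutedCommand(CreateTempTableUsing(tableName, provider, options)) :: Nil

      // Persistent tables require a metastore, which a plain SQLContext
      // does not have, so the error is raised here during planning.
      case CreateTableUsing(_, _, _, _) =>
        sys.error("CREATE TABLE is only supported with HiveContext; " +
          "use CREATE TEMPORARY TABLE instead.")

      case _ => Nil
    }
  }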

Contributor

@yhuai, I wrote a draft version of this; can you have a look? (https://github.com/scwf/spark/compare/apache:master...scwf:createDataSourceTable?expand=1)

The reason we put `case r: RunnableCommand => ExecutedCommand(r)` in BasicOperators is that there is no need to create a new strategy for only one rule.

And after we refactor the command implementation in Spark SQL, newly added commands should follow RunnableCommand where possible, so that we can avoid adding a new strategy for each new command.

/cc @marmbrus

@rxin (Contributor) commented Jan 13, 2015

Jenkins, retest this please.

@SparkQA commented Jan 13, 2015

Test build #25456 has started for PR 3960 at commit c07cbc6.

  • This patch merges cleanly.

@SparkQA commented Jan 13, 2015

Test build #25456 has finished for PR 3960 at commit c07cbc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class QualifiedTableName(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25456/
Test PASSed.

@SparkQA commented Jan 13, 2015

Test build #25482 has started for PR 3960 at commit 069c235.

  • This patch merges cleanly.

@@ -50,8 +52,76 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
/** Connection to hive metastore. Usages should lock on `this`. */
protected[hive] val client = Hive.get(hive.hiveconf)

// TODO: Use this everywhere instead of tuples or databaseName, tableName,.
/** A fully qualified identifier for a table (i.e., database.tableName) */
case class QualifiedTableName(database: String, name: String) {
Contributor

This doesn't really match the rest of the API any more now that we have the concept of a tableIdentifier. We can fix this in a followup PR.

@marmbrus (Contributor)

LGTM once tests pass.

@SparkQA commented Jan 13, 2015

Test build #25482 has finished for PR 3960 at commit 069c235.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class QualifiedTableName(database: String, name: String)
    • case class CreateMetastoreDataSource(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25482/
Test PASSed.
