[SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431


Closed · wants to merge 23 commits

Conversation

@scwf (Contributor) commented Nov 24, 2014

This PR adds support for defining a schema in foreign DDL commands. Currently, foreign DDL supports commands like:

CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")

With this PR, users can define the schema explicitly instead of inferring it from the file, so DDL commands like the following are supported:

CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
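The new grammar lets the user supply column definitions such as (a int, b string). The real PR extends Spark's DDL parser for this; the following is only a minimal, self-contained sketch of turning such a column list into schema fields, with parseColumns and the simplified StructField being hypothetical stand-ins, not code from the PR:

```scala
// Simplified stand-in for Spark SQL's StructField, for illustration only.
case class StructField(name: String, dataType: String)

// Hypothetical helper: split "a int, b string" into typed fields.
def parseColumns(cols: String): Seq[StructField] =
  cols.split(",").toSeq
    .map(_.trim.split("\\s+"))
    .collect { case Array(name, dataType) => StructField(name, dataType) }
```

The actual implementation handles quoted identifiers and complex types, but the shape of the result, a list of (name, type) fields, is the same.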

SparkQA commented Nov 24, 2014

Test build #23787 has started for PR 3431 at commit c203ce2.

  • This patch merges cleanly.

SparkQA commented Nov 24, 2014

Test build #23787 has finished for PR 3431 at commit c203ce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23787/

SparkQA commented Dec 2, 2014

Test build #24044 has started for PR 3431 at commit 12fab0a.

  • This patch merges cleanly.

SparkQA commented Dec 2, 2014

Test build #24044 has finished for PR 3431 at commit 12fab0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ParquetRelation2(

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24044/

@marmbrus (Contributor) commented Dec 2, 2014

Cool feature :) I wanted to include this in the first cut but ran out of time. It's too late for 1.2, but I'll try to review this soon.

SparkQA commented Dec 12, 2014

Test build #24396 has started for PR 3431 at commit 6f1259c.

  • This patch merges cleanly.

SparkQA commented Dec 12, 2014

Test build #24396 has finished for PR 3431 at commit 6f1259c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Analyzer(catalog: Catalog,
    • case class ParquetRelation2(

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24396/

import org.apache.spark.sql.sources._

private[sql] class DefaultSource extends RelationProvider {
  /** Returns a new base relation with the given parameters. */
  override def createRelation(
      sqlContext: SQLContext,
-     parameters: Map[String, String]): BaseRelation = {
+     parameters: Map[String, String],
+     schema: Option[StructType]): BaseRelation = {
Contributor:

We cannot change the function signature, otherwise we will break existing libraries. Instead, I think we need to create a new interface, maybe SchemaRelationProvider?

Contributor Author:

Or could we use a default value for the schema: schema: Option[StructType] = None?

Contributor:

Default parameter values preserve only source compatibility, not binary compatibility: a library compiled against the old one-argument signature still links against the old method descriptor, so it would break at runtime.
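To make this concrete, here is a minimal, self-contained sketch of what a default argument buys and does not buy; the trait and types below are simplified stand-ins for illustration, not the actual Spark interfaces:

```scala
// Simplified stand-in types, for illustration only (not the real Spark API).
trait BaseRelation

trait RelationProviderWithDefault {
  // Adding "= None" keeps old *source* code compiling: the compiler rewrites
  // one-argument call sites to pass a generated default accessor. But the
  // method descriptor itself now takes two arguments, so a library compiled
  // against the old one-argument signature fails to link at runtime.
  def createRelation(parameters: Map[String, String],
                     schema: Option[String] = None): BaseRelation
}

object DemoSource extends RelationProviderWithDefault {
  def createRelation(parameters: Map[String, String],
                     schema: Option[String]): BaseRelation = new BaseRelation {}
}
```

A one-argument call such as DemoSource.createRelation(Map("path" -> "x.avro")) still compiles, which is exactly the source compatibility that default values provide; the binary incompatibility only shows up for previously compiled callers.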

@marmbrus (Contributor)

Thanks for working on this! I know a couple of people who want to use this in data sources they are writing.

@marmbrus (Contributor)

ping. any progress here?

@scwf (Contributor Author) commented Dec 30, 2014

Hi @marmbrus, still working on this; I will update it tomorrow.

SparkQA commented Dec 30, 2014

Test build #24897 has started for PR 3431 at commit 44eb70c.

  • This patch merges cleanly.

SparkQA commented Dec 30, 2014

Test build #24897 has finished for PR 3431 at commit 44eb70c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider

AmplabJenkins: Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24897/

SparkQA commented Dec 30, 2014

Test build #24898 has started for PR 3431 at commit 02a662c.

  • This patch merges cleanly.

SparkQA commented Dec 30, 2014

Test build #24898 has finished for PR 3431 at commit 02a662c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider

AmplabJenkins: Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25279/

@scwf (Contributor Author) commented Jan 9, 2015

retest this please

SparkQA commented Jan 9, 2015

Test build #25283 has started for PR 3431 at commit f336a16.

  • This patch merges cleanly.

protected def cleanIdentifier(ident: String): String = ident match {
  case escapedIdentifier(i) => i
  case plainIdent => plainIdent
}
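For context, a self-contained sketch of what this helper does; the escapedIdentifier regex below is an assumption based on the surrounding code, not copied from the PR:

```scala
// Assumed definition: a regex whose capture group is the identifier text
// between backticks. Scala regex values act as extractors in match patterns.
val escapedIdentifier = "`([^`]+)`".r

// Strips surrounding backticks from a quoted identifier, if present.
def cleanIdentifier(ident: String): String = ident match {
  case escapedIdentifier(i) => i  // "`foo`" becomes "foo"
  case plainIdent => plainIdent   // unquoted identifiers pass through
}
```

The review below concludes that the parser already unescapes backticks, making this helper redundant.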
Contributor:

It seems that when we use ident, the parser automatically takes care of backticks, so we can remove this. Sorry, I just noticed it.

Contributor Author:

ok

Contributor:

Thank you:)

SparkQA commented Jan 9, 2015

Test build #25287 has started for PR 3431 at commit a852b10.

  • This patch merges cleanly.

SparkQA commented Jan 9, 2015

Test build #25283 has finished for PR 3431 at commit f336a16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25283/

SparkQA commented Jan 9, 2015

Test build #25287 has finished for PR 3431 at commit a852b10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends SchemaRelationProvider
    • case class ParquetRelation2(
    • trait SchemaRelationProvider

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25287/

def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String],
    schema: Option[StructType]): BaseRelation
Contributor:

Why is this an Option? We have two traits, and Option is not very friendly to Java callers.

Contributor Author:

My initial idea was to stay compatible with the old trait; since we will have two traits, I will fix this.
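The design the discussion converges on, two separate traits with SchemaRelationProvider taking a required (non-Option) schema, can be sketched as follows; StructField, StructType and BaseRelation here are simplified stand-ins for the real Spark classes, and the SQLContext parameter is omitted:

```scala
// Simplified stand-ins for the real Spark SQL types, for illustration only.
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])
trait BaseRelation { def schema: StructType }

// Original interface: the data source infers its own schema (e.g. from the file).
trait RelationProvider {
  def createRelation(parameters: Map[String, String]): BaseRelation
}

// New interface: receives the user-defined schema from the DDL directly.
// Making the schema a required parameter (rather than an Option) keeps the
// two code paths in separate traits and is friendlier to Java implementers.
trait SchemaRelationProvider {
  def createRelation(parameters: Map[String, String],
                     userSchema: StructType): BaseRelation
}

// A toy source that simply exposes the schema the user declared in the DDL.
class AvroLikeSource extends SchemaRelationProvider {
  override def createRelation(parameters: Map[String, String],
                              userSchema: StructType): BaseRelation =
    new BaseRelation { val schema = userSchema }
}
```

A source that supports both modes can mix in both traits; Spark then picks the schema-aware overload when the DDL supplies column definitions.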

@yhuai (Contributor) commented Jan 10, 2015

@scwf I have done it and will have a PR to your branch.

@yhuai (Contributor) commented Jan 10, 2015

scwf#22

Remove Option from createRelation.
@scwf (Contributor Author) commented Jan 10, 2015

ok, merged!

SparkQA commented Jan 10, 2015

Test build #25354 has started for PR 3431 at commit 7e79ce5.

  • This patch merges cleanly.

SparkQA commented Jan 10, 2015

Test build #25354 has finished for PR 3431 at commit 7e79ce5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait SchemaRelationProvider

AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25354/

@marmbrus (Contributor)

Thanks for working on this guys! Merging to master.

@yhuai can you clarify the difference between SchemaRelationProvider and RelationProvider in the scala doc in your next PR?

asfgit closed this in 693a323 on Jan 10, 2015
@yhuai (Contributor) commented Jan 10, 2015

Yeah, no problem.

scwf deleted the ddl branch on January 10, 2015 23:23
asfgit pushed a commit that referenced this pull request Jan 13, 2015
With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs.

Author: Yin Huai <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits:

069c235 [Yin Huai] Make exception messages user friendly.
c07cbc6 [Yin Huai] Get the location of test file in a correct way.
4456e98 [Yin Huai] Test data.
5315dfc [Yin Huai] rxin's comments.
7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API.
aeaf4b3 [Yin Huai] Add comments.
06f9b0c [Yin Huai] Revert unnecessary changes.
feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2
172db80 [Yin Huai] Fix unit test.
49bf1ac [Yin Huai] Unit tests.
8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431
f47fda1 [Yin Huai] Unit tests.
2b59723 [Michael Armbrust] Set external when creating tables
c00bb1b [Michael Armbrust] Don't use reflection to read options
1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist
6edc710 [Michael Armbrust] Add tests.
d7da491 [Michael Armbrust] First draft of persistent tables.
7 participants