
[SPARK-2710] [SQL] Build SchemaRDD from a JdbcRDD with MetaData #1612


Closed
wants to merge 7 commits

Conversation


@chutium chutium commented Jul 27, 2014

SPARK-2710 Build SchemaRDD from a JdbcRDD with MetaData

Also includes a small bug fix in JdbcRDD, line 109: it seems conn will never be closed.
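The leak being fixed follows a common JDBC pattern: the statement and result set get closed, but the connection does not. A minimal, dependency-free sketch of the loan pattern that guarantees the close (the helper name `withResource` is ours, not from the PR):

```scala
// Sketch of the loan pattern the fix aims at: the resource is opened,
// used, and closed in `finally`, so it is released even if `use` throws.
// With a JDBC connection, `open` would be something like
// `() => DriverManager.getConnection(url)`.
def withResource[R <: AutoCloseable, T](open: () => R)(use: R => T): T = {
  val resource = open()
  try use(resource)
  finally resource.close() // runs on both success and failure
}
```

The same guarantee is what the `addOnCompleteCallback`-based cleanup in JdbcRDD's `compute` is meant to provide for the per-partition connection.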

@chutium chutium changed the title SPARK-2710 Build SchemaRDD from a JdbcRDD with MetaData [SPARK-2710] [SQL] Build SchemaRDD from a JdbcRDD with MetaData Jul 27, 2014
@AmplabJenkins

Can one of the admins verify this patch?

@@ -67,6 +69,28 @@ class JdbcRDD[T: ClassTag](
}).toArray
}

def getSchema: Seq[(String, Int, Boolean)] = {
Contributor Author

Here I tried to return a java.sql.ResultSetMetaData object and then build the Seq[(String, Int, Boolean)] for the SchemaRDD on the Spark SQL side, but when I ran that SchemaRDD I got "java.io.NotSerializableException: org.postgresql.jdbc4.Jdbc4ResultSetMetaData".

So I let this method return a Seq[(String, Int, Boolean)] instead, and on the Spark SQL side map that Seq[(String, Int, Boolean)] to Seq[StructField].
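The mapping step described above can be sketched without a Spark dependency. Plain `(name, typeName, nullable)` tuples stand in for `StructField` here, and the set of supported `java.sql.Types` codes is illustrative, not the PR's exact list:

```scala
import java.sql.Types

// Turn the serializable Seq[(column name, java.sql.Types code, nullable)]
// extracted from ResultSetMetaData into Spark-SQL-style fields.
// The type mapping below is a simplified illustration.
def toFields(schema: Seq[(String, Int, Boolean)]): Seq[(String, String, Boolean)] =
  schema.map { case (name, sqlType, nullable) =>
    val typeName = sqlType match {
      case Types.INTEGER                  => "IntegerType"
      case Types.BIGINT                   => "LongType"
      case Types.DOUBLE | Types.FLOAT     => "DoubleType"
      case Types.VARCHAR | Types.CHAR     => "StringType"
      case Types.BOOLEAN | Types.BIT      => "BooleanType"
      case Types.TIMESTAMP                => "TimestampType"
      case Types.BINARY | Types.VARBINARY => "BinaryType"
      case other => sys.error(s"Unsupported jdbc datatype: $other")
    }
    (name, typeName, nullable)
  }
```

Because the input is a Seq of plain tuples rather than ResultSetMetaData, it serializes cleanly across the driver/executor boundary.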

Contributor

Can you add this as a comment here?

Contributor

We should probably also make this private[spark].

@chutium
Contributor Author

chutium commented Jul 30, 2014

Test Suite added

@@ -57,6 +61,8 @@ class JdbcRDD[T: ClassTag](
mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
extends RDD[T](sc, Nil) with Logging {

private var schema: Seq[(String, Int, Boolean)] = null
Contributor

Move the schema stuff to JdbcResultSetRDD? We'd better keep Spark core clean and follow the same implementation pattern as the other core RDDs.

Contributor Author

Yep, I tried doing it the way you suggested, but there is no public method or attribute for getting the ResultSet or Statement from JdbcRDD in Spark core, so in JdbcResultSetRDD I have no way to get the metadata from the JdbcRDD. Alternatively we could do something like jdbcRDD.head and get the metadata from the first row, but that may execute the whole query at the planning phase.

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Aug 30, 2014

QA tests have started for PR 1612 at commit 917a753.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 30, 2014

QA tests have finished for PR 1612 at commit 917a753.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 1612 at commit 566d154.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 1612 at commit 566d154.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 1612 at commit 2013303.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 1612 at commit 2013303.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case BinaryType => row.update(i, rs.getBytes(i + 1))
case TimestampType => row.update(i, rs.getTimestamp(i + 1))
case _ => sys.error(
s"Unsupported jdbc datatype")
Contributor

Would be good to print what the unsupported type is. Also, try to wrap at the highest syntactic level, for example:

case unsupportedType =>
  sys.error(s"Unsupported jdbc datatype: $unsupportedType")

(Though actually in this case I think it'll all fit on one line).
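A self-contained sketch of the suggested style, binding the unmatched case to a name so the message says which JDBC type code was rejected (the supported cases here are illustrative stand-ins for the PR's full match):

```scala
import java.sql.Types

// Instead of `case _ => sys.error("Unsupported jdbc datatype")`, bind the
// unmatched value so the error message carries the offending type code.
def describe(sqlType: Int): String = sqlType match {
  case Types.BINARY    => "BinaryType"
  case Types.TIMESTAMP => "TimestampType"
  case unsupportedType => sys.error(s"Unsupported jdbc datatype: $unsupportedType")
}
```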

@marmbrus
Contributor

marmbrus commented Sep 9, 2014

Thanks for working on this! Several people have asked for it :)

Aside from the few minor style comments, it would be great if we could add APIs for Java and Python as well.

return schema
}

val conn = getConnection()
Contributor

Is this connection guaranteed to get closed? It won't benefit from the addOnCompleteCallback below, for instance.

@chutium
Contributor Author

chutium commented Sep 11, 2014

Thanks for the review, I will try to improve it soon. Adding more external data sources is always helpful, since then we can use Spark SQL as a data integration platform; and of course SQL92 is also important. Right now Spark SQL is more like a tool for querying Hadoop files.

@marmbrus
Contributor

marmbrus commented Dec 2, 2014

Thanks for working on this! I think this will be a really useful addition. However, with the new external data sources api that is part of 1.2, I think it might be better to do this as an external library (for example: https://github.com/databricks/spark-avro). This would make it easier to make releases, and also help us keep spark core's size manageable. If you agree, maybe we can close this issue? Let me know if you have any questions.

@asfgit asfgit closed this in b0a46d8 Dec 2, 2014