[SPARK-2710] [SQL] Build SchemaRDD from a JdbcRDD with MetaData #1612
Conversation
Can one of the admins verify this patch?
@@ -67,6 +69,28 @@ class JdbcRDD[T: ClassTag](
  }).toArray
}

def getSchema: Seq[(String, Int, Boolean)] = {
Here I tried to return a java.sql.ResultSetMetaData object and then build the Seq[(String, Int, Boolean)] for the SchemaRDD in Spark SQL scope, but when I ran this SchemaRDD I got "java.io.NotSerializableException: org.postgresql.jdbc4.Jdbc4ResultSetMetaData".
So I let this method return a Seq[(String, Int, Boolean)] instead, and in Spark SQL scope map that Seq[(String, Int, Boolean)] to Seq[StructField].
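For illustration, a minimal sketch of that approach (the helpers describeColumns and toStructField are hypothetical names, and the org.apache.spark.sql.types import assumes the modern package path; in the 1.x era these types lived under catalyst): flatten the non-serializable metadata into plain tuples on the driver, then map the tuples to StructFields on the Spark SQL side.

import java.sql.{ResultSetMetaData, Types}
import org.apache.spark.sql.types._

// Hypothetical helper: flatten the non-serializable ResultSetMetaData into a
// serializable (name, java.sql.Types constant, nullable) triple per column.
def describeColumns(meta: ResultSetMetaData): Seq[(String, Int, Boolean)] =
  (1 to meta.getColumnCount).map { i =>
    (meta.getColumnName(i),
     meta.getColumnType(i),
     meta.isNullable(i) != ResultSetMetaData.columnNoNulls)
  }

// Hypothetical Spark SQL-side mapping; only a few JDBC types are shown.
def toStructField(col: (String, Int, Boolean)): StructField = {
  val (name, sqlType, nullable) = col
  val dataType = sqlType match {
    case Types.INTEGER   => IntegerType
    case Types.VARCHAR   => StringType
    case Types.TIMESTAMP => TimestampType
    case Types.BINARY    => BinaryType
    case other           => sys.error(s"Unsupported jdbc datatype: $other")
  }
  StructField(name, dataType, nullable)
}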
Can you add this as a comment here?
We should probably also make this private[spark].
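As a quick sketch, the suggested visibility change would look like this (the method body is elided here):

// Restrict the schema accessor to Spark-internal callers, per review.
private[spark] def getSchema: Seq[(String, Int, Boolean)] = { /* ... */ }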
Test suite added.
@@ -57,6 +61,8 @@ class JdbcRDD[T: ClassTag](
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
  extends RDD[T](sc, Nil) with Logging {

private var schema: Seq[(String, Int, Boolean)] = null
Move the schema stuff to JdbcResultSetRDD? We'd better keep Spark core clean and follow the same implementation pattern as the other core RDDs.
Yep, I tried to do it like you said before, but there is no public method or attribute to get the ResultSet or Statement from this JdbcRDD in Spark core, so in JdbcResultSetRDD I have no idea how we can get the metadata from the JdbcRDD. Otherwise we could do something like jdbcRDD.head and get the metadata from the first row, but that may execute the whole query at the plan phase.
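One possible alternative, sketched here under the assumption that the driver supports it: many JDBC drivers can describe a query's result shape without executing it, via PreparedStatement.getMetaData (some drivers return null from this call, so it is driver-dependent). The planTimeSchema helper name is hypothetical.

import java.sql.{Connection, ResultSetMetaData}

// Hypothetical helper: fetch column metadata at plan time without running the
// full query. PreparedStatement.getMetaData may return null on some drivers.
def planTimeSchema(getConnection: () => Connection,
                   sql: String): Seq[(String, Int, Boolean)] = {
  val conn = getConnection()
  try {
    val meta = conn.prepareStatement(sql).getMetaData
    if (meta == null) Seq.empty
    else (1 to meta.getColumnCount).map { i =>
      (meta.getColumnName(i),
       meta.getColumnType(i),
       meta.isNullable(i) != ResultSetMetaData.columnNoNulls)
    }
  } finally {
    conn.close() // always release the driver-side connection
  }
}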
ok to test
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
…ach, use new SchemaRDD API in test suite to fix warning
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
case BinaryType => row.update(i, rs.getBytes(i + 1))
case TimestampType => row.update(i, rs.getTimestamp(i + 1))
case _ => sys.error(
  s"Unsupported jdbc datatype")
Would be good to print what the unsupported type is. Also, try to wrap at the highest syntactic level, for example:
case unsupportedType =>
  sys.error(s"Unsupported jdbc datatype: $unsupportedType")
(Though actually in this case I think it'll all fit on one line.)
Thanks for working on this! Several people have asked for it :) Aside from the few minor style comments, it would be great if we could add APIs for Java and Python as well.
  return schema
}

val conn = getConnection()
Is this connection guaranteed to get closed? It won't benefit from the addOnCompleteCallback below, for instance.
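A minimal sketch of one way to guarantee cleanup here, assuming the metadata lookup runs eagerly on the driver (the actual metadata-reading code is elided):

// A driver-side metadata lookup never reaches the task's
// addOnCompleteCallback, so close the connection explicitly.
val conn = getConnection()
try {
  // ... read ResultSetMetaData and populate `schema` here ...
} finally {
  conn.close()
}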
Thanks for the review, I will try to improve it soon. Adding more external data sources is always helpful; then we can use Spark SQL as a data integration platform. And of course SQL92 is also important: right now Spark SQL is more like a tool for querying Hadoop files.
Thanks for working on this! I think this will be a really useful addition. However, with the new external data sources API that is part of 1.2, I think it might be better to do this as an external library (for example: https://github.com/databricks/spark-avro). This would make it easier to make releases, and also help us keep Spark core's size manageable. If you agree, maybe we can close this issue? Let me know if you have any questions.
SPARK-2710 Build SchemaRDD from a JdbcRDD with MetaData
Also includes a small bug fix on JdbcRDD, line 109: it seems conn will never be closed.