[SPARK-2710] [SQL] Build SchemaRDD from a JdbcRDD with MetaData #1612
Conversation
Can one of the admins verify this patch?
@@ -67,6 +69,28 @@ class JdbcRDD[T: ClassTag](
  }).toArray
}

def getSchema: Seq[(String, Int, Boolean)] = {
Here I tried to return a java.sql.ResultSetMetaData object and then build the Seq[(String, Int, Boolean)] for the SchemaRDD in Spark SQL scope, but when I ran this SchemaRDD I got "java.io.NotSerializableException: org.postgresql.jdbc4.Jdbc4ResultSetMetaData".
So I let this method return a Seq[(String, Int, Boolean)] instead, and in Spark SQL scope map that Seq[(String, Int, Boolean)] to Seq[StructField].
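For illustration, a minimal sketch of that approach (the helpers describeColumns and toStructField are hypothetical names, and the org.apache.spark.sql.types import assumes the modern package path; in the 1.x era these types lived under catalyst): flatten the non-serializable metadata into plain tuples on the driver, then map the tuples to StructFields on the Spark SQL side.

import java.sql.{ResultSetMetaData, Types}
import org.apache.spark.sql.types._

// Hypothetical helper: flatten the non-serializable ResultSetMetaData into a
// serializable (name, java.sql.Types constant, nullable) triple per column.
def describeColumns(meta: ResultSetMetaData): Seq[(String, Int, Boolean)] =
  (1 to meta.getColumnCount).map { i =>
    (meta.getColumnName(i),
     meta.getColumnType(i),
     meta.isNullable(i) != ResultSetMetaData.columnNoNulls)
  }

// Hypothetical Spark SQL-side mapping; only a few JDBC types are shown.
def toStructField(col: (String, Int, Boolean)): StructField = {
  val (name, sqlType, nullable) = col
  val dataType = sqlType match {
    case Types.INTEGER   => IntegerType
    case Types.VARCHAR   => StringType
    case Types.TIMESTAMP => TimestampType
    case Types.BINARY    => BinaryType
    case other           => sys.error(s"Unsupported jdbc datatype: $other")
  }
  StructField(name, dataType, nullable)
}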
Can you add this as a comment here?
We should probably also make this private[spark].
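As a quick sketch, the suggested visibility change would look like this (the method body is elided here):

// Restrict the schema accessor to Spark-internal callers, per review.
private[spark] def getSchema: Seq[(String, Int, Boolean)] = { /* ... */ }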
Test suite added.
@@ -57,6 +61,8 @@ class JdbcRDD[T: ClassTag](
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
  extends RDD[T](sc, Nil) with Logging {

private var schema: Seq[(String, Int, Boolean)] = null
Move the schema stuff to JdbcResultSetRDD? We'd better keep Spark core clean and follow the same implementation pattern as the other core RDDs.
Yep, I tried to do it like you said before, but there is no public method or attribute to get the ResultSet or Statement from this JdbcRDD in Spark core, so in JdbcResultSetRDD I have no idea how we can get the metadata from the JdbcRDD. Otherwise we could do something like jdbcRDD.head and get the metadata from the first row, but that may execute the whole query at the plan phase.
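One possible alternative, sketched here under the assumption that the driver supports it: many JDBC drivers can describe a query's result shape without executing it, via PreparedStatement.getMetaData (some drivers return null from this call, so it is driver-dependent). The planTimeSchema helper name is hypothetical.

import java.sql.{Connection, ResultSetMetaData}

// Hypothetical helper: fetch column metadata at plan time without running the
// full query. PreparedStatement.getMetaData may return null on some drivers.
def planTimeSchema(getConnection: () => Connection,
                   sql: String): Seq[(String, Int, Boolean)] = {
  val conn = getConnection()
  try {
    val meta = conn.prepareStatement(sql).getMetaData
    if (meta == null) Seq.empty
    else (1 to meta.getColumnCount).map { i =>
      (meta.getColumnName(i),
       meta.getColumnType(i),
       meta.isNullable(i) != ResultSetMetaData.columnNoNulls)
    }
  } finally {
    conn.close() // always release the driver-side connection
  }
}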
ok to test
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
…ach, use new SchemaRDD API in test suite to fix warning
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
QA tests have started for PR 1612 at commit
QA tests have finished for PR 1612 at commit
case BinaryType => row.update(i, rs.getBytes(i + 1))
case TimestampType => row.update(i, rs.getTimestamp(i + 1))
case _ => sys.error(
  s"Unsupported jdbc datatype")
Would be good to print what the unsupported type is. Also, try to wrap at the highest syntactic level, for example:
case unsupportedType =>
  sys.error(s"Unsupported jdbc datatype: $unsupportedType")
(Though actually in this case I think it'll all fit on one line.)
Thanks for working on this! Several people have asked for it :) Aside from the few minor style comments, it would be great if we could add APIs for Java and Python as well.
  return schema
}

val conn = getConnection()
Is this connection guaranteed to get closed? It won't benefit from the addOnCompleteCallback below, for instance.
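A minimal sketch of one way to guarantee cleanup here, assuming the metadata lookup runs eagerly on the driver (the actual metadata-reading code is elided):

// A driver-side metadata lookup never reaches the task's
// addOnCompleteCallback, so close the connection explicitly.
val conn = getConnection()
try {
  // ... read ResultSetMetaData and populate `schema` here ...
} finally {
  conn.close()
}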
Thanks for the review, I will try to improve it soon. Adding more external data sources is always helpful; then we can use Spark SQL as a data integration platform. And of course SQL92 is also important: right now Spark SQL is more like a tool for querying Hadoop files.
Thanks for working on this! I think this will be a really useful addition. However, with the new external data sources API that is part of 1.2, I think it might be better to do this as an external library (for example: https://github.com/databricks/spark-avro). This would make it easier to make releases, and also help us keep Spark core's size manageable. If you agree, maybe we can close this issue? Let me know if you have any questions.
SPARK-2710 Build SchemaRDD from a JdbcRDD with MetaData
Also includes a small bug fix on JdbcRDD, line 109: it seems conn will never be closed.