
[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... #448


Closed
wants to merge 3 commits

Conversation

kanzhang (Contributor)

... that do not change schema

@kanzhang (Contributor Author)

First try, please comment. Not very comfortable with methods that take other RDDs, like intersect, subtract and union, since the caller has to make sure they are of the same schema.

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus (Contributor)

ok to test

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@marmbrus (Contributor)

Thanks for doing this!

I think we are actually okay for intersect and subtract, as anything in the result must be a row that was in the original RDD and thus must have a correct schema. If you intersect with a different schema you will get back an empty RDD. If you subtract with a different schema the subtraction will be a no-op and you'll get back the original RDD.

Union is a little more troublesome. We could check the schema and throw an error if they don't match, but that is kinda changing the semantics relative to the standard union call on RDD. Also, when we do a SQL union we do type widening, so just calling RDD union and returning a SchemaRDD is a little weird.

So, I'd propose we leave union out, as users that want SQL semantics here can already call unionAll. @mateiz might have thoughts here too.

A few other methods we can add that also don't change the schema:

  • distinct() with no numPartitions
  • repartition
  • setName(...) ?
  • randomSplit (not sure if this is okay since Array is invariant)
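A minimal sketch of what these overrides could look like, following the applySchema pattern used elsewhere in this patch (signatures are approximate for the Spark 1.0-era API; the implicit Ordering parameter reflects the "newly added Ordering param" mentioned later in the commit log):

// Sketch only: each override delegates to the base RDD method and re-wraps
// the result, since none of these operations changes the row schema.
override def distinct(): SchemaRDD =
  applySchema(super.distinct())

override def repartition(numPartitions: Int)(implicit ord: Ordering[Row] = null): SchemaRDD =
  applySchema(super.repartition(numPartitions))

// setName mutates the receiver rather than building a new RDD, so it can
// simply call super for the side effect and return this.
override def setName(name: String): SchemaRDD = {
  super.setName(name)
  this
}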

override def intersection(other: RDD[Row], numPartitions: Int): SchemaRDD =
applySchema(super.intersection(other, numPartitions))

override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
Contributor

There is already a sample method that returns SchemaRDD above.

Contributor Author

Yes, I noticed it. It has a different signature and doesn't override the base sample() method. I thought this was done on purpose - you wanted to keep both implementations, right?

Contributor Author

Ah, I see the execution eventually calls the base method, so it is equivalent?

Contributor

Yeah, you are going to end up getting the same thing. I'd say we drop this one and leave the other. Right now it probably doesn't matter, but the other one is lazy and gives the optimizer a chance to possibly improve things before actually executing.

Contributor Author

Isn't the base impl lazy also (till compute() is called)? It's kind of hard to tell the difference for users. What's your thinking behind keeping the base implementation around (not overriding it)?

Contributor

Oh, you are right. The base impl is probably lazy too. The distinction I was trying to make is that while normal RDD operations are lazy, they are not holistically optimized before execution. Whereas if we create a logical operator and defer the creation of RDDs, there may be some extra chances for optimization (at some point in the future). We definitely want to override the base impl, but we don't need to have multiple redundant methods for creating samples.

Also note that you might need to sync with the changes being made in #462 .
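To illustrate the distinction with a sketch (not the exact code in this patch; the Sample logical operator and the sqlContext/logicalPlan constructor shape are assumptions):

// RDD-level sampling: the sampled RDD is constructed via the base method and
// merely re-wrapped, so the optimizer never sees a sampling step in the plan.
override def sample(withReplacement: Boolean, fraction: Double, seed: Int): SchemaRDD =
  applySchema(super.sample(withReplacement, fraction, seed))

// Plan-level sampling: only a logical operator is recorded, so the optimizer
// can still rewrite the plan before any RDD is built.
def sample(fraction: Double, withReplacement: Boolean, seed: Int): SchemaRDD =
  new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, logicalPlan))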

Contributor Author

Thanks for the heads-up. Since #462 is already doing it, I'll skip it (I meant overriding it with the query-based one, too).

@marmbrus (Contributor)

Also, we should make the same changes to the Java and Python API if possible.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14248/

@mateiz (Contributor)

mateiz commented Apr 19, 2014

I agree with leaving union out and adding repartition, coalesce, and the other version of distinct. Also, these should definitely be added to Java and Python too.

@kanzhang (Contributor Author)

Thanks for your suggestions. I'll update.

Btw, I don't see PythonSchemaRDD in the code base yet; can I leave out Python for now?

@kanzhang (Contributor Author)

@marmbrus you are right, I can't override randomSplit() due to invariance of Array.
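For the record, a minimal self-contained illustration of the invariance issue (toy classes, not Spark code; the randomSplit signature is quoted from memory):

object InvarianceDemo extends App {
  class A
  class B extends A

  // Array is invariant in its element type: even though B <: A,
  // Array[B] is NOT a subtype of Array[A].
  val bs: Array[B] = Array(new B)
  // val as: Array[A] = bs  // does not compile: type mismatch

  // By the same reasoning, an override of
  //   def randomSplit(weights: Array[Double], seed: Long): Array[RDD[T]]
  // cannot narrow its result to Array[SchemaRDD].
  println(bs.length)
}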

How about cache(), persist(), unpersist()?

@marmbrus (Contributor)

> How about cache(), persist(), unpersist()?

Good catch!
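A sketch of the straightforward overrides (before the this.type approach discussed further down); these base methods mutate state and return the same RDD, so each override can call super for the side effect and return this (signatures approximate for the Spark 1.0-era API):

// Sketch; assumes import org.apache.spark.storage.StorageLevel in scope.
override def cache(): SchemaRDD = {
  super.cache()
  this
}

override def persist(newLevel: StorageLevel): SchemaRDD = {
  super.persist(newLevel)
  this
}

override def unpersist(blocking: Boolean = true): SchemaRDD = {
  super.unpersist(blocking)
  this
}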

> Btw, I don't see PythonSchemaRDD in the code base yet; can I leave out Python for now?

It is just called SchemaRDD and is located in python/pyspark/sql.py.

override def subtract(other: RDD[Row], p: Partitioner): SchemaRDD =
applySchema(super.subtract(other, p))

override def union(other: RDD[Row]): SchemaRDD =
Contributor

Don't forget to remove this one.

Contributor Author

done

@kanzhang (Contributor Author)

Oh, I see. Thanks.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@kanzhang (Contributor Author)

Hey, just pushed an update on the Scala and Java APIs. Wanted to get some feedback before I move on to Python. Please pay attention to the signatures of filter, intersection, and subtract in the Java API. Thanks.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14414/


// Common RDD functions

override def cache(): SchemaRDD = {
Contributor

I should have mentioned this before, but we could consider using this.type instead in the base RDD class for these methods. I'm not sure if that is breaking the API or too much Scala magic, though. @mateiz?

Contributor

I'd be okay trying it; my questions then are what it looks like in Scaladoc and what it looks like in Java. We should also double-check that Scala keeps binary compatibility for this kind of return type.

Contributor Author

Thanks for pointing it out, will verify this.

@marmbrus (Contributor)

Looking pretty good. Thanks again for working on this!

Jenkins, test this please.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@kanzhang (Contributor Author)

kanzhang commented May 1, 2014

@marmbrus @mateiz Here's an update on using this.type as the return type. See the following shell output. The result type of the setName() method has changed from org.apache.spark.rdd.RDD[Record] to rdd.type. Similar things happened for SchemaRDD and JavaSchemaRDD (which is not an RDD subclass). The benefit is we don't have to reimplement those methods in subclasses.

scala> rdd.setName("RDD")
res0: rdd.type = RDD ParallelCollectionRDD[0] at parallelize at <console>:16

scala> rdd
res1: org.apache.spark.rdd.RDD[Record] = RDD ParallelCollectionRDD[0] at parallelize at <console>:16

scala> srdd.setName("SCHEMA RDD")
res2: srdd.type = 
SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147

scala> srdd
res3: org.apache.spark.sql.SchemaRDD = 
SCHEMA RDD SchemaRDD[2] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147

scala> jsrdd.setName("JAVA SCHEMA RDD")
res4: jsrdd.type = 
JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147

scala> jsrdd
res5: org.apache.spark.sql.api.java.JavaSchemaRDD = 
JAVA SCHEMA RDD SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [key#0,value#1], MappedRDD[1] at map at basicOperators.scala:147

@kanzhang (Contributor Author)

kanzhang commented May 1, 2014

PS. Subclasses that override those methods may have to be updated and recompiled (like what I did in EdgeRDD and VertexRDD). Better ideas?

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14611/

@@ -138,7 +138,7 @@ abstract class RDD[T: ClassTag](
    * it is computed. This can only be used to assign a new storage level if the RDD does not
    * have a storage level set yet..
    */
-  def persist(newLevel: StorageLevel): RDD[T] = {
+  def persist(newLevel: StorageLevel): this.type = {
     // TODO: Handle changes of StorageLevel
Contributor

I am fairly ignorant of Scala; I am not sure I follow. Where is type coming from, and what is it exactly?
Also, does this change mean it is an incompatible interface change?

Contributor

It's to allow child classes to not have to override functions like persist and cache that are used for chaining:

http://scalada.blogspot.com/2008/02/thistype-for-chaining-method-calls.html
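A self-contained sketch of the idiom with toy classes (not Spark code):

object ThisTypeDemo extends App {
  abstract class Node {
    private var label = ""
    // this.type is the singleton type of the receiver, so every call site
    // gets back the receiver's own static type, with no override needed.
    def setLabel(l: String): this.type = { label = l; this }
  }

  class Leaf extends Node {
    def leafOnly: String = "leaf-specific method"
  }

  val leaf: Leaf = new Leaf().setLabel("x") // static type is Leaf, not Node
  println(leaf.setLabel("y").leafOnly)      // chaining preserves the subtype
}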

Contributor

Neat, thanks!

Contributor

So I guess this can't be applied to checkpointRDD and randomSplit?
What about things like filter, distinct, repartition, sample, filterWith, etc.?

Contributor

@mridulm if you look at this patch, it explicitly overrides those for SchemaRDD. You can't use this.type there because the return type is actually a new RDD class (FilteredRDD and so on).

Contributor Author

@mridulm agree with Patrick; you have to return this for a this.type return type.
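A toy sketch of that constraint (hypothetical classes, not Spark code): this.type typechecks only when the method returns the receiver itself, which is why mutate-and-return methods qualify while transformations that construct a new object (like filter building a FilteredRDD) do not.

object ReturnTypeDemo extends App {
  class Box(var name: String) {
    // OK: returns the receiver itself, so this.type is satisfied.
    def setName(n: String): this.type = { name = n; this }

    // A transformation constructs a *new* object, so this.type would not
    // typecheck here; the return type has to stay Box (or a supertype):
    // def renamedCopy(n: String): this.type = new Box(n)  // does not compile
    def renamedCopy(n: String): Box = new Box(n)
  }
  class LabeledBox(name: String) extends Box(name)

  val lb: LabeledBox = new LabeledBox("a").setName("b") // stays LabeledBox
  println(lb.renamedCopy("c").name)                     // widens to Box
}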

Contributor

Thanks for clarifying, in retrospect that looks obvious!

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14731/

@pwendell (Contributor)

pwendell commented May 6, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14738/

@pwendell (Contributor)

pwendell commented May 6, 2014

@kanzhang hey, you'll need to silence some of the binary compatibility checks in project/MimaBuild.scala:

excludeSparkClass("org.apache.spark.graphx.VertexRDD")
excludeSparkClass("org.apache.spark.graphx.EdgeRDD")

@kanzhang (Contributor Author)

kanzhang commented May 7, 2014

@pwendell thanks for the heads-up. Made those changes; let's see how it goes.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14746/

@pwendell (Contributor)

pwendell commented May 7, 2014

Jenkins, retest this please.

@kanzhang (Contributor Author)

kanzhang commented May 7, 2014

@pwendell the build didn't seem to start?

@pwendell (Contributor)

pwendell commented May 7, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14762/

@pwendell (Contributor)

pwendell commented May 7, 2014

Thanks for updating this. I'm merging it.

asfgit closed this in 967635a May 7, 2014
asfgit pushed a commit that referenced this pull request May 7, 2014

[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations that do not change schema

Author: Kan Zhang <[email protected]>

Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:

111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
(cherry picked from commit 967635a)

Signed-off-by: Patrick Wendell <[email protected]>
kanzhang deleted the SPARK-1460 branch May 9, 2014 04:31
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations that do not change schema

Author: Kan Zhang <[email protected]>

Closes apache#448 from kanzhang/SPARK-1460 and squashes the following commits:

111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020