[SPARK-3936] Remove Bytecode Inspection for Join Elimination #2815

jegonzal · 2014-10-15T19:42:55Z

Removing bytecode inspection from triplet operations and introducing explicit join elimination flags. The explicit flags make the join elimination more robust.

…explicit join elimination flags.

jegonzal · 2014-10-15T19:44:16Z

@ankurdave and @rxin I have not updated the applications to use the new explicit flags. I will do that in this PR pending approval for the API changes.

Ishiihara · 2014-10-15T19:54:33Z

graphx/src/main/scala/org/apache/spark/graphx/Graph.scala

+   * @param mapUsesSrcAttr indicates whether the source vertex attribute should be included in
+   * the triplet. Setting this to false can improve performance if the source vertex attribute
+   * is not needed.
+   * @param mapUsesSrcAttr indicates whether the destination vertex attribute should be included in


mapUsesSrcAttr -> mapUsesDstAttr

SparkQA · 2014-10-15T19:54:38Z

QA tests have started for PR 2815 at commit d32ec1c.

This patch merges cleanly.

SparkQA · 2014-10-15T19:55:40Z

QA tests have finished for PR 2815 at commit d32ec1c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

jegonzal · 2014-10-15T21:03:45Z

Jenkins, test this please.

SparkQA · 2014-10-15T21:19:49Z

QA tests have started for PR 2815 at commit 459fa8c.

This patch merges cleanly.

SparkQA · 2014-10-15T21:19:51Z

QA tests have started for PR 2815 at commit 459fa8c.

This patch merges cleanly.

SparkQA · 2014-10-15T21:20:58Z

QA tests have finished for PR 2815 at commit 459fa8c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-10-15T21:21:02Z

QA tests have finished for PR 2815 at commit 459fa8c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-10-15T21:59:39Z

QA tests have started for PR 2815 at commit e8a8a8d.

This patch merges cleanly.

SparkQA · 2014-10-15T23:08:35Z

QA tests have finished for PR 2815 at commit e8a8a8d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jegonzal · 2014-10-30T00:10:04Z

What is the status on this patch? I would like to merge it soon so that the python GraphX API can support these additional flags.

rxin · 2014-10-31T22:48:26Z

graphx/src/main/scala/org/apache/spark/graphx/Graph.scala

-  def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2] = {
-    mapTriplets((pid, iter) => iter.map(map))
+  def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2,
+    mapUsesSrcAttr: Boolean = true, mapUsesDstAttr: Boolean = true): Graph[VD, ED2] = {


i'd create an enum for that

jegonzal · 2014-11-01T00:37:40Z

I added the TripletFields enum and updated all the dependent files. I can't "deprecate" the old API since they have the same function signature up to default arguments and I don't want to require the TripletFields be specified.

jegonzal · 2014-11-01T00:40:24Z

At this point I could also imagine actually having a separate function closure for each version.

mapTriplets(f: Edge => ED2)

mapTriplets(f: SrcEdge => ED2)

mapTriplets(f: DstEdge => ED2)

mapTriplets(f: Triplet => ED2)

Though to do this would require users to annotate their functions:

g.mapTriplets( (t: SrcEdge) => t.src )

What do you all think?

SparkQA · 2014-11-01T01:50:07Z

Test build #22667 has finished for PR 2815 at commit 4097ca7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2014-11-01T07:43:12Z

How about

mapSourceTriplets, mapDst ...

ankurdave · 2014-11-01T22:37:33Z

graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala

    vertices.cache()
    // Filter the vertices, reusing the partitioner and the index from this graph
    val newVerts = vertices.mapVertexPartitions(_.filter(vpred))
    // Filter the triplets. We must always upgrade the triplet view fully because vpred always runs
    // on both src and dst vertices
-    replicatedVertexView.upgrade(vertices, true, true)
+    replicatedVertexView.upgrade(vertices, tripletFields.useSrc, tripletFields.useDst)


I don't think this change is correct - see the comment on the line above.

ankurdave · 2014-11-01T22:45:38Z

This design looks good to me. The alternatives are all worse:

@jegonzal It won't work to overload on the function type like that because of type erasure:

$ scala
scala> class A { def f(g: Int => Int): Int = g(1); def f(g: String => Int): Int = g("hello") }
<console>:7: error: double definition:
def f(g: Int => Int): Int at line 7 and
def f(g: String => Int): Int at line 7
have same type after erasure: (g: Function1)Int
       class A { def f(g: Int => Int): Int = g(1); def f(g: String => Int): Int = g("hello") }

@rxin That will prevent us from passing triplet field information along, as we do in Pregel.

ankurdave · 2014-11-02T00:37:24Z

graphx/src/main/scala/org/apache/spark/graphx/Graph.scala

+   * is not needed.
+   * @param mapUsesDstAttr indicates whether the destination vertex attribute should be included in
+   * the triplet. Setting this to false can improve performance if the destination vertex attribute
+   * is not needed.


These comments are out of date now that we have tripletFields

rxin · 2014-11-02T04:44:32Z

Let's not change the old API. Just deprecate it, and then add this tripletFields thing to the new API Ankur is going to submit.

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements: 1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages. This is more efficient, providing a 20.2% speedup on PageRank over apache#3054 (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 403 s to 322 s). 2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936. Subsumes apache#2815.

@rxin

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements: 1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages. 2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936. Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition: 1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages. 2. Internal iterators in aggregateMessages are inlined into a while loop. In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s). Subsumes apache#2815. Also fixes SPARK-4173. Author: Ankur Dave <[email protected]> Closes apache#3100 from ankurdave/aggregateMessages and squashes the following commits: f5b65d0 [Ankur Dave] Address @rxin comments on apache#3054 and apache#3100 1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets 194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder c85076d [Ankur Dave] Readability improvements b567be2 [Ankur Dave] iter.foreach -> while loop 4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition

@rxin

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements: 1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages. 2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936. Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition: 1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages. 2. Internal iterators in aggregateMessages are inlined into a while loop. In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s). Subsumes #2815. Also fixes SPARK-4173. Author: Ankur Dave <[email protected]> Closes #3100 from ankurdave/aggregateMessages and squashes the following commits: f5b65d0 [Ankur Dave] Address @rxin comments on #3054 and #3100 1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets 194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder c85076d [Ankur Dave] Readability improvements b567be2 [Ankur Dave] iter.foreach -> while loop 4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition (cherry picked from commit faeb41d) Signed-off-by: Reynold Xin <[email protected]>

balduz · 2015-04-20T10:15:22Z

Is the Pregel API going to be updated so that it uses aggregateMessages instead of mapReduceTriplets? Or is the whole Pregel class going to be deprecated in favor of this method?

rxin · 2015-04-20T19:11:57Z

We should update Pregel to use aggregateMessages instead. Would you like to submit a patch for that?

ankurdave · 2015-04-20T19:26:19Z

@rxin @balduz We can update Pregel to use aggregateMessages without changing the API, but it won't provide any performance benefit since we'll still be handling iterators. Ultimately, the plan is to add a new version of Pregel that uses an aggregateMessages-style interface and has some other changes to expose activeness more explicitly.

Removing bytecode inspection from triplet operations and introducing …

2e47158

…explicit join elimination flags.

jegonzal added 2 commits October 15, 2014 12:46

Fixing typos

f25a6ac

dropping commented references to bytecode inspection

d32ec1c

Ishiihara reviewed Oct 15, 2014
View reviewed changes

jegonzal changed the title ~~Remove Bytecode Inspection for Join Elimination~~ [SPARK-3936] Remove Bytecode Inspection for Join Elimination Oct 15, 2014

jegonzal added 2 commits October 15, 2014 13:55

updating applications

8cf4726

fixing typo in docs

459fa8c

removing long lines

e8a8a8d

rxin reviewed Oct 31, 2014
View reviewed changes

Adding an enum for the triplet fields.

4097ca7

ankurdave reviewed Nov 1, 2014
View reviewed changes

ankurdave reviewed Nov 2, 2014
View reviewed changes

ankurdave mentioned this pull request Nov 5, 2014

[SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets #3100

Closed

jegonzal closed this Nov 12, 2014

[SPARK-3936] Remove Bytecode Inspection for Join Elimination #2815

[SPARK-3936] Remove Bytecode Inspection for Join Elimination #2815

Uh oh!

Conversation

jegonzal commented Oct 15, 2014

Uh oh!

jegonzal commented Oct 15, 2014

Uh oh!

Ishiihara Oct 15, 2014

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

jegonzal commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

SparkQA commented Oct 15, 2014

Uh oh!

jegonzal commented Oct 30, 2014

Uh oh!

rxin Oct 31, 2014

Choose a reason for hiding this comment

Uh oh!

jegonzal commented Nov 1, 2014

Uh oh!

jegonzal commented Nov 1, 2014

Uh oh!

SparkQA commented Nov 1, 2014

Uh oh!

rxin commented Nov 1, 2014

Uh oh!

ankurdave Nov 1, 2014

Choose a reason for hiding this comment

Uh oh!

ankurdave commented Nov 1, 2014

Uh oh!

ankurdave Nov 2, 2014

Choose a reason for hiding this comment

Uh oh!

rxin commented Nov 2, 2014

Uh oh!

balduz commented Apr 20, 2015

Uh oh!

rxin commented Apr 20, 2015

Uh oh!

ankurdave commented Apr 20, 2015

Uh oh!

Uh oh!