The communitySigmaTot of the communityRDD is initialized by zero in the second iteration.

Thanks for making your implementation public.
I compared results of the louvain method of the giraph version with that of the graphx version by using a graph as follows.
https://gist.github.com/ken57/6a090706a17735e8a333b931af451a49

A final q value of a result of giraph version was 0.429, and that of graphx version was 0.402.
Although I tried parameter tuning, a final q value of graphx version was still lower than that of giraph version.

I checked values of communityRDD in the second iteration by adding debug logs as follows.
```
println("totalEdgeWeight: " + totalGraphWeight.value)

// gather community information from each vertex's local neighborhood
// var communityRDD = louvainGraph.mapReduceTriplets(sendCommunityData, mergeCommunityMessages)
communityRDD.collect().foreach(println) // add to debug communityRDD

var activeMessages = communityRDD.count()
```
https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L852

I obtained debug logs as follows, and It is found that the communitySigmaTot of the communityRDD is initialized by zero in the second iteration.
```
(4,Map((2,0) -> 9, (23,0) -> 4, (25,0) -> 2))
(25,Map((27,0) -> 1, (2,0) -> 1, (23,0) -> 6, (4,0) -> 2))
(27,Map((23,0) -> 3, (25,0) -> 1))
(23,Map((27,0) -> 3, (2,0) -> 3, (25,0) -> 6, (4,0) -> 4))
(7,Map((2,0) -> 2, (5,0) -> 2))
(5,Map((7,0) -> 2, (2,0) -> 2))
(2,Map((5,0) -> 2, (4,0) -> 9, (25,0) -> 1, (23,0) -> 3, (7,0) -> 2))
```

So, I added `partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _)` to the compressGraph as follows.
```
    val louvainGraph = compressedGraph.outerJoinVertices(nodeWeights)((vid, data, weightOption) => {
      val weight = weightOption.getOrElse(0L)
      data.communitySigmaTot = weight + data.internalWeight
      data.nodeWeight = weight
      data
    }).partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _).cache()
```
https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L334

The communityRDD is initialized by correct values and I obtained final q value 0.417 by giraphx version.
I think that modifications as above are necessary.

I have two more things to contact you.
This is a minute thing. Very large logs are outputted by println as follows. I would be grateful if you delete it.
https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L102

I tried the PartitionStrategy.CanonicalRandomVertexCut instead of the PartitionStrategy.EdgePartition2D with a large graph (approximately 100,000,000 edges) and I obtained a result by approximately 10 times faster.
It is found that the optimal PartitionStrategy is dependent on an inputted graph structure.
I would be grateful if it be selectable.

thanks.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

The communitySigmaTot of the communityRDD is initialized by zero in the second iteration. #91

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The communitySigmaTot of the communityRDD is initialized by zero in the second iteration. #91

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions