Skip to content

The communitySigmaTot of the communityRDD is initialized by zero in the second iteration. #91

@ken57

Description

@ken57

Thanks for making your implementation public.
I compared results of the louvain method of the giraph version with that of the graphx version by using a graph as follows.
https://gist.github.com/ken57/6a090706a17735e8a333b931af451a49

A final q value of a result of giraph version was 0.429, and that of graphx version was 0.402.
Although I tried parameter tuning, a final q value of graphx version was still lower than that of giraph version.

I checked values of communityRDD in the second iteration by adding debug logs as follows.

println("totalEdgeWeight: " + totalGraphWeight.value)

// gather community information from each vertex's local neighborhood
// var communityRDD = louvainGraph.mapReduceTriplets(sendCommunityData, mergeCommunityMessages)
communityRDD.collect().foreach(println) // add to debug communityRDD

var activeMessages = communityRDD.count()

https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L852

I obtained debug logs as follows, and It is found that the communitySigmaTot of the communityRDD is initialized by zero in the second iteration.

(4,Map((2,0) -> 9, (23,0) -> 4, (25,0) -> 2))
(25,Map((27,0) -> 1, (2,0) -> 1, (23,0) -> 6, (4,0) -> 2))
(27,Map((23,0) -> 3, (25,0) -> 1))
(23,Map((27,0) -> 3, (2,0) -> 3, (25,0) -> 6, (4,0) -> 4))
(7,Map((2,0) -> 2, (5,0) -> 2))
(5,Map((7,0) -> 2, (2,0) -> 2))
(2,Map((5,0) -> 2, (4,0) -> 9, (25,0) -> 1, (23,0) -> 3, (7,0) -> 2))

So, I added partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _) to the compressGraph as follows.

    val louvainGraph = compressedGraph.outerJoinVertices(nodeWeights)((vid, data, weightOption) => {
      val weight = weightOption.getOrElse(0L)
      data.communitySigmaTot = weight + data.internalWeight
      data.nodeWeight = weight
      data
    }).partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _).cache()

https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L334

The communityRDD is initialized by correct values and I obtained final q value 0.417 by giraphx version.
I think that modifications as above are necessary.

I have two more things to contact you.
This is a minute thing. Very large logs are outputted by println as follows. I would be grateful if you delete it.
https://github.com/Sotera/distributed-graph-analytics/blob/master/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/louvain/LouvainCore.scala#L102

I tried the PartitionStrategy.CanonicalRandomVertexCut instead of the PartitionStrategy.EdgePartition2D with a large graph (approximately 100,000,000 edges) and I obtained a result by approximately 10 times faster.
It is found that the optimal PartitionStrategy is dependent on an inputted graph structure.
I would be grateful if it be selectable.

thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions