Hi squirrels,
here's some more Gelly iteration love for you ;) Together with Kien and Nikola (students at KTH, cc-ed), we looked into partition-centric iterations for Gelly. The idea of the partition-centric model is to expose the graph partition structure to the user (not just a single vertex, but all vertices belonging to the same partition at once). It allows for local graph structure exploitation inside a partition and potentially avoids unnecessary communication. The model was first introduced in [1] and we used it as inspiration, but our current implementation is closer to the Spargel model. Kien and Nikola prepared some slides with their design choices and experiments [2]. The connected components results are quite impressive :) There is still room for optimization, but you can find the wip repository in [3] if you're interested. What do you think about exposing this as a Gelly iteration model? The task has been in the Gelly roadmap and there are certainly performance advantages for some algorithms. This implementation can be basically considered as a Spargel optimization (existing VertexUpdateFunctions from Spargel can actually be re-used). On the other hand, the model is quite low-level compared to Spargel and vertex-centric, since users have to deal with inside-partition algorithms themselves (e.g. compare the connected components implementation [4] with the existing Spargel/GSA ones). Thanks! -Vasia. [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf [2]: https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing [3]: https://github.com/vasia/gelly-partition-centric [4]: https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java |
Hi,
this would be a very nice addition! I had a glimpse look into the PC implementation and the two library algorithms and when you get the idea, it is easy to follow what's happening. The benchmark results are also very promising. I got some questions about partitions: 1) I was wondering if the coGroup after the PartitionProcessUDF is aware of the initial partitioning. To be more precise, is it guaranteed, that the (vertex-)partitions do not change during the whole iteration and the RichEdges "stay" on the same worker? 2) If 1) is true, would it work to call partition(udf) on the vertex data set before the PC iteration starts? 3) If 1) is false, would it work to call partition(udf) after the coGroup during the delta iteration to achieve partition stability? What I got in mind is something like: 1 Compute partitions or a set of subgraphs (via CC, LP, Pattern Matching, ...) 2 Partition Vertices by computed Partition/Subgraph ID 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per Partition/Subgraph via PC iteration Again, nice work! Best, Martin On 06.01.2016 23:29, Vasiliki Kalavri wrote: > Hi squirrels, > > here's some more Gelly iteration love for you ;) > > Together with Kien and Nikola (students at KTH, cc-ed), we looked into > partition-centric iterations for Gelly. The idea of the partition-centric > model is to expose the graph partition structure to the user (not just a > single vertex, but all vertices belonging to the same partition at once). > It allows for local graph structure exploitation inside a partition and > potentially avoids unnecessary communication. The model was first > introduced in [1] and we used it as inspiration, but our current > implementation is closer to the Spargel model. > > Kien and Nikola prepared some slides with their design choices and > experiments [2]. The connected components results are quite impressive :) > There is still room for optimization, but you can find the wip repository > in [3] if you're interested. > > What do you think about exposing this as a Gelly iteration model? The task > has been in the Gelly roadmap and there are certainly performance > advantages for some algorithms. This implementation can be basically > considered as a Spargel optimization (existing VertexUpdateFunctions from > Spargel can actually be re-used). On the other hand, the model is quite > low-level compared to Spargel and vertex-centric, since users have to deal > with inside-partition algorithms themselves (e.g. compare the connected > components implementation [4] with the existing Spargel/GSA ones). > > Thanks! > -Vasia. > > [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf > [2]: > https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing > [3]: https://github.com/vasia/gelly-partition-centric > [4]: > https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java > |
Hi Martin,
thanks for your input! On 7 January 2016 at 09:36, Martin Junghanns <[hidden email]> wrote: > Hi, > > this would be a very nice addition! I had a glimpse look into the PC > implementation and the two library algorithms and when you get the > idea, it is easy to follow what's happening. The benchmark results are > also very promising. > > I got some questions about partitions: > > 1) I was wondering if the coGroup after the PartitionProcessUDF is aware > of the initial partitioning. To be more precise, is it guaranteed, that > the (vertex-)partitions do not change during the whole iteration and the > RichEdges "stay" on the same worker? > I think you can find the questions to your answers in the attached generated plan [1]. The edges are cached, meaning that the partitions remain on the same workers across iterations. Records from the first coGroup to the mapPartition are forwarded (no shuffling). The output of the mapPartition operator is a tuple of <vertexID, message>, which is sorted and shuffled to reach the destination vertices. > 2) If 1) is true, would it work to call partition(udf) on the vertex > data set before the PC iteration starts? > Do you mean replacing the default hash partitioning with some custom partition function? I guess that'd would work, yes. > 3) If 1) is false, would it work to call partition(udf) after the > coGroup during the delta iteration to achieve partition stability? > > What I got in mind is something like: > > 1 Compute partitions or a set of subgraphs > (via CC, LP, Pattern Matching, ...) > 2 Partition Vertices by computed Partition/Subgraph ID > 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per > Partition/Subgraph via PC iteration > > Again, nice work! > > Best, > Martin > Cheers, -Vasia. [1]: https://drive.google.com/file/d/0BzQJrI2eGlyYSHZJTHFObmgxTXM/view?usp=sharing > > On 06.01.2016 23:29, Vasiliki Kalavri wrote: > > Hi squirrels, > > > > here's some more Gelly iteration love for you ;) > > > > Together with Kien and Nikola (students at KTH, cc-ed), we looked into > > partition-centric iterations for Gelly. The idea of the partition-centric > > model is to expose the graph partition structure to the user (not just a > > single vertex, but all vertices belonging to the same partition at once). > > It allows for local graph structure exploitation inside a partition and > > potentially avoids unnecessary communication. The model was first > > introduced in [1] and we used it as inspiration, but our current > > implementation is closer to the Spargel model. > > > > Kien and Nikola prepared some slides with their design choices and > > experiments [2]. The connected components results are quite impressive :) > > There is still room for optimization, but you can find the wip repository > > in [3] if you're interested. > > > > What do you think about exposing this as a Gelly iteration model? The > task > > has been in the Gelly roadmap and there are certainly performance > > advantages for some algorithms. This implementation can be basically > > considered as a Spargel optimization (existing VertexUpdateFunctions from > > Spargel can actually be re-used). On the other hand, the model is quite > > low-level compared to Spargel and vertex-centric, since users have to > deal > > with inside-partition algorithms themselves (e.g. compare the connected > > components implementation [4] with the existing Spargel/GSA ones). > > > > Thanks! > > -Vasia. > > > > [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf > > [2]: > > > https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing > > [3]: https://github.com/vasia/gelly-partition-centric > > [4]: > > > https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java > > > |
Hi Martin,
> What I got in mind is something like: > 1 Compute partitions or a set of subgraphs > (via CC, LP, Pattern Matching, ...) > 2 Partition Vertices by computed Partition/Subgraph ID > 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per > Partition/Subgraph via PC iteration I did try this idea at the start of the project, but was unable to store the computed partitions efficiently so I abandoned it. My original approach was storing each partition as an element of a DataSet, but the partition object can get very large and cause big serialization/de-serialization overhead. This approach also breaks if one partition can't be fitted into memory. Best regards, Kien On Thu, Jan 7, 2016 at 5:14 PM, Vasiliki Kalavri <[hidden email]> wrote: > Hi Martin, > > thanks for your input! > > > On 7 January 2016 at 09:36, Martin Junghanns <[hidden email]> > wrote: > > > Hi, > > > > this would be a very nice addition! I had a glimpse look into the PC > > implementation and the two library algorithms and when you get the > > idea, it is easy to follow what's happening. The benchmark results are > > also very promising. > > > > I got some questions about partitions: > > > > 1) I was wondering if the coGroup after the PartitionProcessUDF is aware > > of the initial partitioning. To be more precise, is it guaranteed, that > > the (vertex-)partitions do not change during the whole iteration and the > > RichEdges "stay" on the same worker? > > > > > I think you can find the questions to your answers in the attached > generated plan [1]. > The edges are cached, meaning that the partitions remain on the same > workers across iterations. > Records from the first coGroup to the mapPartition are forwarded (no > shuffling). > The output of the mapPartition operator is a tuple of <vertexID, message>, > which is sorted and shuffled to reach the destination vertices. > > > > > 2) If 1) is true, would it work to call partition(udf) on the vertex > > data set before the PC iteration starts? > > > > Do you mean replacing the default hash partitioning with some custom > partition function? I guess that'd would work, yes. > > > > > 3) If 1) is false, would it work to call partition(udf) after the > > coGroup during the delta iteration to achieve partition stability? > > > > What I got in mind is something like: > > > > 1 Compute partitions or a set of subgraphs > > (via CC, LP, Pattern Matching, ...) > > 2 Partition Vertices by computed Partition/Subgraph ID > > 3 Compute Algorithm X (Page Rank, BC, SSSP, FSM...) per > > Partition/Subgraph via PC iteration > > > > Again, nice work! > > > > Best, > > Martin > > > > > > Cheers, > -Vasia. > > [1]: > > https://drive.google.com/file/d/0BzQJrI2eGlyYSHZJTHFObmgxTXM/view?usp=sharing > > > > > > On 06.01.2016 23:29, Vasiliki Kalavri wrote: > > > Hi squirrels, > > > > > > here's some more Gelly iteration love for you ;) > > > > > > Together with Kien and Nikola (students at KTH, cc-ed), we looked into > > > partition-centric iterations for Gelly. The idea of the > partition-centric > > > model is to expose the graph partition structure to the user (not just > a > > > single vertex, but all vertices belonging to the same partition at > once). > > > It allows for local graph structure exploitation inside a partition and > > > potentially avoids unnecessary communication. The model was first > > > introduced in [1] and we used it as inspiration, but our current > > > implementation is closer to the Spargel model. > > > > > > Kien and Nikola prepared some slides with their design choices and > > > experiments [2]. The connected components results are quite impressive > :) > > > There is still room for optimization, but you can find the wip > repository > > > in [3] if you're interested. > > > > > > What do you think about exposing this as a Gelly iteration model? The > > task > > > has been in the Gelly roadmap and there are certainly performance > > > advantages for some algorithms. This implementation can be basically > > > considered as a Spargel optimization (existing VertexUpdateFunctions > from > > > Spargel can actually be re-used). On the other hand, the model is quite > > > low-level compared to Spargel and vertex-centric, since users have to > > deal > > > with inside-partition algorithms themselves (e.g. compare the connected > > > components implementation [4] with the existing Spargel/GSA ones). > > > > > > Thanks! > > > -Vasia. > > > > > > [1]: http://researcher.ibm.com/researcher/files/us-ytian/giraph++.pdf > > > [2]: > > > > > > https://drive.google.com/file/d/0BzQJrI2eGlyYRmlKTjMyVEk1T0U/view?usp=sharing > > > [3]: https://github.com/vasia/gelly-partition-centric > > > [4]: > > > > > > https://github.com/vasia/gelly-partition-centric/blob/master/src/main/java/org/apache/flink/graph/partition/centric/library/PCConnectedComponents.java > > > > > > |
Free forum by Nabble | Edit this page |