Vasia and I are looking for additional feedback on FLINK-3772. This ticket
[0] and PR [1] provides a set of graph algorithms which compute and store the degree for vertices and edges. Degree annotation is a basic component of many algorithms. For example, PageRank requires the vertex out-degree to evenly divide the vertex score among neighbors. The triangle enumeration algorithm in Gelly requires both source and target degree for each edge as an optimization. The Jaccard and Adamic-Adar similarities require the target degree for each edge. As discussed in the ticket, Graph has methods for outDegrees, inDegrees, and getDegrees. These are simple but difficult or impossible to use efficiently. Stepping back, I believe algorithms should be composable and reused where possible, not only to improve Flink but also to support users. Implementing algorithms as classes rather than Graph methods enables customization and optimization such as used here. One such optimization is CachingGraphAlgorithm which implicitly reuses DataSets. There is overlap between algorithms. From this ticket, annotating edges with source and target degree on an undirected graph can reuse vertex degree. Local clustering coefficient requires a triangle listing and global clustering coefficient requires a triangle count, there is no need to generate the list three times. Further optimizations include the use of mutable types, reusing sort order, avoidance of coGroup, configurable parallelism, and not unnecessarily materializing DataSets. I see all this as the expectation for inclusion in Flink. Greg [0] https://issues.apache.org/jira/browse/FLINK-3772 [1] https://github.com/apache/flink/pull/1901/files |
Hi all,
I asked Greg to start a discussion here about FLINK-3771 and FLINK-3772, to make sure we're all on the same page about future Gelly development. About a year ago we created the Gelly roadmap [1]. Many of these items have been implemented and others were researched and either developed externally or dropped. Graph translators and degree annotations were not in the roadmap, but I personally think that they are both helpful features and I'm in favor of adding them to Gelly. That said, I find this a perfect timing for revisiting and updating our Gelly development plans. We should update the roadmap so that it reflects current state and future plans. This way, the community can have a clear picture of what we are working on and what are upcoming features. This is also very important for new contributors. They can always refer to the roadmap to see what the community is interested in and we can avoid having people spending their time on obsolete or dropped features. We should also go through existing JIRAs and clean them up. Several are assigned to people who have been inactive or might need help. In the following days I will go through the roadmap and then start a new thread where we can discuss features and agree on future plans. In the meantime, it would be great if you give us your view on FLINK-3771 and FLINK-3772. Cheers, -Vasia. [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly On 21 April 2016 at 18:52, Greg Hogan <[hidden email]> wrote: > Vasia and I are looking for additional feedback on FLINK-3772. This ticket > [0] and PR [1] provides a set of graph algorithms which compute and store > the degree for vertices and edges. > > Degree annotation is a basic component of many algorithms. For example, > PageRank requires the vertex out-degree to evenly divide the vertex score > among neighbors. The triangle enumeration algorithm in Gelly requires both > source and target degree for each edge as an optimization. The Jaccard and > Adamic-Adar similarities require the target degree for each edge. > > As discussed in the ticket, Graph has methods for outDegrees, inDegrees, > and getDegrees. These are simple but difficult or impossible to use > efficiently. > > Stepping back, I believe algorithms should be composable and reused where > possible, not only to improve Flink but also to support users. Implementing > algorithms as classes rather than Graph methods enables customization and > optimization such as used here. > > One such optimization is CachingGraphAlgorithm which implicitly reuses > DataSets. There is overlap between algorithms. From this ticket, annotating > edges with source and target degree on an undirected graph can reuse vertex > degree. Local clustering coefficient requires a triangle listing and global > clustering coefficient requires a triangle count, there is no need to > generate the list three times. > > Further optimizations include the use of mutable types, reusing sort order, > avoidance of coGroup, configurable parallelism, and not unnecessarily > materializing DataSets. I see all this as the expectation for inclusion in > Flink. > > Greg > > [0] https://issues.apache.org/jira/browse/FLINK-3772 > [1] https://github.com/apache/flink/pull/1901/files > |
Hi Greg and Vasia,
thanks for starting this discussion. I think it is a good idea to update the Gelly roadmap. As Vasia said, many items on the list have been implemented and other have been more or less dropped. Also new persons who want to improve Gelly have joint the community while others have become idle. So from my point of view it makes perfect sense to gather the input of the community and update the Gelly roadmap. Regarding the specific changes of FLINK-3772, I am usually always in favor of performance improvements. How much would the suggested functionality for degree computation overlaps with the current methods? Could the current methods be replaced by the more efficient implementations or would there be two methods which look very similar and behave almost the same? Best, Fabian 2016-04-22 11:00 GMT+02:00 Vasiliki Kalavri <[hidden email]>: > Hi all, > > I asked Greg to start a discussion here about FLINK-3771 and FLINK-3772, to > make sure we're all on the same page about future Gelly development. > > About a year ago we created the Gelly roadmap [1]. Many of these items have > been implemented and others were researched and either developed externally > or dropped. Graph translators and degree annotations were not in the > roadmap, but I personally think that they are both helpful features and I'm > in favor of adding them to Gelly. > > That said, I find this a perfect timing for revisiting and updating our > Gelly development plans. We should update the roadmap so that it reflects > current state and future plans. This way, the community can have a clear > picture of what we are working on and what are upcoming features. This is > also very important for new contributors. They can always refer to the > roadmap to see what the community is interested in and we can avoid having > people spending their time on obsolete or dropped features. We should also > go through existing JIRAs and clean them up. Several are assigned to people > who have been inactive or might need help. > > In the following days I will go through the roadmap and then start a new > thread where we can discuss features and agree on future plans. In the > meantime, it would be great if you give us your view on FLINK-3771 and > FLINK-3772. > > Cheers, > -Vasia. > > [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly > > On 21 April 2016 at 18:52, Greg Hogan <[hidden email]> wrote: > > > Vasia and I are looking for additional feedback on FLINK-3772. This > ticket > > [0] and PR [1] provides a set of graph algorithms which compute and store > > the degree for vertices and edges. > > > > Degree annotation is a basic component of many algorithms. For example, > > PageRank requires the vertex out-degree to evenly divide the vertex score > > among neighbors. The triangle enumeration algorithm in Gelly requires > both > > source and target degree for each edge as an optimization. The Jaccard > and > > Adamic-Adar similarities require the target degree for each edge. > > > > As discussed in the ticket, Graph has methods for outDegrees, inDegrees, > > and getDegrees. These are simple but difficult or impossible to use > > efficiently. > > > > Stepping back, I believe algorithms should be composable and reused where > > possible, not only to improve Flink but also to support users. > Implementing > > algorithms as classes rather than Graph methods enables customization and > > optimization such as used here. > > > > One such optimization is CachingGraphAlgorithm which implicitly reuses > > DataSets. There is overlap between algorithms. From this ticket, > annotating > > edges with source and target degree on an undirected graph can reuse > vertex > > degree. Local clustering coefficient requires a triangle listing and > global > > clustering coefficient requires a triangle count, there is no need to > > generate the list three times. > > > > Further optimizations include the use of mutable types, reusing sort > order, > > avoidance of coGroup, configurable parallelism, and not unnecessarily > > materializing DataSets. I see all this as the expectation for inclusion > in > > Flink. > > > > Greg > > > > [0] https://issues.apache.org/jira/browse/FLINK-3772 > > [1] https://github.com/apache/flink/pull/1901/files > > > |
Hi Fabian,
I don't know if this has been looked at. There is discussion of BipartiteGraph in FLINK-2254. If Gelly had DirectedGraph and UndirectedGraph then the API could stay lean while methods could be tuned for the specific graph type. I do like having simple APIs such as DataSet and Graph for for interactive use in small scripts or Scala and Python shells. Greg On Mon, Apr 25, 2016 at 8:46 AM, Fabian Hueske <[hidden email]> wrote: > Hi Greg and Vasia, > > thanks for starting this discussion. > I think it is a good idea to update the Gelly roadmap. As Vasia said, many > items on the list have been implemented and other have been more or less > dropped. > Also new persons who want to improve Gelly have joint the community while > others have become idle. > So from my point of view it makes perfect sense to gather the input of the > community and update the Gelly roadmap. > > Regarding the specific changes of FLINK-3772, I am usually always in favor > of performance improvements. > How much would the suggested functionality for degree computation overlaps > with the current methods? > Could the current methods be replaced by the more efficient implementations > or would there be two methods which look very similar and behave almost the > same? > > Best, Fabian > > > > 2016-04-22 11:00 GMT+02:00 Vasiliki Kalavri <[hidden email]>: > > > Hi all, > > > > I asked Greg to start a discussion here about FLINK-3771 and FLINK-3772, > to > > make sure we're all on the same page about future Gelly development. > > > > About a year ago we created the Gelly roadmap [1]. Many of these items > have > > been implemented and others were researched and either developed > externally > > or dropped. Graph translators and degree annotations were not in the > > roadmap, but I personally think that they are both helpful features and > I'm > > in favor of adding them to Gelly. > > > > That said, I find this a perfect timing for revisiting and updating our > > Gelly development plans. We should update the roadmap so that it reflects > > current state and future plans. This way, the community can have a clear > > picture of what we are working on and what are upcoming features. This is > > also very important for new contributors. They can always refer to the > > roadmap to see what the community is interested in and we can avoid > having > > people spending their time on obsolete or dropped features. We should > also > > go through existing JIRAs and clean them up. Several are assigned to > people > > who have been inactive or might need help. > > > > In the following days I will go through the roadmap and then start a new > > thread where we can discuss features and agree on future plans. In the > > meantime, it would be great if you give us your view on FLINK-3771 and > > FLINK-3772. > > > > Cheers, > > -Vasia. > > > > [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly > > > > On 21 April 2016 at 18:52, Greg Hogan <[hidden email]> wrote: > > > > > Vasia and I are looking for additional feedback on FLINK-3772. This > > ticket > > > [0] and PR [1] provides a set of graph algorithms which compute and > store > > > the degree for vertices and edges. > > > > > > Degree annotation is a basic component of many algorithms. For example, > > > PageRank requires the vertex out-degree to evenly divide the vertex > score > > > among neighbors. The triangle enumeration algorithm in Gelly requires > > both > > > source and target degree for each edge as an optimization. The Jaccard > > and > > > Adamic-Adar similarities require the target degree for each edge. > > > > > > As discussed in the ticket, Graph has methods for outDegrees, > inDegrees, > > > and getDegrees. These are simple but difficult or impossible to use > > > efficiently. > > > > > > Stepping back, I believe algorithms should be composable and reused > where > > > possible, not only to improve Flink but also to support users. > > Implementing > > > algorithms as classes rather than Graph methods enables customization > and > > > optimization such as used here. > > > > > > One such optimization is CachingGraphAlgorithm which implicitly reuses > > > DataSets. There is overlap between algorithms. From this ticket, > > annotating > > > edges with source and target degree on an undirected graph can reuse > > vertex > > > degree. Local clustering coefficient requires a triangle listing and > > global > > > clustering coefficient requires a triangle count, there is no need to > > > generate the list three times. > > > > > > Further optimizations include the use of mutable types, reusing sort > > order, > > > avoidance of coGroup, configurable parallelism, and not unnecessarily > > > materializing DataSets. I see all this as the expectation for inclusion > > in > > > Flink. > > > > > > Greg > > > > > > [0] https://issues.apache.org/jira/browse/FLINK-3772 > > > [1] https://github.com/apache/flink/pull/1901/files > > > > > > |
Hey Greg,
when we made the initial design for Gelly, we discussed having separate directed and undirected graph types. We decided to represent undirected graphs by simply adding the opposite-direction edges, like Apache Giraph does. I'm not sure whether that was a good idea, but, back then, it seemed to simplify the design of the API. If you think we can improve the API and the usability of Gelly by introducing separate types, I would love to hear your proposal! Cheers, -Vasia. On 25 April 2016 at 22:49, Greg Hogan <[hidden email]> wrote: > Hi Fabian, > > I don't know if this has been looked at. There is discussion of > BipartiteGraph in FLINK-2254. If Gelly had DirectedGraph and > UndirectedGraph then the API could stay lean while methods could be tuned > for the specific graph type. I do like having simple APIs such as DataSet > and Graph for for interactive use in small scripts or Scala and Python > shells. > > Greg > > On Mon, Apr 25, 2016 at 8:46 AM, Fabian Hueske <[hidden email]> wrote: > > > Hi Greg and Vasia, > > > > thanks for starting this discussion. > > I think it is a good idea to update the Gelly roadmap. As Vasia said, > many > > items on the list have been implemented and other have been more or less > > dropped. > > Also new persons who want to improve Gelly have joint the community while > > others have become idle. > > So from my point of view it makes perfect sense to gather the input of > the > > community and update the Gelly roadmap. > > > > Regarding the specific changes of FLINK-3772, I am usually always in > favor > > of performance improvements. > > How much would the suggested functionality for degree computation > overlaps > > with the current methods? > > Could the current methods be replaced by the more efficient > implementations > > or would there be two methods which look very similar and behave almost > the > > same? > > > > Best, Fabian > > > > > > > > 2016-04-22 11:00 GMT+02:00 Vasiliki Kalavri <[hidden email]>: > > > > > Hi all, > > > > > > I asked Greg to start a discussion here about FLINK-3771 and > FLINK-3772, > > to > > > make sure we're all on the same page about future Gelly development. > > > > > > About a year ago we created the Gelly roadmap [1]. Many of these items > > have > > > been implemented and others were researched and either developed > > externally > > > or dropped. Graph translators and degree annotations were not in the > > > roadmap, but I personally think that they are both helpful features and > > I'm > > > in favor of adding them to Gelly. > > > > > > That said, I find this a perfect timing for revisiting and updating our > > > Gelly development plans. We should update the roadmap so that it > reflects > > > current state and future plans. This way, the community can have a > clear > > > picture of what we are working on and what are upcoming features. This > is > > > also very important for new contributors. They can always refer to the > > > roadmap to see what the community is interested in and we can avoid > > having > > > people spending their time on obsolete or dropped features. We should > > also > > > go through existing JIRAs and clean them up. Several are assigned to > > people > > > who have been inactive or might need help. > > > > > > In the following days I will go through the roadmap and then start a > new > > > thread where we can discuss features and agree on future plans. In the > > > meantime, it would be great if you give us your view on FLINK-3771 and > > > FLINK-3772. > > > > > > Cheers, > > > -Vasia. > > > > > > [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly > > > > > > On 21 April 2016 at 18:52, Greg Hogan <[hidden email]> wrote: > > > > > > > Vasia and I are looking for additional feedback on FLINK-3772. This > > > ticket > > > > [0] and PR [1] provides a set of graph algorithms which compute and > > store > > > > the degree for vertices and edges. > > > > > > > > Degree annotation is a basic component of many algorithms. For > example, > > > > PageRank requires the vertex out-degree to evenly divide the vertex > > score > > > > among neighbors. The triangle enumeration algorithm in Gelly requires > > > both > > > > source and target degree for each edge as an optimization. The > Jaccard > > > and > > > > Adamic-Adar similarities require the target degree for each edge. > > > > > > > > As discussed in the ticket, Graph has methods for outDegrees, > > inDegrees, > > > > and getDegrees. These are simple but difficult or impossible to use > > > > efficiently. > > > > > > > > Stepping back, I believe algorithms should be composable and reused > > where > > > > possible, not only to improve Flink but also to support users. > > > Implementing > > > > algorithms as classes rather than Graph methods enables customization > > and > > > > optimization such as used here. > > > > > > > > One such optimization is CachingGraphAlgorithm which implicitly > reuses > > > > DataSets. There is overlap between algorithms. From this ticket, > > > annotating > > > > edges with source and target degree on an undirected graph can reuse > > > vertex > > > > degree. Local clustering coefficient requires a triangle listing and > > > global > > > > clustering coefficient requires a triangle count, there is no need to > > > > generate the list three times. > > > > > > > > Further optimizations include the use of mutable types, reusing sort > > > order, > > > > avoidance of coGroup, configurable parallelism, and not unnecessarily > > > > materializing DataSets. I see all this as the expectation for > inclusion > > > in > > > > Flink. > > > > > > > > Greg > > > > > > > > [0] https://issues.apache.org/jira/browse/FLINK-3772 > > > > [1] https://github.com/apache/flink/pull/1901/files > > > > > > > > > > |
In reply to this post by Fabian Hueske-2
Hi Fabian,
It should be trivial to implement inDegrees and outDegrees using the new algorithms. We are now doing this for the translate methods in FLINK-3771. The algorithms implement additional customization but the Graph methods are kept simple. ScatterGatherIteration could be simplified as well by using VertexPairDegree. These changes to the Graph API should be considered in the context of FLINK-3277. I am not familiar with the use of getDegrees which simply sums inDegrees and outDegrees but would remain unchanged. Performance was discussed in the Jira as follows. "Each of the "join" methods is implemented as a coGroup as is each of the degree methods, so the trade-off is a coGroup vs a reduce to calculate each degree, and a coGroup vs a hash-join for each annotated degree. As you note the user is required to compose multiple methods and must be aware of implementation details such as using inDegrees or outDegrees instead of getDegrees on an undirected graph. Performance impacts include the use of coGroup, immutable types, and that the degree functions materialize the vertex set." Greg On Mon, Apr 25, 2016 at 8:46 AM, Fabian Hueske <[hidden email]> wrote: > Hi Greg and Vasia, > > thanks for starting this discussion. > I think it is a good idea to update the Gelly roadmap. As Vasia said, many > items on the list have been implemented and other have been more or less > dropped. > Also new persons who want to improve Gelly have joint the community while > others have become idle. > So from my point of view it makes perfect sense to gather the input of the > community and update the Gelly roadmap. > > Regarding the specific changes of FLINK-3772, I am usually always in favor > of performance improvements. > How much would the suggested functionality for degree computation overlaps > with the current methods? > Could the current methods be replaced by the more efficient implementations > or would there be two methods which look very similar and behave almost the > same? > > Best, Fabian > > > > 2016-04-22 11:00 GMT+02:00 Vasiliki Kalavri <[hidden email]>: > > > Hi all, > > > > I asked Greg to start a discussion here about FLINK-3771 and FLINK-3772, > to > > make sure we're all on the same page about future Gelly development. > > > > About a year ago we created the Gelly roadmap [1]. Many of these items > have > > been implemented and others were researched and either developed > externally > > or dropped. Graph translators and degree annotations were not in the > > roadmap, but I personally think that they are both helpful features and > I'm > > in favor of adding them to Gelly. > > > > That said, I find this a perfect timing for revisiting and updating our > > Gelly development plans. We should update the roadmap so that it reflects > > current state and future plans. This way, the community can have a clear > > picture of what we are working on and what are upcoming features. This is > > also very important for new contributors. They can always refer to the > > roadmap to see what the community is interested in and we can avoid > having > > people spending their time on obsolete or dropped features. We should > also > > go through existing JIRAs and clean them up. Several are assigned to > people > > who have been inactive or might need help. > > > > In the following days I will go through the roadmap and then start a new > > thread where we can discuss features and agree on future plans. In the > > meantime, it would be great if you give us your view on FLINK-3771 and > > FLINK-3772. > > > > Cheers, > > -Vasia. > > > > [1]: https://cwiki.apache.org/confluence/display/FLINK/Flink+Gelly > > > > On 21 April 2016 at 18:52, Greg Hogan <[hidden email]> wrote: > > > > > Vasia and I are looking for additional feedback on FLINK-3772. This > > ticket > > > [0] and PR [1] provides a set of graph algorithms which compute and > store > > > the degree for vertices and edges. > > > > > > Degree annotation is a basic component of many algorithms. For example, > > > PageRank requires the vertex out-degree to evenly divide the vertex > score > > > among neighbors. The triangle enumeration algorithm in Gelly requires > > both > > > source and target degree for each edge as an optimization. The Jaccard > > and > > > Adamic-Adar similarities require the target degree for each edge. > > > > > > As discussed in the ticket, Graph has methods for outDegrees, > inDegrees, > > > and getDegrees. These are simple but difficult or impossible to use > > > efficiently. > > > > > > Stepping back, I believe algorithms should be composable and reused > where > > > possible, not only to improve Flink but also to support users. > > Implementing > > > algorithms as classes rather than Graph methods enables customization > and > > > optimization such as used here. > > > > > > One such optimization is CachingGraphAlgorithm which implicitly reuses > > > DataSets. There is overlap between algorithms. From this ticket, > > annotating > > > edges with source and target degree on an undirected graph can reuse > > vertex > > > degree. Local clustering coefficient requires a triangle listing and > > global > > > clustering coefficient requires a triangle count, there is no need to > > > generate the list three times. > > > > > > Further optimizations include the use of mutable types, reusing sort > > order, > > > avoidance of coGroup, configurable parallelism, and not unnecessarily > > > materializing DataSets. I see all this as the expectation for inclusion > > in > > > Flink. > > > > > > Greg > > > > > > [0] https://issues.apache.org/jira/browse/FLINK-3772 > > > [1] https://github.com/apache/flink/pull/1901/files > > > > > > |
Free forum by Nabble | Edit this page |