Hey,
Before actually opening a PR, I wanted to hear your opinion. So, here goes nothing :). I'd like to add the core of my master thesis to Gelly. That is, a series of operators that take a skewed graph, split its high degree vertices into subvertices and redistribute the edges accordingly (thus producing the result faster due to the increased parallelism); then merge the subvertices into the initial vertex. Here's part of my -unrevised- abstract: "The Twitter follower graph, the citation network and general purpose social networks follow a power law degree distribution. These highly skewed graphs raise challenges to current computation models which uniformly process vertices, leading to communication overhead or large execution time stalls. In spite of efforts made to scale up computation on natural graphs (PowerGraph ’s Gather - Apply - Scatter model), many performance problems raise from a subset of vertices that have high degrees. In this thesis, we outline the main processing issues that arise in current graph systems. We then propose a novel node splitting technique meant to differentiate the computation as well as partitioning strategies for low and high degree nodes. The “Node Splitting” abstraction is implemented as a separate set of operators that can be seamlessly combined with current degree-oblivious models such as Pregel and PowerGraph, but also with degree-aware engines such as Pregel+ and PowerLyra. Our solution introduces minimal overhead in the absence of skew and improves the naive implementation of a subset of computational intensive algorithms by a factor of two. These results have been proven on a theoretical and practical level on both real world and synthetic graphs." What I would add: 1) the operators per se in a separate package 2) a slightly modified example (e.g. take Jaccard Similarity and apply these operators on it) 3). tests 4). a separate section in the gelly docs that explains how to modify your code to use this solution. Needless to say I will maintain the code and, if, as we get more users, they deem this sub-API useless, we can always deprecate and delete it :). So, what do you think? Andra |
Hi Andra,
thanks for offering to add this work to Gelly and for starting the discussion! How do you think this would look like from an API point of view? Is it easy to make it transparent to the application? Could you give us a simple example of what you have in mind? Apart from usability, we should also consider documenting when to use this feature, i.e. for which algorithms, what kind degree distribution and what overhead to expect (if any). Cheers, Vasia. On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> wrote: > Hey, > > Before actually opening a PR, I wanted to hear your opinion. So, here goes > nothing :). > > I'd like to add the core of my master thesis to Gelly. That is, a series of > operators that take a skewed graph, split its high degree vertices into > subvertices and redistribute the edges accordingly (thus producing the > result faster due to the increased parallelism); then merge the subvertices > into the initial vertex. > > Here's part of my -unrevised- abstract: > "The Twitter follower graph, the citation network and general purpose > social networks follow a power law degree distribution. These highly skewed > graphs raise challenges to current computation models which uniformly > process vertices, leading to communication overhead or large execution time > stalls. > > In spite of efforts made to scale up computation on natural graphs > (PowerGraph ’s Gather - Apply - Scatter model), many performance problems > raise from a subset of vertices that have high degrees. In this thesis, we > outline the main processing issues that arise in current graph systems. We > then propose a novel node splitting technique meant to differentiate the > computation as well as partitioning strategies for low and high degree > nodes. > > The “Node Splitting” abstraction is implemented as a separate set of > operators that can be seamlessly combined with current degree-oblivious > models such as Pregel and PowerGraph, but also with degree-aware engines > such as Pregel+ and PowerLyra. Our solution introduces minimal overhead in > the absence of skew and improves the naive implementation of a subset of > computational intensive algorithms by a factor of two. > > These results have been proven on a theoretical and practical level on both > real world and synthetic graphs." > > What I would add: > 1) the operators per se in a separate package > 2) a slightly modified example (e.g. take Jaccard Similarity and apply > these operators on it) > 3). tests > 4). a separate section in the gelly docs that explains how to modify your > code to use this solution. > > Needless to say I will maintain the code and, if, as we get more users, > they deem this sub-API useless, we can always deprecate and delete it :). > > So, what do you think? > Andra > |
Hi Vasia,
I shall polish the functions a bit, but this is more or less what I had in mind: GSA Jaccard [what we have in Gelly right now]: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java The same version with node split: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java Yes, sure I can also share some charts with the results. I could say for which data sets (power law) and types of algorithms this can be used. If it's used appropriately, the overhead is 0. I'm open to suggestions! Thanks! Andra On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <[hidden email] > wrote: > Hi Andra, > > thanks for offering to add this work to Gelly and for starting the > discussion! > > How do you think this would look like from an API point of view? Is it easy > to make it transparent to the application? Could you give us a simple > example of what you have in mind? > > Apart from usability, we should also consider documenting when to use this > feature, i.e. for which algorithms, what kind degree distribution and what > overhead to expect (if any). > > Cheers, > Vasia. > On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> wrote: > > > Hey, > > > > Before actually opening a PR, I wanted to hear your opinion. So, here > goes > > nothing :). > > > > I'd like to add the core of my master thesis to Gelly. That is, a series > of > > operators that take a skewed graph, split its high degree vertices into > > subvertices and redistribute the edges accordingly (thus producing the > > result faster due to the increased parallelism); then merge the > subvertices > > into the initial vertex. > > > > Here's part of my -unrevised- abstract: > > "The Twitter follower graph, the citation network and general purpose > > social networks follow a power law degree distribution. These highly > skewed > > graphs raise challenges to current computation models which uniformly > > process vertices, leading to communication overhead or large execution > time > > stalls. > > > > In spite of efforts made to scale up computation on natural graphs > > (PowerGraph ’s Gather - Apply - Scatter model), many performance problems > > raise from a subset of vertices that have high degrees. In this thesis, > we > > outline the main processing issues that arise in current graph systems. > We > > then propose a novel node splitting technique meant to differentiate the > > computation as well as partitioning strategies for low and high degree > > nodes. > > > > The “Node Splitting” abstraction is implemented as a separate set of > > operators that can be seamlessly combined with current degree-oblivious > > models such as Pregel and PowerGraph, but also with degree-aware engines > > such as Pregel+ and PowerLyra. Our solution introduces minimal overhead > in > > the absence of skew and improves the naive implementation of a subset of > > computational intensive algorithms by a factor of two. > > > > These results have been proven on a theoretical and practical level on > both > > real world and synthetic graphs." > > > > What I would add: > > 1) the operators per se in a separate package > > 2) a slightly modified example (e.g. take Jaccard Similarity and apply > > these operators on it) > > 3). tests > > 4). a separate section in the gelly docs that explains how to modify your > > code to use this solution. > > > > Needless to say I will maintain the code and, if, as we get more users, > > they deem this sub-API useless, we can always deprecate and delete it :). > > > > So, what do you think? > > Andra > > > |
Hi Andra and nice to meet you btw :)
It sounds like very fancy way to deal with skew, I like the idea even though I am not a graph analytics expert. Have you ran any experiments or benchmarks to see when this preferable ? Users should be aware when they will get benefits by using it since node splitting doesn’t come with no cost I guess. I am really eager to see how this will evolve, I think it’s good effort. cheers Paris > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> wrote: > > Hi Vasia, > > I shall polish the functions a bit, but this is more or less what I had in > mind: > GSA Jaccard [what we have in Gelly right now]: > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > The same version with node split: > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > Yes, sure I can also share some charts with the results. I could say for > which data sets (power law) and types of algorithms this can be used. If > it's used appropriately, the overhead is 0. > > I'm open to suggestions! > Thanks! > Andra > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <[hidden email] >> wrote: > >> Hi Andra, >> >> thanks for offering to add this work to Gelly and for starting the >> discussion! >> >> How do you think this would look like from an API point of view? Is it easy >> to make it transparent to the application? Could you give us a simple >> example of what you have in mind? >> >> Apart from usability, we should also consider documenting when to use this >> feature, i.e. for which algorithms, what kind degree distribution and what >> overhead to expect (if any). >> >> Cheers, >> Vasia. >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> wrote: >> >>> Hey, >>> >>> Before actually opening a PR, I wanted to hear your opinion. So, here >> goes >>> nothing :). >>> >>> I'd like to add the core of my master thesis to Gelly. That is, a series >> of >>> operators that take a skewed graph, split its high degree vertices into >>> subvertices and redistribute the edges accordingly (thus producing the >>> result faster due to the increased parallelism); then merge the >> subvertices >>> into the initial vertex. >>> >>> Here's part of my -unrevised- abstract: >>> "The Twitter follower graph, the citation network and general purpose >>> social networks follow a power law degree distribution. These highly >> skewed >>> graphs raise challenges to current computation models which uniformly >>> process vertices, leading to communication overhead or large execution >> time >>> stalls. >>> >>> In spite of efforts made to scale up computation on natural graphs >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance problems >>> raise from a subset of vertices that have high degrees. In this thesis, >> we >>> outline the main processing issues that arise in current graph systems. >> We >>> then propose a novel node splitting technique meant to differentiate the >>> computation as well as partitioning strategies for low and high degree >>> nodes. >>> >>> The “Node Splitting” abstraction is implemented as a separate set of >>> operators that can be seamlessly combined with current degree-oblivious >>> models such as Pregel and PowerGraph, but also with degree-aware engines >>> such as Pregel+ and PowerLyra. Our solution introduces minimal overhead >> in >>> the absence of skew and improves the naive implementation of a subset of >>> computational intensive algorithms by a factor of two. >>> >>> These results have been proven on a theoretical and practical level on >> both >>> real world and synthetic graphs." >>> >>> What I would add: >>> 1) the operators per se in a separate package >>> 2) a slightly modified example (e.g. take Jaccard Similarity and apply >>> these operators on it) >>> 3). tests >>> 4). a separate section in the gelly docs that explains how to modify your >>> code to use this solution. >>> >>> Needless to say I will maintain the code and, if, as we get more users, >>> they deem this sub-API useless, we can always deprecate and delete it :). >>> >>> So, what do you think? >>> Andra >>> >> |
Hi Paris,
Nice to virtually meet you too :) Maybe it makes sense to share my freshest chart: https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing This is for the Community Detection algorithm [1] in which you basically find communities by continuously rescoring vertices. In a vertex-centric scenario (for a skewed graph, in this case, the Twitter follower network), since all vertices need to sync after an iteration superstep, they will remain idle until a high degree vertex terminates (computing an aggregate over my followers takes significantly less time than computing one over Obama's followers). What you have on the right is the time elapsed in the regular vertex centric Community Detection (more than two hours). And since we split the node in aggregation trees, in the left there is a chart with the various levels. Level 3 produces the best result. It's a bit cumbersome to put months of work in a mail, I am still polishing the documentation, but feel free to ask questions if this is still unclear. All in all, the technique is best for linear algorithms that work just in Pregel, for a skewed graph. On a high level, it's a generalisation of PowerGraph's mirroring, that can work on any system, and that does not restrict itself to level = 1 (like PowerGraph does) which in the case of out chart there did not bring us too far. Hope that clarified things a bit more! Andra [1] http://arxiv.org/pdf/0808.2633.pdf On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> wrote: > Hi Andra and nice to meet you btw :) > > It sounds like very fancy way to deal with skew, I like the idea even > though I am not a graph analytics expert. > Have you ran any experiments or benchmarks to see when this preferable ? > Users should be aware when they will get benefits by using it since node > splitting doesn’t come with no cost I guess. > I am really eager to see how this will evolve, I think it’s good effort. > > cheers > Paris > > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> wrote: > > > > Hi Vasia, > > > > I shall polish the functions a bit, but this is more or less what I had > in > > mind: > > GSA Jaccard [what we have in Gelly right now]: > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > The same version with node split: > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > > Yes, sure I can also share some charts with the results. I could say for > > which data sets (power law) and types of algorithms this can be used. If > > it's used appropriately, the overhead is 0. > > > > I'm open to suggestions! > > Thanks! > > Andra > > > > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > [hidden email] > >> wrote: > > > >> Hi Andra, > >> > >> thanks for offering to add this work to Gelly and for starting the > >> discussion! > >> > >> How do you think this would look like from an API point of view? Is it > easy > >> to make it transparent to the application? Could you give us a simple > >> example of what you have in mind? > >> > >> Apart from usability, we should also consider documenting when to use > this > >> feature, i.e. for which algorithms, what kind degree distribution and > what > >> overhead to expect (if any). > >> > >> Cheers, > >> Vasia. > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> wrote: > >> > >>> Hey, > >>> > >>> Before actually opening a PR, I wanted to hear your opinion. So, here > >> goes > >>> nothing :). > >>> > >>> I'd like to add the core of my master thesis to Gelly. That is, a > series > >> of > >>> operators that take a skewed graph, split its high degree vertices into > >>> subvertices and redistribute the edges accordingly (thus producing the > >>> result faster due to the increased parallelism); then merge the > >> subvertices > >>> into the initial vertex. > >>> > >>> Here's part of my -unrevised- abstract: > >>> "The Twitter follower graph, the citation network and general purpose > >>> social networks follow a power law degree distribution. These highly > >> skewed > >>> graphs raise challenges to current computation models which uniformly > >>> process vertices, leading to communication overhead or large execution > >> time > >>> stalls. > >>> > >>> In spite of efforts made to scale up computation on natural graphs > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > problems > >>> raise from a subset of vertices that have high degrees. In this thesis, > >> we > >>> outline the main processing issues that arise in current graph systems. > >> We > >>> then propose a novel node splitting technique meant to differentiate > the > >>> computation as well as partitioning strategies for low and high degree > >>> nodes. > >>> > >>> The “Node Splitting” abstraction is implemented as a separate set of > >>> operators that can be seamlessly combined with current degree-oblivious > >>> models such as Pregel and PowerGraph, but also with degree-aware > engines > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal overhead > >> in > >>> the absence of skew and improves the naive implementation of a subset > of > >>> computational intensive algorithms by a factor of two. > >>> > >>> These results have been proven on a theoretical and practical level on > >> both > >>> real world and synthetic graphs." > >>> > >>> What I would add: > >>> 1) the operators per se in a separate package > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and apply > >>> these operators on it) > >>> 3). tests > >>> 4). a separate section in the gelly docs that explains how to modify > your > >>> code to use this solution. > >>> > >>> Needless to say I will maintain the code and, if, as we get more users, > >>> they deem this sub-API useless, we can always deprecate and delete it > :). > >>> > >>> So, what do you think? > >>> Andra > >>> > >> > > |
Dear Andra,
The idea seems pretty nice. I wonder how you decide the threshold to separate the high degree vertices from the low degree vertices. Regards, Samia On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email]> wrote: > Hi Paris, > > Nice to virtually meet you too :) > > Maybe it makes sense to share my freshest chart: > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > This is for the Community Detection algorithm [1] in which you basically > find communities by continuously rescoring vertices. In a vertex-centric > scenario (for a skewed graph, in this case, the Twitter follower network), > since all vertices need to sync after an iteration superstep, they will > remain idle until a high degree vertex terminates (computing an aggregate > over my followers takes significantly less time than computing one over > Obama's followers). > > What you have on the right is the time elapsed in the regular vertex > centric Community Detection (more than two hours). > And since we split the node in aggregation trees, in the left there is a > chart with the various levels. Level 3 produces the best result. > > It's a bit cumbersome to put months of work in a mail, I am still polishing > the documentation, but feel free to ask questions if this is still unclear. > > > All in all, the technique is best for linear algorithms that work just in > Pregel, for a skewed graph. On a high level, it's a generalisation of > PowerGraph's mirroring, that can work on any system, and that does not > restrict itself to level = 1 (like PowerGraph does) which in the case of > out chart there did not bring us too far. > > Hope that clarified things a bit more! > Andra > > [1] http://arxiv.org/pdf/0808.2633.pdf > > > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> wrote: > > > Hi Andra and nice to meet you btw :) > > > > It sounds like very fancy way to deal with skew, I like the idea even > > though I am not a graph analytics expert. > > Have you ran any experiments or benchmarks to see when this preferable ? > > Users should be aware when they will get benefits by using it since node > > splitting doesn’t come with no cost I guess. > > I am really eager to see how this will evolve, I think it’s good effort. > > > > cheers > > Paris > > > > > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> wrote: > > > > > > Hi Vasia, > > > > > > I shall polish the functions a bit, but this is more or less what I had > > in > > > mind: > > > GSA Jaccard [what we have in Gelly right now]: > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > > The same version with node split: > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > > > > Yes, sure I can also share some charts with the results. I could say > for > > > which data sets (power law) and types of algorithms this can be used. > If > > > it's used appropriately, the overhead is 0. > > > > > > I'm open to suggestions! > > > Thanks! > > > Andra > > > > > > > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > [hidden email] > > >> wrote: > > > > > >> Hi Andra, > > >> > > >> thanks for offering to add this work to Gelly and for starting the > > >> discussion! > > >> > > >> How do you think this would look like from an API point of view? Is it > > easy > > >> to make it transparent to the application? Could you give us a simple > > >> example of what you have in mind? > > >> > > >> Apart from usability, we should also consider documenting when to use > > this > > >> feature, i.e. for which algorithms, what kind degree distribution and > > what > > >> overhead to expect (if any). > > >> > > >> Cheers, > > >> Vasia. > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> > wrote: > > >> > > >>> Hey, > > >>> > > >>> Before actually opening a PR, I wanted to hear your opinion. So, here > > >> goes > > >>> nothing :). > > >>> > > >>> I'd like to add the core of my master thesis to Gelly. That is, a > > series > > >> of > > >>> operators that take a skewed graph, split its high degree vertices > into > > >>> subvertices and redistribute the edges accordingly (thus producing > the > > >>> result faster due to the increased parallelism); then merge the > > >> subvertices > > >>> into the initial vertex. > > >>> > > >>> Here's part of my -unrevised- abstract: > > >>> "The Twitter follower graph, the citation network and general purpose > > >>> social networks follow a power law degree distribution. These highly > > >> skewed > > >>> graphs raise challenges to current computation models which uniformly > > >>> process vertices, leading to communication overhead or large > execution > > >> time > > >>> stalls. > > >>> > > >>> In spite of efforts made to scale up computation on natural graphs > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > > problems > > >>> raise from a subset of vertices that have high degrees. In this > thesis, > > >> we > > >>> outline the main processing issues that arise in current graph > systems. > > >> We > > >>> then propose a novel node splitting technique meant to differentiate > > the > > >>> computation as well as partitioning strategies for low and high > degree > > >>> nodes. > > >>> > > >>> The “Node Splitting” abstraction is implemented as a separate set of > > >>> operators that can be seamlessly combined with current > degree-oblivious > > >>> models such as Pregel and PowerGraph, but also with degree-aware > > engines > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > overhead > > >> in > > >>> the absence of skew and improves the naive implementation of a subset > > of > > >>> computational intensive algorithms by a factor of two. > > >>> > > >>> These results have been proven on a theoretical and practical level > on > > >> both > > >>> real world and synthetic graphs." > > >>> > > >>> What I would add: > > >>> 1) the operators per se in a separate package > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and > apply > > >>> these operators on it) > > >>> 3). tests > > >>> 4). a separate section in the gelly docs that explains how to modify > > your > > >>> code to use this solution. > > >>> > > >>> Needless to say I will maintain the code and, if, as we get more > users, > > >>> they deem this sub-API useless, we can always deprecate and delete it > > :). > > >>> > > >>> So, what do you think? > > >>> Andra > > >>> > > >> > > > > > |
Hi Samia,
A good method to statistically determine skewed vertices was beyond the purpose of my thesis. Unfortunately, the statistical methods that fit a power law distribution don't do a good job. So what I do is that I plot the degree distribution and then visually determine the threshold. That is also how other state of the art systems, such as PowerLyra, fix the issue. A way to automate this would be a nice future addition :). On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <[hidden email]> wrote: > Dear Andra, > > The idea seems pretty nice. > > I wonder how you decide the threshold to separate the high degree vertices > from the low degree vertices. > > Regards, > Samia > > On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email]> > wrote: > > > Hi Paris, > > > > Nice to virtually meet you too :) > > > > Maybe it makes sense to share my freshest chart: > > > > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > > > This is for the Community Detection algorithm [1] in which you basically > > find communities by continuously rescoring vertices. In a vertex-centric > > scenario (for a skewed graph, in this case, the Twitter follower > network), > > since all vertices need to sync after an iteration superstep, they will > > remain idle until a high degree vertex terminates (computing an aggregate > > over my followers takes significantly less time than computing one over > > Obama's followers). > > > > What you have on the right is the time elapsed in the regular vertex > > centric Community Detection (more than two hours). > > And since we split the node in aggregation trees, in the left there is a > > chart with the various levels. Level 3 produces the best result. > > > > It's a bit cumbersome to put months of work in a mail, I am still > polishing > > the documentation, but feel free to ask questions if this is still > unclear. > > > > > > All in all, the technique is best for linear algorithms that work just in > > Pregel, for a skewed graph. On a high level, it's a generalisation of > > PowerGraph's mirroring, that can work on any system, and that does not > > restrict itself to level = 1 (like PowerGraph does) which in the case of > > out chart there did not bring us too far. > > > > Hope that clarified things a bit more! > > Andra > > > > [1] http://arxiv.org/pdf/0808.2633.pdf > > > > > > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> wrote: > > > > > Hi Andra and nice to meet you btw :) > > > > > > It sounds like very fancy way to deal with skew, I like the idea even > > > though I am not a graph analytics expert. > > > Have you ran any experiments or benchmarks to see when this preferable > ? > > > Users should be aware when they will get benefits by using it since > node > > > splitting doesn’t come with no cost I guess. > > > I am really eager to see how this will evolve, I think it’s good > effort. > > > > > > cheers > > > Paris > > > > > > > > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> wrote: > > > > > > > > Hi Vasia, > > > > > > > > I shall polish the functions a bit, but this is more or less what I > had > > > in > > > > mind: > > > > GSA Jaccard [what we have in Gelly right now]: > > > > > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > > > The same version with node split: > > > > > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > > > > > > Yes, sure I can also share some charts with the results. I could say > > for > > > > which data sets (power law) and types of algorithms this can be used. > > If > > > > it's used appropriately, the overhead is 0. > > > > > > > > I'm open to suggestions! > > > > Thanks! > > > > Andra > > > > > > > > > > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > > [hidden email] > > > >> wrote: > > > > > > > >> Hi Andra, > > > >> > > > >> thanks for offering to add this work to Gelly and for starting the > > > >> discussion! > > > >> > > > >> How do you think this would look like from an API point of view? Is > it > > > easy > > > >> to make it transparent to the application? Could you give us a > simple > > > >> example of what you have in mind? > > > >> > > > >> Apart from usability, we should also consider documenting when to > use > > > this > > > >> feature, i.e. for which algorithms, what kind degree distribution > and > > > what > > > >> overhead to expect (if any). > > > >> > > > >> Cheers, > > > >> Vasia. > > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> > > wrote: > > > >> > > > >>> Hey, > > > >>> > > > >>> Before actually opening a PR, I wanted to hear your opinion. So, > here > > > >> goes > > > >>> nothing :). > > > >>> > > > >>> I'd like to add the core of my master thesis to Gelly. That is, a > > > series > > > >> of > > > >>> operators that take a skewed graph, split its high degree vertices > > into > > > >>> subvertices and redistribute the edges accordingly (thus producing > > the > > > >>> result faster due to the increased parallelism); then merge the > > > >> subvertices > > > >>> into the initial vertex. > > > >>> > > > >>> Here's part of my -unrevised- abstract: > > > >>> "The Twitter follower graph, the citation network and general > purpose > > > >>> social networks follow a power law degree distribution. These > highly > > > >> skewed > > > >>> graphs raise challenges to current computation models which > uniformly > > > >>> process vertices, leading to communication overhead or large > > execution > > > >> time > > > >>> stalls. > > > >>> > > > >>> In spite of efforts made to scale up computation on natural graphs > > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > > > problems > > > >>> raise from a subset of vertices that have high degrees. In this > > thesis, > > > >> we > > > >>> outline the main processing issues that arise in current graph > > systems. > > > >> We > > > >>> then propose a novel node splitting technique meant to > differentiate > > > the > > > >>> computation as well as partitioning strategies for low and high > > degree > > > >>> nodes. > > > >>> > > > >>> The “Node Splitting” abstraction is implemented as a separate set > of > > > >>> operators that can be seamlessly combined with current > > degree-oblivious > > > >>> models such as Pregel and PowerGraph, but also with degree-aware > > > engines > > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > > overhead > > > >> in > > > >>> the absence of skew and improves the naive implementation of a > subset > > > of > > > >>> computational intensive algorithms by a factor of two. > > > >>> > > > >>> These results have been proven on a theoretical and practical level > > on > > > >> both > > > >>> real world and synthetic graphs." > > > >>> > > > >>> What I would add: > > > >>> 1) the operators per se in a separate package > > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and > > apply > > > >>> these operators on it) > > > >>> 3). tests > > > >>> 4). a separate section in the gelly docs that explains how to > modify > > > your > > > >>> code to use this solution. > > > >>> > > > >>> Needless to say I will maintain the code and, if, as we get more > > users, > > > >>> they deem this sub-API useless, we can always deprecate and delete > it > > > :). > > > >>> > > > >>> So, what do you think? > > > >>> Andra > > > >>> > > > >> > > > > > > > > > |
I would love to get some feedback from the guys at data Artisans about this
one. So far, the comments originated and spread in the Stockholm area :) On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <[hidden email]> wrote: > Hi Samia, > > A good method to statistically determine skewed vertices was beyond the > purpose of my thesis. Unfortunately, the statistical methods that fit a > power law distribution don't do a good job. So what I do is that I plot the > degree distribution and then visually determine the threshold. That is also > how other state of the art systems, such as PowerLyra, fix the issue. > > A way to automate this would be a nice future addition :). > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <[hidden email]> wrote: > >> Dear Andra, >> >> The idea seems pretty nice. >> >> I wonder how you decide the threshold to separate the high degree vertices >> from the low degree vertices. >> >> Regards, >> Samia >> >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email]> >> wrote: >> >> > Hi Paris, >> > >> > Nice to virtually meet you too :) >> > >> > Maybe it makes sense to share my freshest chart: >> > >> > >> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing >> > >> > This is for the Community Detection algorithm [1] in which you basically >> > find communities by continuously rescoring vertices. In a vertex-centric >> > scenario (for a skewed graph, in this case, the Twitter follower >> network), >> > since all vertices need to sync after an iteration superstep, they will >> > remain idle until a high degree vertex terminates (computing an >> aggregate >> > over my followers takes significantly less time than computing one over >> > Obama's followers). >> > >> > What you have on the right is the time elapsed in the regular vertex >> > centric Community Detection (more than two hours). >> > And since we split the node in aggregation trees, in the left there is a >> > chart with the various levels. Level 3 produces the best result. >> > >> > It's a bit cumbersome to put months of work in a mail, I am still >> polishing >> > the documentation, but feel free to ask questions if this is still >> unclear. >> > >> > >> > All in all, the technique is best for linear algorithms that work just >> in >> > Pregel, for a skewed graph. On a high level, it's a generalisation of >> > PowerGraph's mirroring, that can work on any system, and that does not >> > restrict itself to level = 1 (like PowerGraph does) which in the case of >> > out chart there did not bring us too far. >> > >> > Hope that clarified things a bit more! >> > Andra >> > >> > [1] http://arxiv.org/pdf/0808.2633.pdf >> > >> > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> wrote: >> > >> > > Hi Andra and nice to meet you btw :) >> > > >> > > It sounds like very fancy way to deal with skew, I like the idea even >> > > though I am not a graph analytics expert. >> > > Have you ran any experiments or benchmarks to see when this >> preferable ? >> > > Users should be aware when they will get benefits by using it since >> node >> > > splitting doesn’t come with no cost I guess. >> > > I am really eager to see how this will evolve, I think it’s good >> effort. >> > > >> > > cheers >> > > Paris >> > > >> > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> >> wrote: >> > > > >> > > > Hi Vasia, >> > > > >> > > > I shall polish the functions a bit, but this is more or less what I >> had >> > > in >> > > > mind: >> > > > GSA Jaccard [what we have in Gelly right now]: >> > > > >> > > >> > >> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java >> > > > The same version with node split: >> > > > >> > > >> > >> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java >> > > > >> > > > Yes, sure I can also share some charts with the results. I could say >> > for >> > > > which data sets (power law) and types of algorithms this can be >> used. >> > If >> > > > it's used appropriately, the overhead is 0. >> > > > >> > > > I'm open to suggestions! >> > > > Thanks! >> > > > Andra >> > > > >> > > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < >> > > [hidden email] >> > > >> wrote: >> > > > >> > > >> Hi Andra, >> > > >> >> > > >> thanks for offering to add this work to Gelly and for starting the >> > > >> discussion! >> > > >> >> > > >> How do you think this would look like from an API point of view? >> Is it >> > > easy >> > > >> to make it transparent to the application? Could you give us a >> simple >> > > >> example of what you have in mind? >> > > >> >> > > >> Apart from usability, we should also consider documenting when to >> use >> > > this >> > > >> feature, i.e. for which algorithms, what kind degree distribution >> and >> > > what >> > > >> overhead to expect (if any). >> > > >> >> > > >> Cheers, >> > > >> Vasia. >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> >> > wrote: >> > > >> >> > > >>> Hey, >> > > >>> >> > > >>> Before actually opening a PR, I wanted to hear your opinion. So, >> here >> > > >> goes >> > > >>> nothing :). >> > > >>> >> > > >>> I'd like to add the core of my master thesis to Gelly. That is, a >> > > series >> > > >> of >> > > >>> operators that take a skewed graph, split its high degree vertices >> > into >> > > >>> subvertices and redistribute the edges accordingly (thus producing >> > the >> > > >>> result faster due to the increased parallelism); then merge the >> > > >> subvertices >> > > >>> into the initial vertex. >> > > >>> >> > > >>> Here's part of my -unrevised- abstract: >> > > >>> "The Twitter follower graph, the citation network and general >> purpose >> > > >>> social networks follow a power law degree distribution. These >> highly >> > > >> skewed >> > > >>> graphs raise challenges to current computation models which >> uniformly >> > > >>> process vertices, leading to communication overhead or large >> > execution >> > > >> time >> > > >>> stalls. >> > > >>> >> > > >>> In spite of efforts made to scale up computation on natural graphs >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance >> > > problems >> > > >>> raise from a subset of vertices that have high degrees. In this >> > thesis, >> > > >> we >> > > >>> outline the main processing issues that arise in current graph >> > systems. >> > > >> We >> > > >>> then propose a novel node splitting technique meant to >> differentiate >> > > the >> > > >>> computation as well as partitioning strategies for low and high >> > degree >> > > >>> nodes. >> > > >>> >> > > >>> The “Node Splitting” abstraction is implemented as a separate set >> of >> > > >>> operators that can be seamlessly combined with current >> > degree-oblivious >> > > >>> models such as Pregel and PowerGraph, but also with degree-aware >> > > engines >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal >> > overhead >> > > >> in >> > > >>> the absence of skew and improves the naive implementation of a >> subset >> > > of >> > > >>> computational intensive algorithms by a factor of two. >> > > >>> >> > > >>> These results have been proven on a theoretical and practical >> level >> > on >> > > >> both >> > > >>> real world and synthetic graphs." >> > > >>> >> > > >>> What I would add: >> > > >>> 1) the operators per se in a separate package >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and >> > apply >> > > >>> these operators on it) >> > > >>> 3). tests >> > > >>> 4). a separate section in the gelly docs that explains how to >> modify >> > > your >> > > >>> code to use this solution. >> > > >>> >> > > >>> Needless to say I will maintain the code and, if, as we get more >> > users, >> > > >>> they deem this sub-API useless, we can always deprecate and >> delete it >> > > :). >> > > >>> >> > > >>> So, what do you think? >> > > >>> Andra >> > > >>> >> > > >> >> > > >> > > >> > >> > > |
I think this is a decision to be made by the people involved in the Gelly
library. I'm not very familiar with graph processing libraries. Thus, it is hard for me to asses the value of this contribution. However, you outlined pretty well that for highly skewed graphs your technique results in a much better performance. So even if your addition might not be relevant to all users of the Gelly API, there are users that could demand such processing techniques. Since you want to develop, maintain, and document the code, I think there is absolutely nothing to say against opening a pull request :) Cheers, Max On Wed, Aug 12, 2015 at 12:16 PM, Andra Lungu <[hidden email]> wrote: > I would love to get some feedback from the guys at data Artisans about this > one. > So far, the comments originated and spread in the Stockholm area :) > > On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <[hidden email]> > wrote: > > > Hi Samia, > > > > A good method to statistically determine skewed vertices was beyond the > > purpose of my thesis. Unfortunately, the statistical methods that fit a > > power law distribution don't do a good job. So what I do is that I plot > the > > degree distribution and then visually determine the threshold. That is > also > > how other state of the art systems, such as PowerLyra, fix the issue. > > > > A way to automate this would be a nice future addition :). > > > > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <[hidden email]> > wrote: > > > >> Dear Andra, > >> > >> The idea seems pretty nice. > >> > >> I wonder how you decide the threshold to separate the high degree > vertices > >> from the low degree vertices. > >> > >> Regards, > >> Samia > >> > >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email]> > >> wrote: > >> > >> > Hi Paris, > >> > > >> > Nice to virtually meet you too :) > >> > > >> > Maybe it makes sense to share my freshest chart: > >> > > >> > > >> > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > >> > > >> > This is for the Community Detection algorithm [1] in which you > basically > >> > find communities by continuously rescoring vertices. In a > vertex-centric > >> > scenario (for a skewed graph, in this case, the Twitter follower > >> network), > >> > since all vertices need to sync after an iteration superstep, they > will > >> > remain idle until a high degree vertex terminates (computing an > >> aggregate > >> > over my followers takes significantly less time than computing one > over > >> > Obama's followers). > >> > > >> > What you have on the right is the time elapsed in the regular vertex > >> > centric Community Detection (more than two hours). > >> > And since we split the node in aggregation trees, in the left there > is a > >> > chart with the various levels. Level 3 produces the best result. > >> > > >> > It's a bit cumbersome to put months of work in a mail, I am still > >> polishing > >> > the documentation, but feel free to ask questions if this is still > >> unclear. > >> > > >> > > >> > All in all, the technique is best for linear algorithms that work just > >> in > >> > Pregel, for a skewed graph. On a high level, it's a generalisation of > >> > PowerGraph's mirroring, that can work on any system, and that does not > >> > restrict itself to level = 1 (like PowerGraph does) which in the case > of > >> > out chart there did not bring us too far. > >> > > >> > Hope that clarified things a bit more! > >> > Andra > >> > > >> > [1] http://arxiv.org/pdf/0808.2633.pdf > >> > > >> > > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> wrote: > >> > > >> > > Hi Andra and nice to meet you btw :) > >> > > > >> > > It sounds like very fancy way to deal with skew, I like the idea > even > >> > > though I am not a graph analytics expert. > >> > > Have you ran any experiments or benchmarks to see when this > >> preferable ? > >> > > Users should be aware when they will get benefits by using it since > >> node > >> > > splitting doesn’t come with no cost I guess. > >> > > I am really eager to see how this will evolve, I think it’s good > >> effort. > >> > > > >> > > cheers > >> > > Paris > >> > > > >> > > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> > >> wrote: > >> > > > > >> > > > Hi Vasia, > >> > > > > >> > > > I shall polish the functions a bit, but this is more or less what > I > >> had > >> > > in > >> > > > mind: > >> > > > GSA Jaccard [what we have in Gelly right now]: > >> > > > > >> > > > >> > > >> > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > >> > > > The same version with node split: > >> > > > > >> > > > >> > > >> > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > >> > > > > >> > > > Yes, sure I can also share some charts with the results. I could > say > >> > for > >> > > > which data sets (power law) and types of algorithms this can be > >> used. > >> > If > >> > > > it's used appropriately, the overhead is 0. > >> > > > > >> > > > I'm open to suggestions! > >> > > > Thanks! > >> > > > Andra > >> > > > > >> > > > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > >> > > [hidden email] > >> > > >> wrote: > >> > > > > >> > > >> Hi Andra, > >> > > >> > >> > > >> thanks for offering to add this work to Gelly and for starting > the > >> > > >> discussion! > >> > > >> > >> > > >> How do you think this would look like from an API point of view? > >> Is it > >> > > easy > >> > > >> to make it transparent to the application? Could you give us a > >> simple > >> > > >> example of what you have in mind? > >> > > >> > >> > > >> Apart from usability, we should also consider documenting when to > >> use > >> > > this > >> > > >> feature, i.e. for which algorithms, what kind degree distribution > >> and > >> > > what > >> > > >> overhead to expect (if any). > >> > > >> > >> > > >> Cheers, > >> > > >> Vasia. > >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email]> > >> > wrote: > >> > > >> > >> > > >>> Hey, > >> > > >>> > >> > > >>> Before actually opening a PR, I wanted to hear your opinion. So, > >> here > >> > > >> goes > >> > > >>> nothing :). > >> > > >>> > >> > > >>> I'd like to add the core of my master thesis to Gelly. That is, > a > >> > > series > >> > > >> of > >> > > >>> operators that take a skewed graph, split its high degree > vertices > >> > into > >> > > >>> subvertices and redistribute the edges accordingly (thus > producing > >> > the > >> > > >>> result faster due to the increased parallelism); then merge the > >> > > >> subvertices > >> > > >>> into the initial vertex. > >> > > >>> > >> > > >>> Here's part of my -unrevised- abstract: > >> > > >>> "The Twitter follower graph, the citation network and general > >> purpose > >> > > >>> social networks follow a power law degree distribution. These > >> highly > >> > > >> skewed > >> > > >>> graphs raise challenges to current computation models which > >> uniformly > >> > > >>> process vertices, leading to communication overhead or large > >> > execution > >> > > >> time > >> > > >>> stalls. > >> > > >>> > >> > > >>> In spite of efforts made to scale up computation on natural > graphs > >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > >> > > problems > >> > > >>> raise from a subset of vertices that have high degrees. In this > >> > thesis, > >> > > >> we > >> > > >>> outline the main processing issues that arise in current graph > >> > systems. > >> > > >> We > >> > > >>> then propose a novel node splitting technique meant to > >> differentiate > >> > > the > >> > > >>> computation as well as partitioning strategies for low and high > >> > degree > >> > > >>> nodes. > >> > > >>> > >> > > >>> The “Node Splitting” abstraction is implemented as a separate > set > >> of > >> > > >>> operators that can be seamlessly combined with current > >> > degree-oblivious > >> > > >>> models such as Pregel and PowerGraph, but also with degree-aware > >> > > engines > >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > >> > overhead > >> > > >> in > >> > > >>> the absence of skew and improves the naive implementation of a > >> subset > >> > > of > >> > > >>> computational intensive algorithms by a factor of two. > >> > > >>> > >> > > >>> These results have been proven on a theoretical and practical > >> level > >> > on > >> > > >> both > >> > > >>> real world and synthetic graphs." > >> > > >>> > >> > > >>> What I would add: > >> > > >>> 1) the operators per se in a separate package > >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and > >> > apply > >> > > >>> these operators on it) > >> > > >>> 3). tests > >> > > >>> 4). a separate section in the gelly docs that explains how to > >> modify > >> > > your > >> > > >>> code to use this solution. > >> > > >>> > >> > > >>> Needless to say I will maintain the code and, if, as we get more > >> > users, > >> > > >>> they deem this sub-API useless, we can always deprecate and > >> delete it > >> > > :). > >> > > >>> > >> > > >>> So, what do you think? > >> > > >>> Andra > >> > > >>> > >> > > >> > >> > > > >> > > > >> > > >> > > > > > |
Same here as for Max, I am not familiar enough any more to make really good
comments. Some generic comments, though: - Check whether you really need a technique. Techniques that improve corner cases, but make the code much more complex and make the behavior of algorithms less robust are often better to leave out. - It is usually easier to something as an optional extension, rather then adding it to a core operation. Easier to add, easier to remove if one finds that it does not work out as anticipated in hind sight - If it works as a Gelly extension, you can always put it on your own github repository. The Gelly docs could list a set of known algorithms/operations on top of Gelly. If the interest in the extension is high, you can still reconsider to move it into the core codebase. Stephan On Wed, Aug 12, 2015 at 2:35 PM, Maximilian Michels <[hidden email]> wrote: > I think this is a decision to be made by the people involved in the Gelly > library. I'm not very familiar with graph processing libraries. Thus, it is > hard for me to asses the value of this contribution. > > However, you outlined pretty well that for highly skewed graphs your > technique results in a much better performance. So even if your addition > might not be relevant to all users of the Gelly API, there are users that > could demand such processing techniques. Since you want to develop, > maintain, and document the code, I think there is absolutely nothing to say > against opening a pull request :) > > Cheers, > Max > > > > On Wed, Aug 12, 2015 at 12:16 PM, Andra Lungu <[hidden email]> > wrote: > > > I would love to get some feedback from the guys at data Artisans about > this > > one. > > So far, the comments originated and spread in the Stockholm area :) > > > > On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <[hidden email]> > > wrote: > > > > > Hi Samia, > > > > > > A good method to statistically determine skewed vertices was beyond the > > > purpose of my thesis. Unfortunately, the statistical methods that fit a > > > power law distribution don't do a good job. So what I do is that I plot > > the > > > degree distribution and then visually determine the threshold. That is > > also > > > how other state of the art systems, such as PowerLyra, fix the issue. > > > > > > A way to automate this would be a nice future addition :). > > > > > > > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <[hidden email]> > > wrote: > > > > > >> Dear Andra, > > >> > > >> The idea seems pretty nice. > > >> > > >> I wonder how you decide the threshold to separate the high degree > > vertices > > >> from the low degree vertices. > > >> > > >> Regards, > > >> Samia > > >> > > >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email]> > > >> wrote: > > >> > > >> > Hi Paris, > > >> > > > >> > Nice to virtually meet you too :) > > >> > > > >> > Maybe it makes sense to share my freshest chart: > > >> > > > >> > > > >> > > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > >> > > > >> > This is for the Community Detection algorithm [1] in which you > > basically > > >> > find communities by continuously rescoring vertices. In a > > vertex-centric > > >> > scenario (for a skewed graph, in this case, the Twitter follower > > >> network), > > >> > since all vertices need to sync after an iteration superstep, they > > will > > >> > remain idle until a high degree vertex terminates (computing an > > >> aggregate > > >> > over my followers takes significantly less time than computing one > > over > > >> > Obama's followers). > > >> > > > >> > What you have on the right is the time elapsed in the regular vertex > > >> > centric Community Detection (more than two hours). > > >> > And since we split the node in aggregation trees, in the left there > > is a > > >> > chart with the various levels. Level 3 produces the best result. > > >> > > > >> > It's a bit cumbersome to put months of work in a mail, I am still > > >> polishing > > >> > the documentation, but feel free to ask questions if this is still > > >> unclear. > > >> > > > >> > > > >> > All in all, the technique is best for linear algorithms that work > just > > >> in > > >> > Pregel, for a skewed graph. On a high level, it's a generalisation > of > > >> > PowerGraph's mirroring, that can work on any system, and that does > not > > >> > restrict itself to level = 1 (like PowerGraph does) which in the > case > > of > > >> > out chart there did not bring us too far. > > >> > > > >> > Hope that clarified things a bit more! > > >> > Andra > > >> > > > >> > [1] http://arxiv.org/pdf/0808.2633.pdf > > >> > > > >> > > > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> > wrote: > > >> > > > >> > > Hi Andra and nice to meet you btw :) > > >> > > > > >> > > It sounds like very fancy way to deal with skew, I like the idea > > even > > >> > > though I am not a graph analytics expert. > > >> > > Have you ran any experiments or benchmarks to see when this > > >> preferable ? > > >> > > Users should be aware when they will get benefits by using it > since > > >> node > > >> > > splitting doesn’t come with no cost I guess. > > >> > > I am really eager to see how this will evolve, I think it’s good > > >> effort. > > >> > > > > >> > > cheers > > >> > > Paris > > >> > > > > >> > > > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> > > >> wrote: > > >> > > > > > >> > > > Hi Vasia, > > >> > > > > > >> > > > I shall polish the functions a bit, but this is more or less > what > > I > > >> had > > >> > > in > > >> > > > mind: > > >> > > > GSA Jaccard [what we have in Gelly right now]: > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > >> > > > The same version with node split: > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > >> > > > > > >> > > > Yes, sure I can also share some charts with the results. I could > > say > > >> > for > > >> > > > which data sets (power law) and types of algorithms this can be > > >> used. > > >> > If > > >> > > > it's used appropriately, the overhead is 0. > > >> > > > > > >> > > > I'm open to suggestions! > > >> > > > Thanks! > > >> > > > Andra > > >> > > > > > >> > > > > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > >> > > [hidden email] > > >> > > >> wrote: > > >> > > > > > >> > > >> Hi Andra, > > >> > > >> > > >> > > >> thanks for offering to add this work to Gelly and for starting > > the > > >> > > >> discussion! > > >> > > >> > > >> > > >> How do you think this would look like from an API point of > view? > > >> Is it > > >> > > easy > > >> > > >> to make it transparent to the application? Could you give us a > > >> simple > > >> > > >> example of what you have in mind? > > >> > > >> > > >> > > >> Apart from usability, we should also consider documenting when > to > > >> use > > >> > > this > > >> > > >> feature, i.e. for which algorithms, what kind degree > distribution > > >> and > > >> > > what > > >> > > >> overhead to expect (if any). > > >> > > >> > > >> > > >> Cheers, > > >> > > >> Vasia. > > >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[hidden email] > > > > >> > wrote: > > >> > > >> > > >> > > >>> Hey, > > >> > > >>> > > >> > > >>> Before actually opening a PR, I wanted to hear your opinion. > So, > > >> here > > >> > > >> goes > > >> > > >>> nothing :). > > >> > > >>> > > >> > > >>> I'd like to add the core of my master thesis to Gelly. That > is, > > a > > >> > > series > > >> > > >> of > > >> > > >>> operators that take a skewed graph, split its high degree > > vertices > > >> > into > > >> > > >>> subvertices and redistribute the edges accordingly (thus > > producing > > >> > the > > >> > > >>> result faster due to the increased parallelism); then merge > the > > >> > > >> subvertices > > >> > > >>> into the initial vertex. > > >> > > >>> > > >> > > >>> Here's part of my -unrevised- abstract: > > >> > > >>> "The Twitter follower graph, the citation network and general > > >> purpose > > >> > > >>> social networks follow a power law degree distribution. These > > >> highly > > >> > > >> skewed > > >> > > >>> graphs raise challenges to current computation models which > > >> uniformly > > >> > > >>> process vertices, leading to communication overhead or large > > >> > execution > > >> > > >> time > > >> > > >>> stalls. > > >> > > >>> > > >> > > >>> In spite of efforts made to scale up computation on natural > > graphs > > >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many > performance > > >> > > problems > > >> > > >>> raise from a subset of vertices that have high degrees. In > this > > >> > thesis, > > >> > > >> we > > >> > > >>> outline the main processing issues that arise in current graph > > >> > systems. > > >> > > >> We > > >> > > >>> then propose a novel node splitting technique meant to > > >> differentiate > > >> > > the > > >> > > >>> computation as well as partitioning strategies for low and > high > > >> > degree > > >> > > >>> nodes. > > >> > > >>> > > >> > > >>> The “Node Splitting” abstraction is implemented as a separate > > set > > >> of > > >> > > >>> operators that can be seamlessly combined with current > > >> > degree-oblivious > > >> > > >>> models such as Pregel and PowerGraph, but also with > degree-aware > > >> > > engines > > >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > > >> > overhead > > >> > > >> in > > >> > > >>> the absence of skew and improves the naive implementation of a > > >> subset > > >> > > of > > >> > > >>> computational intensive algorithms by a factor of two. > > >> > > >>> > > >> > > >>> These results have been proven on a theoretical and practical > > >> level > > >> > on > > >> > > >> both > > >> > > >>> real world and synthetic graphs." > > >> > > >>> > > >> > > >>> What I would add: > > >> > > >>> 1) the operators per se in a separate package > > >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity > and > > >> > apply > > >> > > >>> these operators on it) > > >> > > >>> 3). tests > > >> > > >>> 4). a separate section in the gelly docs that explains how to > > >> modify > > >> > > your > > >> > > >>> code to use this solution. > > >> > > >>> > > >> > > >>> Needless to say I will maintain the code and, if, as we get > more > > >> > users, > > >> > > >>> they deem this sub-API useless, we can always deprecate and > > >> delete it > > >> > > :). > > >> > > >>> > > >> > > >>> So, what do you think? > > >> > > >>> Andra > > >> > > >>> > > >> > > >> > > >> > > > > >> > > > > >> > > > >> > > > > > > > > > |
I think skewed graphs can be considered quite common, i.e., not a corner
case. So if there is code to significantly speed up computations on such graphs, this would definitely be interesting for Gelly, IMO. Would it be possible to integrate your approach with existing library algorithms and offer a switch (boolean parameter) to execute the algorithm either with the regular implementation or the skewed-optimized implementation? That requires of course that both methods are semantically equivalent, i.e., produce exactly the same result. If that is given, this would be similar to choosing between a hash- and a sort-based join algorithm. 2015-08-12 16:07 GMT+02:00 Stephan Ewen <[hidden email]>: > Same here as for Max, I am not familiar enough any more to make really good > comments. > > Some generic comments, though: > > - Check whether you really need a technique. Techniques that improve > corner cases, but make the code much more complex and make the behavior of > algorithms less robust are often better to leave out. > > - It is usually easier to something as an optional extension, rather then > adding it to a core operation. Easier to add, easier to remove if one finds > that it does not work out as anticipated in hind sight > > - If it works as a Gelly extension, you can always put it on your own > github repository. The Gelly docs could list a set of known > algorithms/operations on top of Gelly. If the interest in the extension is > high, you can still reconsider to move it into the core codebase. > > Stephan > > > > On Wed, Aug 12, 2015 at 2:35 PM, Maximilian Michels <[hidden email]> > wrote: > > > I think this is a decision to be made by the people involved in the Gelly > > library. I'm not very familiar with graph processing libraries. Thus, it > is > > hard for me to asses the value of this contribution. > > > > However, you outlined pretty well that for highly skewed graphs your > > technique results in a much better performance. So even if your addition > > might not be relevant to all users of the Gelly API, there are users that > > could demand such processing techniques. Since you want to develop, > > maintain, and document the code, I think there is absolutely nothing to > say > > against opening a pull request :) > > > > Cheers, > > Max > > > > > > > > On Wed, Aug 12, 2015 at 12:16 PM, Andra Lungu <[hidden email]> > > wrote: > > > > > I would love to get some feedback from the guys at data Artisans about > > this > > > one. > > > So far, the comments originated and spread in the Stockholm area :) > > > > > > On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <[hidden email]> > > > wrote: > > > > > > > Hi Samia, > > > > > > > > A good method to statistically determine skewed vertices was beyond > the > > > > purpose of my thesis. Unfortunately, the statistical methods that > fit a > > > > power law distribution don't do a good job. So what I do is that I > plot > > > the > > > > degree distribution and then visually determine the threshold. That > is > > > also > > > > how other state of the art systems, such as PowerLyra, fix the issue. > > > > > > > > A way to automate this would be a nice future addition :). > > > > > > > > > > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <[hidden email]> > > > wrote: > > > > > > > >> Dear Andra, > > > >> > > > >> The idea seems pretty nice. > > > >> > > > >> I wonder how you decide the threshold to separate the high degree > > > vertices > > > >> from the low degree vertices. > > > >> > > > >> Regards, > > > >> Samia > > > >> > > > >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[hidden email] > > > > > >> wrote: > > > >> > > > >> > Hi Paris, > > > >> > > > > >> > Nice to virtually meet you too :) > > > >> > > > > >> > Maybe it makes sense to share my freshest chart: > > > >> > > > > >> > > > > >> > > > > > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > > >> > > > > >> > This is for the Community Detection algorithm [1] in which you > > > basically > > > >> > find communities by continuously rescoring vertices. In a > > > vertex-centric > > > >> > scenario (for a skewed graph, in this case, the Twitter follower > > > >> network), > > > >> > since all vertices need to sync after an iteration superstep, they > > > will > > > >> > remain idle until a high degree vertex terminates (computing an > > > >> aggregate > > > >> > over my followers takes significantly less time than computing one > > > over > > > >> > Obama's followers). > > > >> > > > > >> > What you have on the right is the time elapsed in the regular > vertex > > > >> > centric Community Detection (more than two hours). > > > >> > And since we split the node in aggregation trees, in the left > there > > > is a > > > >> > chart with the various levels. Level 3 produces the best result. > > > >> > > > > >> > It's a bit cumbersome to put months of work in a mail, I am still > > > >> polishing > > > >> > the documentation, but feel free to ask questions if this is still > > > >> unclear. > > > >> > > > > >> > > > > >> > All in all, the technique is best for linear algorithms that work > > just > > > >> in > > > >> > Pregel, for a skewed graph. On a high level, it's a generalisation > > of > > > >> > PowerGraph's mirroring, that can work on any system, and that does > > not > > > >> > restrict itself to level = 1 (like PowerGraph does) which in the > > case > > > of > > > >> > out chart there did not bring us too far. > > > >> > > > > >> > Hope that clarified things a bit more! > > > >> > Andra > > > >> > > > > >> > [1] http://arxiv.org/pdf/0808.2633.pdf > > > >> > > > > >> > > > > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[hidden email]> > > wrote: > > > >> > > > > >> > > Hi Andra and nice to meet you btw :) > > > >> > > > > > >> > > It sounds like very fancy way to deal with skew, I like the idea > > > even > > > >> > > though I am not a graph analytics expert. > > > >> > > Have you ran any experiments or benchmarks to see when this > > > >> preferable ? > > > >> > > Users should be aware when they will get benefits by using it > > since > > > >> node > > > >> > > splitting doesn’t come with no cost I guess. > > > >> > > I am really eager to see how this will evolve, I think it’s good > > > >> effort. > > > >> > > > > > >> > > cheers > > > >> > > Paris > > > >> > > > > > >> > > > > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <[hidden email]> > > > >> wrote: > > > >> > > > > > > >> > > > Hi Vasia, > > > >> > > > > > > >> > > > I shall polish the functions a bit, but this is more or less > > what > > > I > > > >> had > > > >> > > in > > > >> > > > mind: > > > >> > > > GSA Jaccard [what we have in Gelly right now]: > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > > >> > > > The same version with node split: > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > >> > > > > > > >> > > > Yes, sure I can also share some charts with the results. I > could > > > say > > > >> > for > > > >> > > > which data sets (power law) and types of algorithms this can > be > > > >> used. > > > >> > If > > > >> > > > it's used appropriately, the overhead is 0. > > > >> > > > > > > >> > > > I'm open to suggestions! > > > >> > > > Thanks! > > > >> > > > Andra > > > >> > > > > > > >> > > > > > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > > >> > > [hidden email] > > > >> > > >> wrote: > > > >> > > > > > > >> > > >> Hi Andra, > > > >> > > >> > > > >> > > >> thanks for offering to add this work to Gelly and for > starting > > > the > > > >> > > >> discussion! > > > >> > > >> > > > >> > > >> How do you think this would look like from an API point of > > view? > > > >> Is it > > > >> > > easy > > > >> > > >> to make it transparent to the application? Could you give us > a > > > >> simple > > > >> > > >> example of what you have in mind? > > > >> > > >> > > > >> > > >> Apart from usability, we should also consider documenting > when > > to > > > >> use > > > >> > > this > > > >> > > >> feature, i.e. for which algorithms, what kind degree > > distribution > > > >> and > > > >> > > what > > > >> > > >> overhead to expect (if any). > > > >> > > >> > > > >> > > >> Cheers, > > > >> > > >> Vasia. > > > >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" < > [hidden email] > > > > > > >> > wrote: > > > >> > > >> > > > >> > > >>> Hey, > > > >> > > >>> > > > >> > > >>> Before actually opening a PR, I wanted to hear your opinion. > > So, > > > >> here > > > >> > > >> goes > > > >> > > >>> nothing :). > > > >> > > >>> > > > >> > > >>> I'd like to add the core of my master thesis to Gelly. That > > is, > > > a > > > >> > > series > > > >> > > >> of > > > >> > > >>> operators that take a skewed graph, split its high degree > > > vertices > > > >> > into > > > >> > > >>> subvertices and redistribute the edges accordingly (thus > > > producing > > > >> > the > > > >> > > >>> result faster due to the increased parallelism); then merge > > the > > > >> > > >> subvertices > > > >> > > >>> into the initial vertex. > > > >> > > >>> > > > >> > > >>> Here's part of my -unrevised- abstract: > > > >> > > >>> "The Twitter follower graph, the citation network and > general > > > >> purpose > > > >> > > >>> social networks follow a power law degree distribution. > These > > > >> highly > > > >> > > >> skewed > > > >> > > >>> graphs raise challenges to current computation models which > > > >> uniformly > > > >> > > >>> process vertices, leading to communication overhead or large > > > >> > execution > > > >> > > >> time > > > >> > > >>> stalls. > > > >> > > >>> > > > >> > > >>> In spite of efforts made to scale up computation on natural > > > graphs > > > >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many > > performance > > > >> > > problems > > > >> > > >>> raise from a subset of vertices that have high degrees. In > > this > > > >> > thesis, > > > >> > > >> we > > > >> > > >>> outline the main processing issues that arise in current > graph > > > >> > systems. > > > >> > > >> We > > > >> > > >>> then propose a novel node splitting technique meant to > > > >> differentiate > > > >> > > the > > > >> > > >>> computation as well as partitioning strategies for low and > > high > > > >> > degree > > > >> > > >>> nodes. > > > >> > > >>> > > > >> > > >>> The “Node Splitting” abstraction is implemented as a > separate > > > set > > > >> of > > > >> > > >>> operators that can be seamlessly combined with current > > > >> > degree-oblivious > > > >> > > >>> models such as Pregel and PowerGraph, but also with > > degree-aware > > > >> > > engines > > > >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces > minimal > > > >> > overhead > > > >> > > >> in > > > >> > > >>> the absence of skew and improves the naive implementation > of a > > > >> subset > > > >> > > of > > > >> > > >>> computational intensive algorithms by a factor of two. > > > >> > > >>> > > > >> > > >>> These results have been proven on a theoretical and > practical > > > >> level > > > >> > on > > > >> > > >> both > > > >> > > >>> real world and synthetic graphs." > > > >> > > >>> > > > >> > > >>> What I would add: > > > >> > > >>> 1) the operators per se in a separate package > > > >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity > > and > > > >> > apply > > > >> > > >>> these operators on it) > > > >> > > >>> 3). tests > > > >> > > >>> 4). a separate section in the gelly docs that explains how > to > > > >> modify > > > >> > > your > > > >> > > >>> code to use this solution. > > > >> > > >>> > > > >> > > >>> Needless to say I will maintain the code and, if, as we get > > more > > > >> > users, > > > >> > > >>> they deem this sub-API useless, we can always deprecate and > > > >> delete it > > > >> > > :). > > > >> > > >>> > > > >> > > >>> So, what do you think? > > > >> > > >>> Andra > > > >> > > >>> > > > >> > > >> > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > > > > > > > > > > |
Free forum by Nabble | Edit this page |