I would like to propose that Flink include a selection of graph generators in Gelly. Generated graphs will be useful for performing scalability, stress, and regression testing as well as benchmarking and comparing algorithms, both for Flink users and developers. Generated data is infinitely scalable yet described by a few simple parameters and can often substitute for user data or sharing large files when reporting issues.

Spark's GraphX includes a modest GraphGenerators class [1].

The initial implementation would focus on Erdos-Renyi, R-Mat [2], and Kronecker [3] generators.

A key consideration is that the graphs should be seedable and generate the same Graph regardless of parallelism.

Generated data is a complement to my proposed "Checksum method for DataSet and Graph" [4].

[1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
[2] R-MAT: A Recursive Model for Graph Mining; http://snap.stanford.edu/class/cs224w-readings/chakrabarti04rmat.pdf
[3] Kronecker graphs: An Approach to Modeling Networks; http://arxiv.org/pdf/0812.4905v2.pdf
[4] https://issues.apache.org/jira/browse/FLINK-2716

Greg Hogan
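
To illustrate the requirement that a seeded generator produce the same Graph regardless of parallelism, here is a minimal sketch in plain Python rather than Flink. It is not taken from the proposal: the names `BLOCK_SIZE`, `block_edges`, and `generate` are hypothetical, and it emits uniformly random edges as a stand-in for a real Erdos-Renyi or R-MAT model. The idea shown is one possible approach: split the edge range into fixed-size blocks and seed each block's random number generator from the pair (seed, block id), so the output depends only on the seed and never on which worker handles which block.

```python
import random

# Edges per block. Hypothetical value; what matters is that it is fixed
# and does not depend on the number of workers.
BLOCK_SIZE = 1000


def block_edges(seed, block_id, num_vertices, count):
    """Generate `count` random edges for one block from its own seeded RNG."""
    # Each block gets an independent RNG derived from (seed, block_id).
    rng = random.Random(f"{seed}:{block_id}")
    return [
        (rng.randrange(num_vertices), rng.randrange(num_vertices))
        for _ in range(count)
    ]


def generate(seed, num_vertices, num_edges, parallelism):
    """Simulate `parallelism` workers; worker w handles blocks w, w + p, w + 2p, ..."""
    num_blocks = (num_edges + BLOCK_SIZE - 1) // BLOCK_SIZE
    edges = []
    for worker in range(parallelism):
        for block_id in range(worker, num_blocks, parallelism):
            start = block_id * BLOCK_SIZE
            # The last block may be shorter than BLOCK_SIZE.
            count = min(BLOCK_SIZE, num_edges - start)
            edges.extend(block_edges(seed, block_id, num_vertices, count))
    # A graph is an unordered collection of edges, so sort to compare runs.
    return sorted(edges)
```

Calling `generate(42, 100, 2500, 1)` and `generate(42, 100, 2500, 4)` yields the same edge list, because the assignment of blocks to workers changes but the contents of each block do not. In a distributed setting the same effect would presumably come from having each parallel task derive its block ids from its task index.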

Hi Greg,

thank you for this proposal! I think graph generators will be a very useful addition to Gelly.

I'm not quite familiar with the state-of-the-art algorithms for distributed graph generation. I suppose that we could easily provide an efficient random graph generator, and I've also seen some work on parallel/distributed algorithms for R-MAT [1, 2]. Are you aware of similar work for Erdos-Renyi, Kronecker, or other types of graphs?

Another place we might want to look at is Giraph's Watts-Strogatz generator [3].

Cheers,
Vasia.

[1]: https://github.com/farkhor/PaRMAT/
[2]: http://arxiv.org/pdf/1210.0187.pdf
[3]: https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/WattsStrogatzVertexInputFormat.html

On 23 September 2015 at 19:49, Greg Hogan <[hidden email]> wrote:
> I would like to propose that Flink include a selection of graph generators
> in Gelly. [...]

I would be happy to see some generators in Gelly for exactly the reasons you've mentioned. It's always difficult for me to get some testing data when running Flink on a new cluster ... so this would help me ;)

On Thu, Sep 24, 2015 at 11:03 AM, Vasiliki Kalavri <[hidden email]> wrote:
> Hi Greg,
>
> thank you for this proposal!
> I think graph generators will be a very useful addition to Gelly. [...]