Hi everyone!
Happy new year, first of all, and I hope you had a nice end-of-the-year season.

I thought that now is a good time to officially kick off the creation of a library of machine learning algorithms. There are a lot of individual artifacts and algorithms floating around which we should consolidate.

The machine-learning library in Flink would stand on two legs:

- A collection of efficient implementations for common problems and algorithms, e.g., regression (logistic), clustering (k-means, Canopy), matrix factorization (ALS), ...

- An adapter to the linear algebra DSL in Apache Mahout.

In the long run, the goal would be to mix and match code from both parts. The linear algebra DSL is very convenient for quickly composing an algorithm, or some custom pre- and post-processing steps. For some complex algorithms, however, a low-level, system-specific implementation is necessary to make the algorithm efficient. Being able to call the tailored algorithms from the DSL would combine the benefits.

As a concrete initial step, I suggest the following:

1) We create a dedicated Maven sub-project for the ML library (flink-lib-ml). The project gets two sub-projects: one for the collection of specialized algorithms, one for the Mahout DSL.

2) We add the code for the existing specialized algorithms. As follow-up work, we need to consolidate the data types between those algorithms, to ensure that they can easily be combined/chained.

3) The code for the Flink bindings to the Mahout DSL will actually reside in the Mahout project, which we need to add as a dependency to flink-lib-ml.

4) We add some examples of Mahout DSL algorithms, and a template for how to use them within Flink programs.

5) We create a good introductory readme.md outlining this structure. The readme can also track the implemented algorithms and the ones we put on the roadmap.

Comments welcome :-)

Greetings,
Stephan
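To make the "mix and match" idea above concrete, here is a minimal, purely hypothetical sketch in plain Scala: none of these types or names (DistMatrix, SpecializedAlgorithms, columnMeans) exist in Flink or Mahout; they only illustrate combining a cheap DSL-style expression with a call into a tailored implementation.

```scala
// Hypothetical sketch only -- all names are made up for illustration.
// A toy stand-in for a row-wise distributed matrix.
case class DistMatrix(rows: Vector[Array[Double]]) {
  // DSL-style convenience: scale every entry (easy to express in a linear algebra DSL).
  def *(s: Double): DistMatrix = DistMatrix(rows.map(_.map(_ * s)))
}

// A tailored, system-specific algorithm (here just a dummy computing column means).
object SpecializedAlgorithms {
  def columnMeans(m: DistMatrix): Array[Double] = {
    val n = m.rows.length
    m.rows.transpose.map(col => col.sum / n).toArray
  }
}

object MixAndMatch {
  def main(args: Array[String]): Unit = {
    val a = DistMatrix(Vector(Array(1.0, 2.0), Array(3.0, 4.0)))
    // DSL-like pre-processing, then hand off to the specialized implementation.
    val means = SpecializedAlgorithms.columnMeans(a * 2.0)
    println(means.mkString(","))
  }
}
```

The point is only the shape of the composition: pre-/post-processing stays in DSL expressions, the heavy lifting goes through a dedicated implementation.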
Hi,
Happy new year, everyone. I hope you all had some relaxing holidays.

I really like the idea of having a machine learning library, because it allows users to quickly solve problems without having to dive too deep into the system. Moreover, it is a good way to show what the system is capable of in terms of expressiveness and programming paradigms.

Currently, we already have more or less optimised versions of several ML algorithms implemented with Flink. I'm aware of the following implementations: PageRank, ALS, KMeans, ConnectedComponents. I think these algorithms constitute a good foundation for the ML library.

I like the idea of having optimised algorithms which can be mixed with Mahout DSL code. As far as I can tell, the interoperation should not be too difficult if the "future" Flink backend is used to execute the Mahout DSL program. Internally, the Mahout DSL performs its operations on a row-wise partitioned matrix which is represented as a DataSet[(Key, Vector)]. Providing some wrapper functions to transform different matrix representations into the row-wise representation should be the first step.

Another idea could be to investigate to what extent Flink can interact with a parameter server, and which algorithms could be adapted to benefit from such systems.

Greetings,

Till

On Fri, Jan 2, 2015 at 3:46 PM, Stephan Ewen <[hidden email]> wrote:
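The row-wise partitioned representation Till mentions can be sketched in plain Scala. Note the hedging: an ordinary Seq stands in for a Flink DataSet, a plain Array[Double] stands in for a Mahout Vector, and the wrapper name toRowWise is an assumption, not an existing API.

```scala
// Plain-Scala sketch of the row-wise partitioned matrix representation,
// DataSet[(Key, Vector)], using stand-in types (no Flink/Mahout dependency).
object RowWiseSketch {
  type Key = Long
  type Vec = Array[Double] // stand-in for a Mahout Vector

  // Wrapper from a dense local matrix to the row-wise (Key, Vector) form.
  def toRowWise(dense: Array[Array[Double]]): Seq[(Key, Vec)] =
    dense.toSeq.zipWithIndex.map { case (row, i) => (i.toLong, row) }

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    // In actual Flink code this could feed something like
    //   env.fromCollection(toRowWise(m))  // yielding a DataSet[(Key, Vec)]
    toRowWise(m).foreach { case (k, v) => println(s"$k -> ${v.mkString(",")}") }
  }
}
```

Wrappers for other representations (column-wise, coordinate/COO, block-partitioned) would each map to the same (Key, Vector) target form.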
+1 for the initial steps, which I can implement.
On Sat, Jan 3, 2015 at 8:15 PM, Till Rohrmann <[hidden email]> wrote:
Happy new year all!
Like the idea of adding an ML module to Flink. As I have mentioned to Kostas, Stephan, and Robert before, I would love to see if we could work with the H2O project [1], and it seems the community has added support for it in the Apache Mahout backend bindings [2]. So we might get some additional scalable ML algorithms, like deep learning.

Definitely would love to help with this initiative =)

- Henry

[1] https://github.com/h2oai/h2o-dev
[2] https://issues.apache.org/jira/browse/MAHOUT-1500

On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <[hidden email]> wrote:
The idea to work with H2O sounds really interesting.
In terms of the Mahout DSL, this would mean that we have to translate a Flink DataSet into H2O's basic abstraction of distributed data and vice versa. Anything other than writing to disk with one system and reading from there with the other is probably non-trivial and hard to realize.

On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> wrote:
Thanks Henry!
Do you know of a good source that gives pointers or examples on how to interact with H2O?

Stephan

On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <[hidden email]> wrote:
0xdata (now called H2O) is developing an integration with Spark called Sparkling Water [1]. It creates a new RDD that can connect to an H2O cluster and pass higher-order functions into the ML flow.

The easiest way to use H2O is via the R bindings [2][3], but I think we would want to interact with H2O via its REST APIs [4].

- Henry

[1] https://github.com/h2oai/sparkling-water
[2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
[3] http://docs.h2o.ai/Ruser/rtutorial.html
[4] http://docs.h2o.ai/developuser/rest.html

On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <[hidden email]> wrote:
Great Henry,
Sparkling Water looks really interesting. We would probably have to take a similar approach to make Flink interact with H2O.

I briefly looked into the code, and this is how they do it (or at least how I understood it ;-) ). Correct me if I got it wrong:

They first start up a Spark cluster and, from within an RDD, start an H2O worker. Afterwards, every node running a Spark executor also runs an H2O worker, so each node has access to H2O's distributed key-value store. I really like how elegantly they set up H2O from within Spark :-) Flink should be capable of doing the same. We only have to leave some memory for H2O.

Once this is done, they only have to translate an RDD into a DataFrame and vice versa:

RDD => DataFrame: This operation is realized as a Spark mapPartition operation which is executed as an action. Currently, tuple elements and a SchemaRDD are supported. The mapPartition operation takes the tuples and groups fields with the same positional index together to build H2O's vector chunks, which are then stored in the distributed key-value store (DKV). We can do the same.

DataFrame => RDD: Here they implemented an H2ORDD which reads the partition's corresponding vector chunks from the DKV and constructs an instance for each row. We should be able to convert a DataFrame to a DataSet by implementing an H2ODKVInputFormat which basically does the same.

I think the implementation details might not be trivial, but judging from the size of sparkling-water (about 1300 lines of Scala code) it should definitely be feasible. And, as Henry already mentioned, with the H2O integration we get a lot of ML algorithms for free. We could also talk to 0xdata to see if they are interested in helping us with the effort.

Greets,

Till

On Wed, Jan 7, 2015 at 7:08 PM, Henry Saputra <[hidden email]> wrote:
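The two conversions Till describes can be sketched in plain Scala. This is a simplification under stated assumptions: ordinary Seqs stand in for RDD/DataSet partitions, a "chunk" is reduced to one whole column of a partition (H2O's real chunk layout is more involved), and the function names are made up.

```scala
// Simplified sketch of the RDD <=> DataFrame conversions, no Spark/H2O dependency.
object FrameConversionSketch {
  type Row = Array[Double]

  // RDD => DataFrame direction: within one partition, group fields with the
  // same positional index together, producing per-column "vector chunks".
  def rowsToColumnChunks(partition: Seq[Row]): Seq[Array[Double]] =
    partition.transpose.map(_.toArray)

  // DataFrame => RDD direction: rebuild one row instance per index from the
  // column chunks, as an InputFormat reading from the DKV would.
  def columnChunksToRows(chunks: Seq[Array[Double]]): Seq[Row] =
    chunks.transpose.map(_.toArray)

  def main(args: Array[String]): Unit = {
    val partition = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
    val chunks = rowsToColumnChunks(partition) // 3 chunks, one per column
    val back   = columnChunksToRows(chunks)    // round-trips back to the rows
    println(chunks.map(_.mkString(",")).mkString(" | "))
  }
}
```

In the Flink case, rowsToColumnChunks corresponds to the body of a mapPartition writing chunks into the DKV, and columnChunksToRows to the hypothetical H2ODKVInputFormat reading them back.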
Hi Till,
Yes, that is pretty much how they do it. The trick is getting access to shared data between the Spark realm and the H2O realm. In the previous generation they used Tachyon, but in the latest stint they run H2O inside the Spark executor and use the shared heap memory to do the trick.

I am trying to hook us up with the H2O guys; let's hope it pays off =)

- Henry

On Thu, Jan 8, 2015 at 3:48 AM, Till Rohrmann <[hidden email]> wrote:
> > I think that the implementation details might not be trivial but judging > from the size of sparkling-water (1300 lines of Scala code) it should > definitely be feasible. And as Henry already mentioned by having the H2O > integration, we get a lot of ML algorithms for free. We could also talk to > 0xdata to see if they are interested to help us with the effort. > > Greets, > > Till > > > On Wed, Jan 7, 2015 at 7:08 PM, Henry Saputra <[hidden email]> > wrote: > >> 0xdata (now is called H2O) is developing integration with Spark with >> the project called Sparkling Water [1]. >> It creates new RDD that could connect to H2O cluster to pass the >> higher order function to execute in the ML flow. >> >> The easiest way to use H2O is with R binding [2][3] but I think we >> would want to interact with H2O as via the REST APIs [4]. >> >> - Henry >> >> [1] https://github.com/h2oai/sparkling-water >> [2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r >> [3] http://docs.h2o.ai/Ruser/rtutorial.html >> [4] http://docs.h2o.ai/developuser/rest.html >> >> On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <[hidden email]> wrote: >> > Thanks Henry! >> > >> > Do you know of a good source that gives pointers or examples how to >> > interact with H2O ? >> > >> > Stephan >> > >> > >> > On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <[hidden email]> >> wrote: >> > >> >> The idea to work with H2O sounds really interesting. >> >> >> >> In terms of the Mahout DSL this would mean that we have to translate a >> >> Flink dataset into H2O's basic abstraction of distributed data and vice >> >> versa. Everything other than writing to disk with one system and reading >> >> from there with the other is probably non-trivial and hard to realize. >> >> On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> >> wrote: >> >> >> >> > Happy new year all! >> >> > >> >> > Like the idea to add ML module with Flink. 
>> >> > >
>> >> > As I have mentioned to Kostas, Stephan, and Robert before, I would >> >> > love to see if we could work with H20 project [1], and it seemed like >> >> > the community has added support for it for Apache Mahout backend >> >> > binding [2]. >> >> > >> >> > So we might get some additional scale ML algos like deep learning. >> >> > >> >> > Definitely would love to help with this initiative =) >> >> > >> >> > - Henry >> >> > >> >> > [1] https://github.com/h2oai/h2o-dev >> >> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500 |
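The RDD => DataFrame and DataFrame => RDD paths Till describes above (group fields with the same positional index into per-column vector chunks, then have an input format read them back row-wise) can be sketched in a few lines. This is illustrative Python pseudocode under made-up names, not H2O's or Sparkling Water's actual API:

```python
# Sketch of Sparkling Water's conversion idea (illustrative, not the real API):
# rows_to_chunks mimics the mapPartition that builds column-wise "vector
# chunks" for the distributed key-value store (DKV); chunks_to_rows mimics
# what an InputFormat (e.g. a hypothetical H2ODKVInputFormat) would do.

def rows_to_chunks(partition):
    """Group fields with the same positional index into per-column chunks."""
    columns = {}
    for row in partition:
        for i, field in enumerate(row):
            columns.setdefault(i, []).append(field)
    # one chunk per column index; a real implementation would store these
    # chunks in the DKV instead of returning them
    return columns

def chunks_to_rows(chunks):
    """Reconstruct row tuples from column chunks."""
    n_cols = len(chunks)
    n_rows = len(chunks[0])
    return [tuple(chunks[c][r] for c in range(n_cols)) for r in range(n_rows)]

partition = [(1, 2.0, "a"), (2, 3.0, "b"), (3, 4.0, "c")]
chunks = rows_to_chunks(partition)
assert chunks[0] == [1, 2, 3]               # first column chunk
assert chunks_to_rows(chunks) == partition  # round trip
```

The same two functions would be the core of a Flink version, with the partition iteration done inside a mapPartition over a DataSet.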
In terms of the Mahout DSL it means implementing a bunch of physical operators
such as transpose, A'B or B'A on large row- or column-partitioned matrices. The Mahout optimizer takes care of simplifying algebraic expressions, such as 1 + exp(drm) => drm.apply-unary(1 + exp(x)), and of tracking things like identical partitioning of datasets where applicable. Adding a backend turned out to be pretty trivial for the H2O backend. I don't think there's a need to bridge Flink operations to any of the existing backends of the DSL to support Flink; instead I was hoping Flink could have its own native backend. Also keep in mind that I am not aware of any actual applications with Mahout + the H2O backend, whereas I have built dozens on Spark in the past 2 years (albeit on a quite intensely hacked version of the bindings). With Flink it is a bit less trivial, I guess, because the Mahout optimizer sometimes tries to run quick summary things, such as matrix geometry detection, as part of the optimizer action itself. And Flink, last time I checked, did not support running multiple computational actions on a cache-pinned dataset. So that was the main difficulty last time. Perhaps no more. Or perhaps there's a clever way to work around this; not sure. At the same time, a Flink backend for the algebraic optimizer may also be more trivial than, e.g., the H2O backend was, since H2O insisted on marrying their own in-core Matrix API, so they created a bridge between the Mahout (former Colt) Matrix API and one of their own. Whereas if the distributed system just takes care of serializing Writables of Matrices and Vectors (as was done for the Spark backend), then there's practically nothing to do. Which, my understanding is, Flink does well. Now, on the part of current shortcomings: there is a fairly long list of performance problems, many of which I fixed internally but not publicly. Hopefully some or all of that will eventually be pushed, sooner rather than later. The biggest sticking issue is the in-core performance of the Colt APIs. 
Even JVM-based matrices could do much better if they took a bit of a cost-based approach, and perhaps integrated somewhat more sophisticated Winograd-type GEMM approaches. Not to mention integration of things like Magma, jcuda, or even simply netlib.org native dense BLAS stuff (as in Breeze). The good thing, though, is that this does not affect backend work, since once it is fixed for in-core algebra it is fixed everywhere; this is hidden behind scalabindings (the in-core algebra DSL). These in-core and out-of-core performance optimization issues are probably the only thing between the code base and a good, well-rounded release. The out-of-core bad stuff is mostly fixed internally, though. Anyone interested in working on any of the issues I mentioned, please throw a note on the Mahout list to me or Suneel Marthi. On the issue of optimized ML stuff vs. general algebra, here's food for thought. I found that 70% of algorithms are not purely algebraic in the R-like operation-set sense. But close to 95% could contain significant elements of simplification through algebra. Even probabilistic things that just use stuff like MVN or Wishart sampling, or Gaussian processes. A lot easier to read and maintain. Even Lloyd iteration has a simple algebraic expression (it turns out). Close to 80% could not avoid using some algebra. Only 5% could get away with not doing algebra at all. Thus, 95% of the things I ever worked on are either purely or quasi-algebraic, even when they are probabilistic at the core. What it means is that any ML library would benefit enormously if it acts in terms of common algebraic data structures. It will create the opportunity for pipelines to seamlessly connect elements of learning such as standardization, dimensionality reduction, MDS/visualization methods, recommender methods, as well as clustering methods. 
My general criticism of MLlib has been that, until recently, they did not work on creating such common math structure standards, algebraic data in particular, and every new method's input/output came in its own form and shape. It still does. So the dilemma of separating efforts into ones using and not using algebra is a bit false. Most methods are quasi-algebraic (meaning they have at least some need for R-like matrix manipulations). Of course there's a need for specific distributed primitives from time to time, there's no arguing about it (like I said, 70% of all methods cannot be purely algebraic; only about 25% are). There's an issue of algebra physical-op performance on Spark and in-core, but there's no reason (for me) to believe that the effort to fix it in POJO-based implementations would require any less effort. There were also some discussions on whether and how it is possible to create quasi-algebraic methods such that their non-algebraic part is easy to port to another platform (provided the algebraic part is already compatible), but that's completely another topic. But what I am saying is that an additional benefit might be that moving ML methodology to a platform-agnostic package in Mahout (if such interest ever appears) would also be much easier, even for quasi-algebraic approaches, if the solutions gravitated towards using algebra as opposed to not using it. Just some food for thought. thanks. -D On Sun, Jan 4, 2015 at 10:14 AM, Till Rohrmann <[hidden email]> wrote: > The idea to work with H2O sounds really interesting. > > In terms of the Mahout DSL this would mean that we have to translate a > Flink dataset into H2O's basic abstraction of distributed data and vice > versa. Everything other than writing to disk with one system and reading > from there with the other is probably non-trivial and hard to realize. > On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> wrote: > > > Happy new year all! > > > > Like the idea to add ML module with Flink. 
|
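To make Dmitriy's point about logical operators and rule-based rewrites concrete: the optimizer walks a logical expression tree and replaces patterns like t(A) %*% A with a fused physical operator. Here is a toy sketch of that idea; all class names are illustrative, not Mahout's actual optimizer code:

```python
# Toy logical-operator tree with one rule-based rewrite:
# t(A) %*% A is collapsed into a fused AtA physical operator, the kind of
# "more complex but still logical operator" a backend can implement directly.

class Op:  # base class for operator nodes
    pass

class Leaf(Op):
    def __init__(self, name): self.name = name

class Transpose(Op):
    def __init__(self, a): self.a = a

class MatMul(Op):
    def __init__(self, a, b): self.a, self.b = a, b

class AtA(Op):  # fused operator for t(A) %*% A
    def __init__(self, a): self.a = a

def rewrite(node):
    """Bottom-up rewrite pass over the logical tree."""
    if isinstance(node, MatMul):
        a, b = rewrite(node.a), rewrite(node.b)
        if isinstance(a, Transpose) and a.a is b:
            return AtA(b)  # rule fires: same operand on both sides
        return MatMul(a, b)
    if isinstance(node, Transpose):
        return Transpose(rewrite(node.a))
    return node

A = Leaf("A")
plan = rewrite(MatMul(Transpose(A), A))
assert isinstance(plan, AtA)  # t(A) %*% A recognized as one physical op
```

A Flink backend would then translate nodes like AtA into DataSet operations, just as the Spark backend translates them into an RDD lineage.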
Thank you Dmitriy!
That was quite a bit of food for thought - I think I will need a bit more time to digest that ;-) Especially the part about the algebraic data structures and how they allow you to pipeline and combine algorithms is exactly along the lines of what I was thinking - thank you for sharing your observations there. Concerning features of Flink: We recently had a large piece of code going in that is exactly the underpinning for running multiple programs on cached data sets. Some of the primitives to bring back counters or data sets to the driver exist already in pull requests, others will be added soon. So I think we will be good there. I would go for an integration between Flink and Mahout along the same lines as the integration with Spark, rather than in the way it is done with H2O. Greetings, Stephan On Tue, Jan 13, 2015 at 11:53 PM, Dmitriy Lyubimov <[hidden email]> wrote: > In terms of Mahout DSL it means implementing a bunch of physical operators > such as transpose, A'B or B'A on large row or column partitioned matrices. > > Mahout optimizer takes care of simplifying algebraic expressions such as 1+ > exp(drm) => drm.apply-unary(1+exp(x)) and tracking things like identical > partitioning of datasets where applicable. > > Adding a back turned out to be pretty trivial for h20 back. I don't think > there's a need to bridge flink operations to any of existing backs of DSL > for support of Flink. instead i was hoping flink could have its own native > back. Also keep in mind that I am not aware of any actual applications with > Mahout + h2o back whereas i have built dozens on Spark in past 2 years > (albeit on a quite intensely hacked version of the bindings). > > With Flink it is a bit less trivial i guess because mahout optimizer > sometimes tries to run quick summary things such as matrix geometry > detection as a part of optimizer action itself. And Flink, last time i > checked, did not support running multiple computational actions on a > cache-pinned dataset. 
|
On Thu, Jan 8, 2015 at 5:09 PM, Henry Saputra <[hidden email]>
wrote: > I am trying to hook us up with H2O guys, lets hope it pays off =) > I know the CEO and CTO reasonably well and will ping them. |
Thanks Ted, I have reached out to them as well. The more requests the
merrier I suppose =) - Henry On Wed, Jan 14, 2015 at 11:45 AM, Ted Dunning <[hidden email]> wrote: > On Thu, Jan 8, 2015 at 5:09 PM, Henry Saputra <[hidden email]> > wrote: > >> I am trying to hook us up with H2O guys, lets hope it pays off =) >> > > I know the CEO and CTO reasonably well and will ping them. |
Hi All,
At TUB, we are looking at the possibility of hiring a new student programmer and possibly another Masters student to work on integrating Flink with the Mahout DSL, to get a declarative language that can then be used to implement other ML algorithms. Just wanted to know if someone has already started looking into this topic, and whether there are any efforts already under way in this direction? If yes, what were the main challenges faced? Would be interesting to know. If not, I would also be interested to hear some possible design decisions in order to make this work. Cheers, Manu |
Hi Manu,
I looked into it and I'm also working on the integration. When I made my first attempt, Flink did not properly support interfaces and subclasses. This was relevant because the distributed row-wise partitioned matrices can be indexed by int keys or string keys. By now, this should be fixed. Another issue is retrieving intermediate results to the driver program. But we can work around this problem by either using the RemoteCollectorOutputFormat or using the convenience methods which will be introduced by the pending PR #210 [1]. Apart from that, I hope that Flink supports all means necessary to implement the Mahout DSL. I don't know how quickly you'll find someone to work on the Mahout DSL, but considering that I'm quite familiar with the topic and have already started working on it, it would make sense for me to continue. Once the Mahout DSL is supported, one could start thinking about mixed specialized Flink algorithms with high-level linear algebra pre- and post-processing using the DSL. Moreover, Alexander told me that you have the IMPRO-3 course [2] where you implemented several ML algorithms which could be ported to the latest version of Flink. At the least, that would be a good way to familiarize oneself with the system. Greets, Till [1] https://github.com/apache/flink/pull/210 [2] https://github.com/TU-Berlin-DIMA/IMPRO-3.SS14 On Sat, Jan 31, 2015 at 6:10 PM, mkaul <[hidden email]> wrote: > Hi All, > At TUB, we are looking at the possibility of hiring a new student > programmer > and possibly another Masters student to work on integrating Flink with > Mahout DSL, to get a declarative language that can then be used to > implement > other ML algorithms. > Just wanted to know if someone has already started looking into this topic > and if there were any efforts already started in this direction? If yes, > what were the main challenges faced? Would be interesting to know. 
> If not, I would also be interested to hear some possible design decisions > in > order to make this work. > > Cheers, > Manu > > > > -- > View this message in context: > http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3592.html > Sent from the Apache Flink (Incubator) Mailing List archive. mailing list > archive at Nabble.com. > |
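For anyone picking this up: the row-wise partitioned matrix representation Till mentions (a distributed collection of (key, vector) pairs, where keys may be ints or strings) can be illustrated with a small sketch. The function names below are made up for illustration; they are not Flink or Mahout API:

```python
# Sketch of a row-wise partitioned matrix as (key, row-vector) pairs.
# In Flink this would be a DataSet[(Key, Vector)]; here plain lists stand in.

def to_row_wise(matrix, keys=None):
    """Wrap a dense matrix (list of rows) as (key, vector) pairs."""
    if keys is None:
        keys = range(len(matrix))  # default: int row indices
    return list(zip(keys, matrix))

def transpose_row_wise(rows):
    """A'; well-defined only for int-keyed matrices, since the transpose's
    column indices come from the original row keys."""
    n_cols = len(rows[0][1])
    return [(j, [vec[j] for _, vec in rows]) for j in range(n_cols)]

drm = to_row_wise([[1, 2], [3, 4], [5, 6]])
assert drm[0] == (0, [1, 2])
assert transpose_row_wise(drm) == [(0, [1, 3, 5]), (1, [2, 4, 6])]

named = to_row_wise([[1, 2]], keys=["user-1"])  # string-keyed rows
assert named == [("user-1", [1, 2])]
```

The wrapper functions Till proposed would do essentially this: adapt whatever matrix representation an algorithm uses into the (key, vector) form the DSL operates on.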
Ok makes sense to me! :) So we will find something else for the student to do.
Would it then be possible for you to maybe briefly describe your strategy for how you will make this work? Like slides or a design diagram maybe? It might be useful for us to see how the internals would look at our end too. Cheers, Manu |
I may be able to help.
The official link on the Mahout talks page points to SlideShare, which mangles the slides in a weird way, so here's the (hidden) link to the pptx source in case it helps: http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx On Mon, Feb 2, 2015 at 1:00 AM, mkaul <[hidden email]> wrote: > Ok makes sense to me! :) So we will find something else for the student to > do. > Would it then be possible for you to maybe very briefly describe your > strategy of how > you will make this work? Like slides for a design diagram maybe? > It might be useful for us to see how the internals would look like at our > end too. > > Cheers, > Manu > > > > -- > View this message in context: > http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3594.html > Sent from the Apache Flink (Incubator) Mailing List archive. mailing list > archive at Nabble.com. > |
Perhaps one of the good ways to go about it is to look at the Spark module of
Mahout. The minimum needed is the functionality in sparkbindings.SparkEngine plus CheckpointedDRM support. The idea is simple. When expressions are written, they are translated into logical operators implementing the DrmLike[K] API, which know nothing of the concrete engine. When an expression checkpoint is invoked, the optimizer applies rule-based optimizations (which may create more complex, but still logical, operators like A'B, AB' or A'A type of things, or elementwise unary function applications translated from elementwise expressions like exp(A*A)). The strategies building engine-specific physical lineages for these BLAS-like building primitives are found in the sparkbindings.blas package. Unfortunately the public version is, mm... quite a bit behind mine there... These rewritten operators are passed to Engine.toPhysical, which creates a "checkpointed" DRM. A checkpointed DRM is an encapsulation of engine-specific physical operators (at this point the logical part knows very little of them). In the case of Spark, the logic goes over the created TWA and translates it into a Spark RDD lineage (applying some cost estimates along the way). So the main points are to implement Engine.toPhysical and Flink's version of CheckpointedDRM. A few other concepts are broadcasting of in-core matrices and vectors (making them available in every task), and an explicit cache pinning level (which right now maps 1:1 to the supported cache strategies in Spark - but that can be adjusted). Algorithms provide cache-level hints as a sign that they intend to go over the same source matrix again and again (welcome to the iterative world...). Finally, another concept is that the in-core stuff is backed by the scalabindings DSL for mahout-math (i.e., pure in-core algebra), which is largely consistent with the distributed operations DSL; i.e., matrix multiplication will look like A %*% B regardless of whether A or B are in-core or distributed. 
The major difference for in-core is that in-place modification operators are enabled such as += *= or function assignment like mxA := exp _ (elementwise taking of an exponent) but distributed references of course are immutable (logically anyway). I need to update docs for all forms of in-core (aka 'scalabindings') DSL, but basic version of documentation kind of covers the capabilities more or less. So... as long as mahout-math's Vector and Matrix interface serialization is supported in the backend, that's the only persistence that is needed. No other types are either persisted or passed around. Mahout supports implicit Writable conversion for those in scala (or, more concretely, MatrixWritable and VectorWritable), and I have added native kryo support for those as well (not in public version). Glossary: DRM = distributed row matrix (row-wise partitioned matrix representation). TWA= tree walking automaton On Tue, Feb 3, 2015 at 5:33 PM, Dmitriy Lyubimov <[hidden email]> wrote: > I may be able to help. > > The official link on mahout talks page points to slide share, which > mangles slides in a weird way, but if it helps, here's the (hidden) link to > pptx source of those in case it helps: > > > http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx > > On Mon, Feb 2, 2015 at 1:00 AM, mkaul <[hidden email]> wrote: > >> Ok makes sense to me! :) So we will find something else for the student to >> do. >> Would it then be possible for you to maybe very briefly describe your >> strategy of how >> you will make this work? Like slides for a design diagram maybe? >> It might be useful for us to see how the internals would look like at our >> end too. >> >> Cheers, >> Manu >> >> >> >> -- >> View this message in context: >> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3594.html >> Sent from the Apache Flink (Incubator) Mailing List archive. mailing list >> archive at Nabble.com. 
>> > > |
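The checkpoint lifecycle Dmitriy outlines above (expressions build an engine-agnostic logical plan; checkpoint runs the optimizer and asks the engine, via Engine.toPhysical, for a CheckpointedDRM) can be condensed into a toy sketch. The class and method names mirror his description but are illustrative only, not the actual Mahout interfaces:

```python
# Toy sketch of the DSL lifecycle: logical plan -> optimize -> toPhysical.
# A real Flink backend would build a DataSet lineage inside to_physical.

class LogicalPlan:
    def __init__(self, desc, children=()):
        self.desc, self.children = desc, children

    def checkpoint(self, engine):
        """Trigger optimization and hand the plan to the engine."""
        optimized = engine.optimize(self)
        return engine.to_physical(optimized)  # engine-specific lineage

class CheckpointedDRM:
    """Encapsulates the engine-specific physical operators."""
    def __init__(self, plan): self.plan = plan

class ToyEngine:
    """Stand-in for a backend (Spark, H2O, or a future Flink one)."""
    def optimize(self, plan):
        return plan  # rule-based rewrites would happen here

    def to_physical(self, plan):
        return CheckpointedDRM(plan)

expr = LogicalPlan("A' %*% B", children=(LogicalPlan("A"), LogicalPlan("B")))
drm = expr.checkpoint(ToyEngine())
assert isinstance(drm, CheckpointedDRM) and drm.plan.desc == "A' %*% B"
```

The design choice worth noting is that only checkpoint() crosses the engine boundary; everything before it is pure, engine-agnostic expression building, which is what makes a new backend implementable without touching the DSL itself.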