Kicking off the Machine Learning Library

Stephan Ewen
Hi everyone!

Happy new year, first of all and I hope you had a nice end-of-the-year
season.

I think now is a good time to officially kick off the creation of
a library of machine learning algorithms. There are a lot of individual
artifacts and algorithms floating around which we should consolidate.

The machine-learning library in Flink would stand on two legs:

 - A collection of efficient implementations for common problems and
algorithms, e.g., Regression (logistic), clustering (k-Means, Canopy),
Matrix Factorization (ALS), ...

 - An adapter to the linear algebra DSL in Apache Mahout.

In the long run, the goal is to be able to mix and match code from
both parts.
The linear algebra DSL is very convenient when it comes to quickly
composing an algorithm, or some custom pre- and post-processing steps.
For some complex algorithms, however, a low level system specific
implementation is necessary to make the algorithm efficient.
Being able to call the tailored algorithms from the DSL would combine the
benefits.


As a concrete initial step, I suggest we do the following:

1) We create a dedicated Maven sub-project for the ML library
(flink-lib-ml). The project gets two sub-projects: one for the collection
of specialized algorithms, one for the Mahout DSL.

2) We add the code for the existing specialized algorithms. As follow-up
work, we need to consolidate the data types between those algorithms, to
ensure that they can easily be combined/chained.

3) The code for the Flink bindings to the Mahout DSL will actually reside
in the Mahout project, which we need to add as a dependency to flink-lib-ml.

4) We add some examples of Mahout DSL algorithms, and a template for how to
use them within Flink programs.

5) Create a good introductory readme.md, outlining this structure. The
readme can also track the implemented algorithms and the ones we put on the
roadmap.
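
A possible layout for step 1, sketched as a Maven parent POM (a sketch only;
the module names here are placeholders for illustration, not decided names):

```xml
<!-- Hypothetical parent POM for the proposed flink-lib-ml module;
     module names are placeholders, not decided names. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <artifactId>flink-lib-ml</artifactId>
  <packaging>pom</packaging>

  <modules>
    <!-- collection of specialized, hand-tuned algorithm implementations -->
    <module>flink-lib-ml-algorithms</module>
    <!-- adapter/examples for the Mahout linear algebra DSL -->
    <module>flink-lib-ml-mahout</module>
  </modules>
</project>
```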


Comments welcome :-)


Greetings,
Stephan

Re: Kicking off the Machine Learning Library

Till Rohrmann
Hi,

happy new year everyone. I hope you all had some relaxing holidays.

I really like the idea of having a machine learning library because this
allows users to quickly solve problems without having to dive too deep into
the system. Moreover, it is a good way to show what the system is capable
of in terms of expressiveness and programming paradigms.

Currently, we already have more or less optimised versions of several ML
algorithms implemented with Flink. I'm aware of the following
implementations: PageRank, ALS, KMeans, ConnectedComponents. I think that
these algorithms constitute a good foundation for the ML library.

I like the idea of having optimised algorithms which can be mixed with Mahout
DSL code. As far as I can tell, the interoperation should not be too
difficult if the "future" Flink backend is used to execute the Mahout DSL
program. Internally, the Mahout DSL performs its operations on a row-wise
partitioned matrix which is represented as a DataSet[(Key, Vector)].
Providing some wrapper functions to transform different matrix
representations into the row-wise representation should be the first step.
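
To illustrate the row-wise representation described above (plain Java, not
Flink or Mahout API code; `Row` and `toRowWise` are invented names for this
sketch), a matrix becomes a collection of (row index, row vector) pairs,
which is the shape such a wrapper function would have to produce:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a row-wise partitioned matrix as (rowIndex, rowVector) pairs. */
public class RowWiseMatrix {
    /** One matrix row, keyed by its row index. */
    record Row(int key, double[] vector) {}

    /** Wrap a dense 2-D array into the row-wise representation. */
    static List<Row> toRowWise(double[][] dense) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            rows.add(new Row(i, dense[i]));
        }
        return rows;
    }

    public static void main(String[] args) {
        double[][] m = {{1.0, 2.0}, {3.0, 4.0}};
        List<Row> rows = toRowWise(m);
        System.out.println(rows.size());       // 2
        System.out.println(rows.get(1).key()); // 1
    }
}
```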

Another idea could be to investigate to what extent Flink can interact with
a Parameter Server and which algorithms could be adapted to benefit from
such systems.

Greetings,

Till

On Fri, Jan 2, 2015 at 3:46 PM, Stephan Ewen <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Till Rohrmann
+1 for the initial steps, which I can implement.

On Sat, Jan 3, 2015 at 8:15 PM, Till Rohrmann <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Henry Saputra
In reply to this post by Stephan Ewen
Happy new year all!

Like the idea to add ML module with Flink.

As I have mentioned to Kostas, Stephan, and Robert before, I would
love to see if we could work with the H2O project [1], and it seems like
the community has added support for it in the Apache Mahout backend
binding [2].

So we might get some additional scalable ML algorithms like deep learning.

Definitely would love to help with this initiative =)

- Henry

[1] https://github.com/h2oai/h2o-dev
[2] https://issues.apache.org/jira/browse/MAHOUT-1500

On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Till Rohrmann
The idea to work with H2O sounds really interesting.

In terms of the Mahout DSL this would mean that we have to translate a
Flink dataset into H2O's basic abstraction of distributed data and vice
versa. Everything other than writing to disk with one system and reading
from there with the other is probably non-trivial and hard to realize.
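
A minimal sketch of that naive file-based exchange (plain Java; CSV as an
assumed interchange format, class and method names invented): one system
writes its row vectors as text lines, the other reads them back.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of the naive disk-based exchange between two systems. */
public class CsvExchange {
    /** Write each vector as one comma-separated line. */
    static void write(Path file, List<double[]> vectors) throws IOException {
        List<String> lines = vectors.stream()
            .map(v -> Arrays.stream(v)
                .mapToObj(Double::toString)
                .collect(Collectors.joining(",")))
            .collect(Collectors.toList());
        Files.write(file, lines);
    }

    /** Read the vectors back from the file. */
    static List<double[]> read(Path file) throws IOException {
        return Files.readAllLines(file).stream()
            .map(line -> Arrays.stream(line.split(","))
                .mapToDouble(Double::parseDouble)
                .toArray())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("exchange", ".csv");
        write(tmp, List.of(new double[]{1.5, 2.5}));
        System.out.println(read(tmp).get(0)[1]); // 2.5
    }
}
```
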
On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Stephan Ewen
Thanks Henry!

Do you know of a good source that gives pointers or examples of how to
interact with H2O?

Stephan


On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Henry Saputra
0xdata (now called H2O) is developing an integration with Spark in
a project called Sparkling Water [1].
It creates a new RDD that can connect to the H2O cluster to pass the
higher-order functions to execute in the ML flow.

The easiest way to use H2O is via the R binding [2][3], but I think we
would want to interact with H2O via the REST APIs [4].

- Henry

[1] https://github.com/h2oai/sparkling-water
[2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
[3] http://docs.h2o.ai/Ruser/rtutorial.html
[4] http://docs.h2o.ai/developuser/rest.html

On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <[hidden email]> wrote:


Re: Kicking off the Machine Learning Library

Till Rohrmann
Great Henry,

Sparkling Water looks really interesting. We would probably have to take a
similar approach to make Flink interact with H2O.

I briefly looked into the code and that's how they do it (or at least how I
understood it ;-). Correct me if I got it wrong:

They first start up a Spark cluster and then start an H2O worker from
within an RDD. Afterwards, every node where a Spark executor runs also has
an H2O worker running. Consequently, they have access on each node to the
distributed key-value store of H2O. I really like how elegantly they set
up H2O from within Spark :-) Flink should be capable of doing the same. We
only have to leave some memory for H2O.

Once this is done, they only have to translate a RDD into a DataFrame and
vice versa:

RDD => DataFrame:
This operation is realized as a mapPartition operation in Spark which is
executed as an action. Currently, tuple elements and a SchemaRDD are
supported. The mapPartition operation takes the tuples and groups fields
with the same positional index together to build H2O's vector chunks, which
are then stored in the distributed key-value store (DKV). We can do the
same.
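
The field grouping described above can be sketched in plain Java (no Spark
or H2O types; names are invented for illustration): within one partition,
the row tuples are transposed so that all values at the same positional
index end up in one columnar chunk.

```java
import java.util.List;

/** Sketch: transpose row tuples into per-column chunks, as in the RDD => DataFrame step. */
public class ChunkBuilder {
    /**
     * Transpose a partition of fixed-arity row tuples into per-column chunks,
     * i.e., chunk[c][r] = value at position c of row r.
     */
    static double[][] toColumnChunks(List<double[]> rows, int arity) {
        double[][] chunks = new double[arity][rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            double[] row = rows.get(r);
            for (int c = 0; c < arity; c++) {
                chunks[c][r] = row[c];
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<double[]> partition = List.of(new double[]{1, 10}, new double[]{2, 20});
        double[][] chunks = toColumnChunks(partition, 2);
        System.out.println(chunks[1][0]); // 10.0
    }
}
```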

DataFrame => RDD:
Here they implemented an H2ORDD which reads the corresponding vector
chunks of the partition from the DKV and constructs an instance for each
row. We should be able to convert a DataFrame to a DataSet by implementing
an H2ODKVInputFormat which basically does the same.

I think that the implementation details might not be trivial, but judging
from the size of sparkling-water (1300 lines of Scala code) it should
definitely be feasible. And, as Henry already mentioned, with the H2O
integration we get a lot of ML algorithms for free. We could also talk to
0xdata to see if they are interested in helping us with the effort.

Greets,

Till


On Wed, Jan 7, 2015 at 7:08 PM, Henry Saputra <[hidden email]>
wrote:


Re: Kicking off the Machine Learning Library

Henry Saputra
Hi Till,

Yes, that is pretty much how they do it.
The trick is to get access to the shared data between the Spark realm and
the H2O realm. In the previous generation they used Tachyon, but in the
latest stint they run H2O in the Spark executor and use the shared heap
memory to do the trick.

I am trying to hook us up with the H2O guys, let's hope it pays off =)

- Henry

On Thu, Jan 8, 2015 at 3:48 AM, Till Rohrmann <[hidden email]> wrote:

> Great Henry,
>
> Sparkling Water looks really interesting. We would probably have to take a
> similar approach to make Flink interact with H2O.
>
> I briefly looked into the code and that's how they do it (or at least how I
> understood it ;-). Correct me if I got it wrong:
>
> They first start up a Spark cluster and start from within a RDD an H2O
> worker. Afterwards, they have on each node where a Spark executor runs also
> an H2O worker running. Consequently, they have on each node access to the
> distributed key value storage of H2O. I really like how elegantly they set
> up H2O from within Spark :-) Flink should be capable of doing the same. We
> only have to leave some memory for H2O.
>
> Once this is done, they only have to translate a RDD into a DataFrame and
> vice versa:
>
> RDD => DataFrame:
> This operation is realized as a mapPartition operation of Spark which is
> executed as an action. Currently, tuple elements and a SchemaRDD are
> supported. The mapPartition operation takes the tuples and groups fields
> with the same positional index together to build H2O's vector chunks which
> are then stored in the distributed key value store (DKV). We can do the
> same.
>
> DataFrame => RDD:
> Here they implemented an H2ORDD which reads the corresponding vector chunks
> of the partition from the DKV and constructs an instance for each row. We
> should be able to convert a DataFrame to a DataSet by implementing an
> H2ODKVInputFormat which basically does the same.
>
> I think that the implementation details might not be trivial, but judging
> from the size of sparkling-water (1300 lines of Scala code) it should
> definitely be feasible. And, as Henry already mentioned, by having the H2O
> integration we get a lot of ML algorithms for free. We could also talk to
> 0xdata to see if they are interested in helping us with the effort.
>
> Greets,
>
> Till
>
>
> On Wed, Jan 7, 2015 at 7:08 PM, Henry Saputra <[hidden email]>
> wrote:
>
>> 0xdata (now called H2O) is developing an integration with Spark in
>> the project called Sparkling Water [1].
>> It creates a new RDD that can connect to an H2O cluster to pass the
>> higher order functions to execute in the ML flow.
>>
>> The easiest way to use H2O is via the R binding [2][3], but I think we
>> would want to interact with H2O via the REST APIs [4].
>>
>> - Henry
>>
>> [1] https://github.com/h2oai/sparkling-water
>> [2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
>> [3] http://docs.h2o.ai/Ruser/rtutorial.html
>> [4] http://docs.h2o.ai/developuser/rest.html
>>
>> On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <[hidden email]> wrote:
>> > Thanks Henry!
>> >
>> > Do you know of a good source that gives pointers or examples of how to
>> > interact with H2O?
>> >
>> > Stephan
>> >
>> >
>> > On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <[hidden email]>
>> wrote:
>> >
>> >> The idea to work with H2O sounds really interesting.
>> >>
>> >> In terms of the Mahout DSL this would mean that we have to translate a
>> >> Flink dataset into H2O's basic abstraction of distributed data and vice
>> >> versa. Everything other than writing to disk with one system and reading
>> >> from there with the other is probably non-trivial and hard to realize.
>> >> On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]>
>> wrote:
>> >>
>> >> > Happy new year all!
>> >> >
>> >> > Like the idea to add ML module with Flink.
>> >> >
>> >> > As I have mentioned to Kostas, Stephan, and Robert before, I would
>> >> > love to see if we could work with H20 project [1], and it seemed like
>> >> > the community has added support for it for Apache Mahout backend
>> >> > binding [2].
>> >> >
>> >> > So we might get some additional scale ML algos like deep learning.
>> >> >
>> >> > Definitely would love to help with this initiative =)
>> >> >
>> >> > - Henry
>> >> >
>> >> > [1] https://github.com/h2oai/h2o-dev
>> >> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
>> >> >
>> >> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <[hidden email]>
>> wrote:
>> >> > > Hi everyone!
>> >> > >
>> >> > > Happy new year, first of all and I hope you had a nice
>> end-of-the-year
>> >> > > season.
>> >> > >
>> >> > > I thought that it is a good time now to officially kick off the
>> >> creation
>> >> > of
>> >> > > a library of machine learning algorithms. There are a lot of
>> individual
>> >> > > artifacts and algorithms floating around which we should
>> consolidate.
>> >> > >
>> >> > > The machine-learning library in Flink would stand on two legs:
>> >> > >
>> >> > >  - A collection of efficient implementations for common problems and
>> >> > > algorithms, e.g., Regression (logistic), clustering (k-Means,
>> Canopy),
>> >> > > Matrix Factorization (ALS), ...
>> >> > >
>> >> > >  - An adapter to the linear algebra DSL in Apache Mahout.
>> >> > >
>> >> > > In the long run, it would be the goal to be able to mix and match
>> code
>> >> > from
>> >> > > both parts.
>> >> > > The linear algebra DSL is very convenient when it comes to quickly
>> >> > > composing an algorithm, or some custom pre- and post-processing
>> steps.
>> >> > > For some complex algorithms, however, a low level system specific
>> >> > > implementation is necessary to make the algorithm efficient.
>> >> > > Being able to call the tailored algorithms from the DSL would
>> combine
>> >> the
>> >> > > benefits.
>> >> > >
>> >> > >
>> >> > > As a concrete initial step, I suggest to do the following:
>> >> > >
>> >> > > 1) We create a dedicated maven sub-project for that ML library
>> >> > > (flink-lib-ml). The project gets two sub-projects, one for the
>> >> collection
>> >> > > of specialized algorithms, one for the Mahout DSL
>> >> > >
>> >> > > 2) We add the code for the existing specialized algorithms. As
>> followup
>> >> > > work, we need to consolidate data types between those algorithms, to
>> >> > ensure
>> >> > > that they can easily be combined/chained.
>> >> > >
>> >> > > 3) The code for the Flink bindings to the Mahout DSL will actually
>> >> reside
>> >> > > in the Mahout project, which we need to add as a dependency to
>> >> > flink-lib-ml.
>> >> > >
>> >> > > 4) We add some examples of Mahout DSL algorithms, and a template
>> how to
>> >> > use
>> >> > > them within Flink programs.
>> >> > >
>> >> > > 5) Create a good introductory readme.md, outlining this structure.
>> The
>> >> > > readme can also track the implemented algorithms and the ones we
>> put on
>> >> > the
>> >> > > roadmap.
>> >> > >
>> >> > >
>> >> > > Comments welcome :-)
>> >> > >
>> >> > >
>> >> > > Greetings,
>> >> > > Stephan
>> >> >
>> >>
>>

Re: Kicking off the Machine Learning Library

Dmitriy Lyubimov
In reply to this post by Till Rohrmann
In terms of the Mahout DSL, it means implementing a bunch of physical operators,
such as transpose, A'B or B'A, on large row- or column-partitioned matrices.

The Mahout optimizer takes care of simplifying algebraic expressions, such as
1 + exp(drm) => drm.apply-unary(x => 1 + exp(x)), and of tracking things like
identical partitioning of datasets where applicable.
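A toy sketch of the kind of rule-based rewrite mentioned above, fusing "1 + exp(drm)" into a single elementwise unary operator over the matrix. All class names here are invented for illustration; they are not Mahout's actual optimizer nodes:

```python
import math

# Hypothetical logical-operator tree for "1 + exp(drm)". Names are invented
# for illustration; Mahout's real optimizer nodes differ.

class Drm:                       # leaf: a reference to a distributed matrix
    pass

class EwUnary:                   # elementwise unary function over one operand
    def __init__(self, child, fn):
        self.child, self.fn = child, fn

class EwScalarPlus:              # elementwise "scalar + matrix"
    def __init__(self, scalar, child):
        self.scalar, self.child = scalar, child

def simplify(node):
    """Rewrite scalar + ewUnary(A, f) into a single fused ewUnary node."""
    if isinstance(node, EwScalarPlus) and isinstance(node.child, EwUnary):
        inner, s = node.child, node.scalar
        return EwUnary(inner.child, lambda x, f=inner.fn: s + f(x))
    return node

expr = EwScalarPlus(1, EwUnary(Drm(), math.exp))   # 1 + exp(drm)
fused = simplify(expr)                             # one pass over the matrix
assert isinstance(fused, EwUnary) and fused.fn(0.0) == 2.0  # 1 + exp(0) = 2
```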

Adding a backend turned out to be pretty trivial in the H2O case. I don't think
there's a need to bridge Flink operations to any of the existing DSL backends
to support Flink; instead, I was hoping Flink could have its own native
backend. Also keep in mind that I am not aware of any actual applications with
the Mahout + H2O backend, whereas I have built dozens on Spark in the past 2
years (albeit on a quite intensely hacked version of the bindings).

With Flink it is a bit less trivial, I guess, because the Mahout optimizer
sometimes runs quick summary jobs, such as matrix geometry detection, as part
of the optimizer action itself. And Flink, last time I checked, did not
support running multiple computational actions on a cache-pinned dataset.
That was the main difficulty last time. Perhaps no more, or perhaps there's a
clever way to work around this; not sure.

At the same time, a Flink backend for the algebraic optimizer may also be more
trivial than, e.g., the H2O backend was, since H2O insisted on marrying its own
in-core matrix API, so they created a bridge between the Mahout (former Colt)
Matrix API and one of their own. Whereas if the distributed system just takes
care of serializing Writables of Matrices and Vectors (as was done for the
Spark backend), then there's practically nothing left to do. Which, my
understanding is, Flink does well.

Now, on the part of current shortcomings: there is currently a fairly long
list of performance problems, many of which I fixed internally but not
publicly. Hopefully some or all of that will be pushed sooner rather than
later.

The biggest sticking issue is the in-core performance of the Colt APIs. Even
JVM-based matrices could do much better if they took a somewhat cost-based
approach, and perhaps integrated more sophisticated Winograd-type GEMM
approaches. Not to mention integration of things like Magma, jcuda, or even
simply netlib.org native dense BLAS (as in Breeze). The good thing is that
this does not affect backend work: once it is fixed for in-core algebra, it is
fixed everywhere, since it is hidden behind scalabindings (the in-core algebra
DSL).

These in-core and out-of-core performance issues are probably the only thing
between the code base and a good, well-rounded release. The out-of-core bad
stuff is mostly fixed internally, though.

Anyone interested in working on any of the issues I mentioned, please drop a
note on the Mahout list, to me, or to Suneel Marthi.

On the issue of optimized ML stuff vs. general algebra, here's food for
thought.

I found that 70% of algorithms are not purely algebraic in the R-like
operation-set sense.

But close to 95% could contain significant elements of simplification through
algebra, even probabilistic things that just use stuff like MVN or Wishart
sampling, or Gaussian processes. A lot easier to read and maintain. Even the
Lloyd iteration has a simple algebraic expression (it turns out).
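To make the point about the Lloyd iteration concrete: one iteration can be written almost entirely in matrix operations, with only the argmin assignment falling outside pure algebra. A sketch with NumPy standing in for the DSL (X is the n x d data matrix, C the k x d centroid matrix):

```python
import numpy as np

def lloyd_step(X, C):
    """One Lloyd (k-means) iteration in (mostly) matrix form.
    Squared distances expand algebraically:
      ||x_i - c_j||^2 = ||x_i||^2 - 2 x_i . c_j + ||c_j||^2
    """
    d2 = (X**2).sum(1)[:, None] - 2 * X @ C.T + (C**2).sum(1)[None, :]
    A = np.eye(C.shape[0])[d2.argmin(1)]      # n x k one-hot assignment matrix
    counts = A.sum(0)[:, None]                # points per cluster
    return (A.T @ X) / np.maximum(counts, 1)  # new centroids: C = A'X / A'1

X = np.array([[0.0], [0.2], [10.0], [10.2]])
C = np.array([[0.0], [10.0]])
C_new = lloyd_step(X, C)
assert np.allclose(C_new, [[0.1], [10.1]])
```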

Close to 80% could not avoid using at least some algebra.

Only 5% could get away with not doing algebra at all.

Thus, 95% of the things I ever worked on are either purely or quasi-algebraic,
even when they are probabilistic at the core. What this means is that any ML
library would benefit enormously if it acts in terms of common algebraic data
structures. It creates the opportunity for pipelines to seamlessly connect
elements of learning such as standardization, dimensionality reduction,
MDS/visualization methods, recommender methods, as well as clustering methods.
My general criticism of MLlib has been that, until recently, they did not work
on creating such common math structure standards, algebraic data in
particular, and every new method's input/output came in its own form and
shape. It still does.

So the dilemma of separating efforts into ones using and not using algebra is
a bit false. Most methods are quasi-algebraic (meaning they have at least some
need for R-like matrix manipulations). Of course there's a need for specific
distributed primitives from time to time; there's no arguing about that (like
I said, 70% of all methods cannot be purely algebraic, only about 25% are).
There's an issue of algebra physical-op performance on Spark and in-core, but
there's no reason (for me) to believe that the effort to fix it in POJO-based
implementations would be any smaller.

There was also some discussion of whether and how it is possible to create
quasi-algebraic methods such that their non-algebraic part is easy to port to
another platform (provided the algebraic part is already compatible), but
that's another topic entirely. What I am saying is that an additional benefit
might be that moving ML methodology to a platform-agnostic package in Mahout
(if such interest ever appears) would also be much easier, even for
quasi-algebraic approaches, if the solutions gravitated toward using algebra
rather than not using it.

just some food for thought.

thanks.

-D




On Sun, Jan 4, 2015 at 10:14 AM, Till Rohrmann <[hidden email]> wrote:

> The idea to work with H2O sounds really interesting.
>
> In terms of the Mahout DSL this would mean that we have to translate a
> Flink dataset into H2O's basic abstraction of distributed data and vice
> versa. Everything other than writing to disk with one system and reading
> from there with the other is probably non-trivial and hard to realize.
> On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> wrote:
>
> > Happy new year all!
> >
> > Like the idea to add ML module with Flink.
> >
> > As I have mentioned to Kostas, Stephan, and Robert before, I would
> > love to see if we could work with H20 project [1], and it seemed like
> > the community has added support for it for Apache Mahout backend
> > binding [2].
> >
> > So we might get some additional scale ML algos like deep learning.
> >
> > Definitely would love to help with this initiative =)
> >
> > - Henry
> >
> > [1] https://github.com/h2oai/h2o-dev
> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
> >
> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <[hidden email]> wrote:
> > > Hi everyone!
> > >
> > > Happy new year, first of all and I hope you had a nice end-of-the-year
> > > season.
> > >
> > > I thought that it is a good time now to officially kick off the
> creation
> > of
> > > a library of machine learning algorithms. There are a lot of individual
> > > artifacts and algorithms floating around which we should consolidate.
> > >
> > > The machine-learning library in Flink would stand on two legs:
> > >
> > >  - A collection of efficient implementations for common problems and
> > > algorithms, e.g., Regression (logistic), clustering (k-Means, Canopy),
> > > Matrix Factorization (ALS), ...
> > >
> > >  - An adapter to the linear algebra DSL in Apache Mahout.
> > >
> > > In the long run, it would be the goal to be able to mix and match code
> > from
> > > both parts.
> > > The linear algebra DSL is very convenient when it comes to quickly
> > > composing an algorithm, or some custom pre- and post-processing steps.
> > > For some complex algorithms, however, a low level system specific
> > > implementation is necessary to make the algorithm efficient.
> > > Being able to call the tailored algorithms from the DSL would combine
> the
> > > benefits.
> > >
> > >
> > > As a concrete initial step, I suggest to do the following:
> > >
> > > 1) We create a dedicated maven sub-project for that ML library
> > > (flink-lib-ml). The project gets two sub-projects, one for the
> collection
> > > of specialized algorithms, one for the Mahout DSL
> > >
> > > 2) We add the code for the existing specialized algorithms. As followup
> > > work, we need to consolidate data types between those algorithms, to
> > ensure
> > > that they can easily be combined/chained.
> > >
> > > 3) The code for the Flink bindings to the Mahout DSL will actually
> reside
> > > in the Mahout project, which we need to add as a dependency to
> > flink-lib-ml.
> > >
> > > 4) We add some examples of Mahout DSL algorithms, and a template how to
> > use
> > > them within Flink programs.
> > >
> > > 5) Create a good introductory readme.md, outlining this structure. The
> > > readme can also track the implemented algorithms and the ones we put on
> > the
> > > roadmap.
> > >
> > >
> > > Comments welcome :-)
> > >
> > >
> > > Greetings,
> > > Stephan
> >
>

Re: Kicking off the Machine Learning Library

Stephan Ewen
Thank you Dmitriy!

That was quite a bit of food for thought; I think I will need a bit more time
to digest it ;-)
Especially the part about the algebraic data structures and how they allow
you to pipeline and combine algorithms is exactly along the lines of what I
was thinking - thank you for sharing your observations there.

Concerning features of Flink: we recently merged a large piece of code that is
exactly the underpinning for running multiple programs on cached data sets.
Some of the primitives to bring counters or data sets back to the driver
already exist in pull requests; others will be added soon. So I think we will
be good there.
I would go for an integration between Flink and Mahout along the same lines as
the integration with Spark, rather than the way it is done with H2O.

Greetings,
Stephan


On Tue, Jan 13, 2015 at 11:53 PM, Dmitriy Lyubimov <[hidden email]>
wrote:

> In terms of Mahout DSL it means implementing a bunch of physical operators
> such as transpose, A'B or B'A on large row or column partitioned matrices.
>
> Mahout optimizer takes care of simplifying algebraic expressions such as 1+
> exp(drm) => drm.apply-unary(1+exp(x)) and tracking things like identical
> partitioning of datasets where applicable.
>
> Adding a back  turned out to be pretty trivial for h20 back. I don't think
> there's a need to bridge flink operations to any of existing backs of DSL
> for support of Flink. instead i was hoping flink could have its own native
> back. Also keep in mind that I am not aware of any actual applications with
> Mahout + h2o back whereas i have built dozens on Spark in past 2 years
> (albeit on a quite intensely hacked version of the bindings).
>
> With Flink it is a bit less trivial i guess because mahout optimizer
> sometimes tries to run quick summary things such as matrix geometry
> detection as a part of optimizer action itself. And Flink, last time i
> checked, did not support running multiple computational actions on a
> cache-pinned dataset. So that was the main difficulty last time. Perhaps no
> more. Or perhaps there's a clever way to work around this. not sure.
>
> At the same time Flink back for algebraic optimizer also may be more
> trivial than e.g H20 back was since H20 insisted on marrying their own
> in-core Matrix api so they created a bridge between Mahout (former Colt)
> Matrix api and one of their own. Whereas if distributed system just takes
> care of serializing Writables of Matrices and Vectors (such as it was done
> for Spark backend) then there's practically nothing to do. Which, my
> understanding is, Flink is doing good.
>
> Now, on the part of current shortcomings, currently there is a fairly long
> list of performance problems, many of which i fixed internally but not
> publicly.  But hopefully some or all of that will eventually be pushed
> sooner rather than later.
>
> The biggest sticking issue is in-core performance of the Colt apis. Even
> jvm-based matrices could do much better if they took a bit of cost-based
> approach doing so, and perhaps integrating a bit more sophisticated
> Winograd type of gemm approaches. Not to mention integration of things like
> Magma, jcuda or even simply netlib.org native dense blas stuff (as in
> Breeze). Good thing though it does not affect backend work since once it is
> fixed for in-core algebra, it is fixed elsewhere and this is hidden behind
> scalabindings (in-core algebra DSL).
>
> These in-core and out-of-core performance optimization issues are probably
> the only thing between code base and a good well rounded release. Out of
> core bad stuff is mostly fixed internally though.
>
> Anyone interested in working on any of the issues I mentioned please throw
> a note to Mahout list to me or Suneel Marthi.
>
> On the issue of optimized ML stuff vs. general algebra, here's food for
> thought.
>
> I found that 70% of algorithms are not purely algebraic in R-like operation
> set sense.
>
> But close to 95% could contain significant elements of simplification thru
> algebra. Even probabilistic things that just use stuff like MVN or Wishart
> sampling, or Gaussian processes. A lot easier to read and maintain. Even
> LLoyd iteration has a simple algebraic expression (turns out).
>
> Close to 80% could not avoid using some algebra.
>
> Only 5% could get away with not doing algebra at all.
>
> Thus, 95% of things i ever worked on are either purely or quasi algebraic.
> Even when they are probabilisitic at core. What it means is that any ML
> library would benefit enormously if it acts in terms of common algebraic
> data structures. It will create opportunity for pipelines to seamlessly
> connect elements of learning such as standardization, dimensionality
> reduction, MDS/visualization methods, recommender methods as well as
> clustering methods. My general criticism for MLLib has been that until
> recently, they did not work on creating such common math structure
> standards, algebraic data in particular, and every new method's
> input/output came in its own form an shape. Still does.
>
> So dilemma of separating efforts on ones  using and not using algebra is a
> bit false. Most methods are quasi-algebraic (meaning they have at least
> some need for R-like matrix manipulations). Of course there's need for
> specific distributed primitives from time to time, there's no arguing about
> it (like i said, 70% of all methods cannot be purely algebraic, only about
> 25% are). There's an issue of algebra physical ops performance on spark and
> in-core, but there's no reason (for me) to believe that effort to fix it in
> POJO based implementations would require any less effort.
>
> There were also some discussion whether and how it is possible to create
> quasi algebraic methods such that their non-algebraic part is easy to port
> to another platform (provided algebraic part is already compatible), but
> that's completely another topic. But what i am saying additional benefit
> might be that moving ML methodology to platform-agnostic package in Mahout
> (if such interest ever appears) would also be much easier even for
> quasi-algebraic approaches if the solutions gravitated to using algebra as
> opposed to not using it .
>
> just some food for thought.
>
> thanks.
>
> -D
>
>
>
>
> On Sun, Jan 4, 2015 at 10:14 AM, Till Rohrmann <[hidden email]>
> wrote:
>
> > The idea to work with H2O sounds really interesting.
> >
> > In terms of the Mahout DSL this would mean that we have to translate a
> > Flink dataset into H2O's basic abstraction of distributed data and vice
> > versa. Everything other than writing to disk with one system and reading
> > from there with the other is probably non-trivial and hard to realize.
> > On Jan 4, 2015 9:18 AM, "Henry Saputra" <[hidden email]> wrote:
> >
> > > Happy new year all!
> > >
> > > Like the idea to add ML module with Flink.
> > >
> > > As I have mentioned to Kostas, Stephan, and Robert before, I would
> > > love to see if we could work with H20 project [1], and it seemed like
> > > the community has added support for it for Apache Mahout backend
> > > binding [2].
> > >
> > > So we might get some additional scale ML algos like deep learning.
> > >
> > > Definitely would love to help with this initiative =)
> > >
> > > - Henry
> > >
> > > [1] https://github.com/h2oai/h2o-dev
> > > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
> > >
> > > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <[hidden email]> wrote:
> > > > Hi everyone!
> > > >
> > > > Happy new year, first of all and I hope you had a nice
> end-of-the-year
> > > > season.
> > > >
> > > > I thought that it is a good time now to officially kick off the
> > creation
> > > of
> > > > a library of machine learning algorithms. There are a lot of
> individual
> > > > artifacts and algorithms floating around which we should consolidate.
> > > >
> > > > The machine-learning library in Flink would stand on two legs:
> > > >
> > > >  - A collection of efficient implementations for common problems and
> > > > algorithms, e.g., Regression (logistic), clustering (k-Means,
> Canopy),
> > > > Matrix Factorization (ALS), ...
> > > >
> > > >  - An adapter to the linear algebra DSL in Apache Mahout.
> > > >
> > > > In the long run, it would be the goal to be able to mix and match
> code
> > > from
> > > > both parts.
> > > > The linear algebra DSL is very convenient when it comes to quickly
> > > > composing an algorithm, or some custom pre- and post-processing
> steps.
> > > > For some complex algorithms, however, a low level system specific
> > > > implementation is necessary to make the algorithm efficient.
> > > > Being able to call the tailored algorithms from the DSL would combine
> > the
> > > > benefits.
> > > >
> > > >
> > > > As a concrete initial step, I suggest to do the following:
> > > >
> > > > 1) We create a dedicated maven sub-project for that ML library
> > > > (flink-lib-ml). The project gets two sub-projects, one for the
> > collection
> > > > of specialized algorithms, one for the Mahout DSL
> > > >
> > > > 2) We add the code for the existing specialized algorithms. As
> followup
> > > > work, we need to consolidate data types between those algorithms, to
> > > ensure
> > > > that they can easily be combined/chained.
> > > >
> > > > 3) The code for the Flink bindings to the Mahout DSL will actually
> > reside
> > > > in the Mahout project, which we need to add as a dependency to
> > > flink-lib-ml.
> > > >
> > > > 4) We add some examples of Mahout DSL algorithms, and a template how
> to
> > > use
> > > > them within Flink programs.
> > > >
> > > > 5) Create a good introductory readme.md, outlining this structure.
> The
> > > > readme can also track the implemented algorithms and the ones we put
> on
> > > the
> > > > roadmap.
> > > >
> > > >
> > > > Comments welcome :-)
> > > >
> > > >
> > > > Greetings,
> > > > Stephan
> > >
> >
>

Re: Kicking off the Machine Learning Library

Ted Dunning
In reply to this post by Henry Saputra
On Thu, Jan 8, 2015 at 5:09 PM, Henry Saputra <[hidden email]>
wrote:

> I am trying to hook us up with H2O guys, lets hope it pays off =)
>

I know the CEO and CTO reasonably well and will ping them.

Re: Kicking off the Machine Learning Library

Henry Saputra
Thanks Ted, I have reached out to them as well. The more requests the
merrier I suppose =)

- Henry

On Wed, Jan 14, 2015 at 11:45 AM, Ted Dunning <[hidden email]> wrote:
> On Thu, Jan 8, 2015 at 5:09 PM, Henry Saputra <[hidden email]>
> wrote:
>
>> I am trying to hook us up with H2O guys, lets hope it pays off =)
>>
>
> I know the CEO and CTO reasonably well and will ping them.

Re: Kicking off the Machine Learning Library

mkaul
In reply to this post by Stephan Ewen
Hi All,
At TUB, we are looking at the possibility of hiring a new student programmer, and possibly another Masters student, to work on integrating Flink with the Mahout DSL, to get a declarative language that can then be used to implement other ML algorithms.
Just wanted to know if someone has already started looking into this topic and whether any efforts have been started in this direction? If yes, what were the main challenges faced? Would be interesting to know.
If not, I would also be interested to hear some possible design decisions for making this work.

Cheers,
Manu

Re: Kicking off the Machine Learning Library

till.rohrmann
Hi Manu,

I looked into it and I'm also working on the integration. When I made my first
attempt, Flink did not properly support interfaces and subclasses as data
types. This was relevant because the distributed row-wise partitioned matrices
can be indexed by int keys or string keys. By now, this should be fixed.

Another issue is retrieving intermediate results to the driver program. But we
can work around this problem by either using the RemoteCollectorOutputFormat
or the convenience methods that will be introduced by the pending PR #210 [1].

Apart from that, I hope that Flink supports all means necessary to
implement the Mahout DSL. I don't know how quickly you'll find someone to
work on the Mahout DSL but considering that I'm quite familiar with the
topic and already started working on it, it would make sense for me to
continue working on it.

Once we have support for the Mahout DSL, one could start thinking about mixing
specialized Flink algorithms with high-level linear algebra pre- and
post-processing using the DSL. Moreover, Alexander told me that you have the
IMPRO-3 course [2], in which several ML algorithms were implemented that could
be ported to the latest version of Flink. At the least, that would be a good
way to familiarize oneself with the system.

Greets,

Till

[1] https://github.com/apache/flink/pull/210
[2] https://github.com/TU-Berlin-DIMA/IMPRO-3.SS14

On Sat, Jan 31, 2015 at 6:10 PM, mkaul <[hidden email]> wrote:

> Hi All,
> At TUB, we are looking at the possibility of hiring a new student
> programmer
> and possibly another Masters student to work on integrating Flink with
> Mahout DSL, to get a declarative language that can then be used to
> implement
> other ML algorithms.
> Just wanted to know if someone has already started looking into this topic
> and if there were any efforts already started in this direction? If yes,
> what were the main challenges faced? Would be interesting to know.
> If not, I would also be interested to hear some possible design decisions
> in
> order to make this work.
>
> Cheers,
> Manu
>
>
>
> --
> View this message in context:
> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3592.html
> Sent from the Apache Flink (Incubator) Mailing List archive. mailing list
> archive at Nabble.com.
>

Re: Kicking off the Machine Learning Library

mkaul
Ok, makes sense to me! :) So we will find something else for the student to do.
Would it then be possible for you to briefly describe your strategy for making
this work? Slides with a design diagram, maybe?
It might be useful for us to see what the internals would look like at our end too.

Cheers,
Manu

Re: Kicking off the Machine Learning Library

Dmitriy Lyubimov
I may be able to help.

The official link on the Mahout talks page points to SlideShare, which mangles
the slides in a weird way, but in case it helps, here's the (hidden) link to
the pptx source of those slides:

http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx

On Mon, Feb 2, 2015 at 1:00 AM, mkaul <[hidden email]> wrote:

> Ok makes sense to me! :) So we will find something else for the student to
> do.
> Would it then be possible for you to maybe very briefly describe your
> strategy of how
> you will make this work? Like slides for a design diagram maybe?
> It might be useful for us to see how the internals would look like at our
> end too.
>
> Cheers,
> Manu
>
>
>
> --
> View this message in context:
> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3594.html
> Sent from the Apache Flink (Incubator) Mailing List archive. mailing list
> archive at Nabble.com.
>

Re: Kicking off the Machine Learning Library

Dmitriy Lyubimov
Perhaps one of the good ways to go about it is to look at the Spark module of
Mahout.

The minimum that is needed is the stuff in sparkbindings.SparkEngine and
CheckpointedDRM support.

The idea is simple. When expressions are written, they are translated into
logical operators implementing the DrmLike[K] API, which don't know anything
about a concrete engine.

When an expression checkpoint is invoked, the optimizer applies rule-based
optimizations (which may create more complex, but still logical, operators
like A'B, AB' or A'A type of things, or elementwise unary function
applications that are translated from elementwise expressions like exp(A*A)).
This stuff (the strategies building engine-specific physical lineages for
these BLAS-like building primitives) is found in the sparkbindings.blas
package. Unfortunately, the public version is kind of, mm... behind mine there
quite a bit...

These rewritten operators are passed to Engine.toPhysical, which creates a
"checkpointed" DRM. A checkpointed DRM is an encapsulation of engine-specific
physical operators (at that point, the logical part knows very little about
them). In the case of Spark, the logic goes over the created TWA and
translates it into a Spark RDD lineage (applying some cost estimates along the
way).

So the main points are to implement Engine.toPhysical and Flink's version of
CheckpointedDRM.
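A rough sketch of the contract being described, in Python for brevity; the names are modeled loosely on what is outlined above and are not Mahout's actual interfaces:

```python
# Rough sketch of the two pieces a new backend supplies: translating
# optimizer-rewritten logical operators into an engine-specific physical
# plan, and the checkpointed-DRM handle that wraps it. All names here are
# illustrative, not Mahout's actual API.

class CheckpointedDrm:
    """Engine-specific handle over a materialized distributed row matrix."""
    def __init__(self, physical_plan):
        self.plan = physical_plan

    def ncol(self):
        raise NotImplementedError   # geometry queries the optimizer may run

class Engine:
    def to_physical(self, logical_op, cache_hint):
        """Walk the rewritten logical tree and emit a physical lineage."""
        raise NotImplementedError

class FlinkEngine(Engine):
    def to_physical(self, logical_op, cache_hint):
        # A real backend would map each logical node (A'B, At, ewUnary, ...)
        # onto its own dataset transformations here, honoring cache_hint.
        plan = ("flink-plan", logical_op)    # placeholder lineage
        return CheckpointedDrm(plan)

drm = FlinkEngine().to_physical(("At", "A"), cache_hint="MEMORY_ONLY")
assert drm.plan[0] == "flink-plan"
```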

A few other concepts are broadcasting of in-core matrices and vectors (making
them available in every task), and explicit cache pinning levels (which right
now map 1:1 to the supported cache strategies in Spark, but that can be
adjusted). Algorithms provide cache-level hints as a sign that they intend to
go over the same source matrix again and again (welcome to the iterative
world...).

Finally, another concept is that the in-core stuff is backed by the
scalabindings DSL for mahout-math (i.e., pure in-core algebra), which is
largely consistent with the distributed-operations DSL. I.e., matrix
multiplication will look like A %*% B regardless of whether A or B are in-core
or distributed. The major difference for in-core is that in-place modification
operators are enabled, such as += and *=, or function assignment like
mxA := exp _ (elementwise taking of an exponent), whereas distributed
references are of course immutable (logically, anyway). I need to update the
docs for all forms of the in-core (aka 'scalabindings') DSL, but the basic
version of the documentation more or less covers the capabilities.
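A loose analogy for the in-core semantics described above, with NumPy operators standing in for the scalabindings DSL (this is not the actual scalabindings API):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.eye(2)

# "A %*% B" in the DSL corresponds to ordinary matrix multiplication:
assert np.array_equal(A @ B, A)

# In-core matrices allow in-place mutation, e.g. "+=" or a functional
# assignment like "mxA := exp _" (elementwise exponent written in place):
mxA = np.zeros((2, 2))
mxA += 1.0                 # in-place elementwise add
np.exp(mxA, out=mxA)       # in-place elementwise exp, like mxA := exp _
assert np.allclose(mxA, np.e)

# Distributed references, by contrast, are (logically) immutable: every
# operation yields a new logical operator rather than mutating its input.
```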

So... as long as mahout-math's Vector and Matrix interface serialization is
supported in the backend, that's the only persistence that is needed. No other
types are either persisted or passed around.

Mahout supports implicit Writable conversions for those in Scala (or, more
concretely, MatrixWritable and VectorWritable), and I have added native Kryo
support for those as well (not in the public version).


Glossary: DRM = distributed row matrix (a row-wise partitioned matrix
representation); TWA = tree-walking automaton.

On Tue, Feb 3, 2015 at 5:33 PM, Dmitriy Lyubimov <[hidden email]> wrote:

> I may be able to help.
>
> The official link on mahout talks page points to slide share, which
> mangles slides in a weird way, but if it helps, here's the (hidden) link to
> pptx source of those in case it helps:
>
>
> http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx
>
> On Mon, Feb 2, 2015 at 1:00 AM, mkaul <[hidden email]> wrote:
>
>> Ok makes sense to me! :) So we will find something else for the student to
>> do.
>> Would it then be possible for you to maybe very briefly describe your
>> strategy of how
>> you will make this work? Like slides for a design diagram maybe?
>> It might be useful for us to see how the internals would look like at our
>> end too.
>>
>> Cheers,
>> Manu
>>
>>
>>
>>
>
>