[DISCUSS] Move Flink ML pipeline API and library code to a separate repository named flink-ml

Dong Lin
Hi everyone,

I am opening this thread to discuss the idea of moving Flink ML pipeline
API and library code to a separate repository in Flink (similar to what we
did for flink-statefun <https://github.com/apache/flink-statefun>).

The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML
libs
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>.
It allows MLlib developers and users to develop ML pipelines on top of
Flink.

According to the discussion in this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>
thread, we plan to remove the legacy SQL planner in Flink 1.14. However,
there exist ML libraries which currently use Flink's DataSet API together
with the Table API. Those libraries will either stop working or suffer
considerable performance regression if they bump their dependency to Flink
1.14. As a result, if we keep the ML pipeline API in Flink, then those ML
libraries cannot use the latest ML pipeline API/lib in Flink until Flink
compensates for the missing functionality with new DataStream APIs, which
is expected to happen about 1 year from now, e.g. in Flink 1.15.

In order to allow us to remove the legacy SQL planner in Flink 1.14 while
still allowing ML pipeline API/lib development in the coming year, we
propose to move the Flink ML pipeline API and library code to a separate
repository. More specifically, the new repo will have the following setup:
- The repo will be created at https://github.com/apache/flink-ml. This repo
will depend on the core Flink repo.
- The flink-ml documentation will be linked from the existing main Flink
docs similar to
https://ci.apache.org/projects/flink/flink-statefun-docs-master.
- The new repo will be under namespace org.apache.flink.
- We can revisit whether to move it back into the core Flink repo after the
above issue is resolved, if there is good reason to make the change.

Here is the proposed plan if we agree to make this change:
- We will create the flink-ml repo and move the Flink ML pipeline related
code to this repo before the Flink 1.13 code release (3/31/2021).
- Then we update the flink-ml repo to depend on Flink 1.13 after Flink 1.13
is released.
- Then we update core Flink with new DataStream APIs (e.g. DataStream
iteration) such that core Flink can support the same (or better) ML lib
performance as it does now with the legacy SQL planner. This is expected to
happen in about 1 year.
- Then we update the flink-ml repo to depend on the latest Flink version
once Flink has the new DataStream APIs.

Besides the main motivation described above, this change also shares
similar pros/cons of creating a separate repo for flink-statefun
<https://github.com/apache/flink-statefun> (see this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html>
and this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html>
for prior discussion).

Pros:
- A separate repo allows faster development for an early-stage project
like the Flink ML pipeline (both API and libs).
- The Flink repo is already very large, and it is good not to bloat its
size (and the number of tests) further.
- Fewer tests to run when we make code changes in each repo.

Cons:
- Code changes in core Flink might break tests or cause performance
regressions in flink-ml, since the two are in different repos. So more
effort is needed when we bump flink-ml's Flink dependency.

Overall it seems that the pros outweigh the cons. Looking forward to
hearing what you think!


Regards,
Dong

Re: [DISCUSS] Move Flink ML pipeline API and library code to a separate repository named flink-ml

Becket Qin
Thanks for raising the discussion, Dong. +1 on moving Flink ML to a
separate repository.

Machine learning is a big area which deserves a separate project so the
development can be decoupled from Flink core. In the meantime, it gives us
the flexibility of evolving Flink without breaking the existing ML users.

Thanks,

Jiangjie (Becket) Qin

On Fri, Mar 12, 2021 at 6:16 PM Dong Lin <[hidden email]> wrote:

Re: [DISCUSS] Move Flink ML pipeline API and library code to a separate repository named flink-ml

Till Rohrmann
+1 for moving Flink ML to a separate repository. Thanks for driving this
discussion and effort, Dong!

Cheers,
Till

On Fri, Mar 12, 2021 at 1:19 PM Becket Qin <[hidden email]> wrote:

Re: [DISCUSS] Move Flink ML pipeline API and library code to a separate repository named flink-ml

Dong Lin
Thank you Becket and Till for your comments!

Since the discussion has been open for about 1 week and no concerns have
been raised about this proposal, I have started the voting thread. Please
vote when you get time.

Cheers,
Dong

On Mon, Mar 15, 2021 at 6:00 PM Till Rohrmann <[hidden email]> wrote:
