Hi everyone,

I am opening this thread to discuss the idea of moving the Flink ML pipeline API and library code to a separate repository within the Flink project (similar to what we did for flink-statefun <https://github.com/apache/flink-statefun>).

The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML libs <https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>. It allows MLlib developers and users to develop ML pipelines on top of Flink.

According to the discussion in this thread <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>, we plan to remove the legacy SQL planner in Flink 1.14. However, there are ML libraries that currently use Flink's DataSet API together with the Table API. Those libraries will either stop working or suffer considerable performance regressions if they bump their dependency to Flink 1.14. As a result, if we keep the ML pipeline API in Flink, those ML libraries cannot use the latest ML pipeline API/lib in Flink until Flink compensates for the missing functionality with new DataStream APIs, which is expected to happen about 1 year from now, e.g. in Flink 1.15.

In order to allow us to remove the SQL planner in Flink 1.14 while still allowing ML pipeline API/lib development in the coming year, we propose to move the Flink ML pipeline API and library code to a separate repository. More specifically, the new repo will have the following setup:
- The repo will be created at https://github.com/apache/flink-ml. This repo will depend on the core Flink repo.
- The flink-ml documentation will be linked from the existing main Flink docs, similar to https://ci.apache.org/projects/flink/flink-statefun-docs-master.
- The new repo will be under the namespace org.apache.flink.
- We can revisit whether we should move it back to the core Flink repo after the above issue is resolved and if there is a good reason to make the change.

Here is the proposed plan if we agree to make this change:
- We will create the flink-ml repo and move the Flink ML pipeline related code to this repo before the Flink 1.13 code release (3/31/2021).
- Then we update the flink-ml repo to depend on Flink 1.13 after Flink 1.13 is released.
- Then we update core Flink with new DataStream APIs (e.g. DataStream iteration) such that core Flink can support the same (or better) ML lib performance as it does now with the SQL planner. This is expected to happen in about 1 year.
- Then we update the flink-ml repo to depend on the latest Flink version once Flink has the new DataStream APIs.

Besides the main motivation described above, this change also shares similar pros/cons with creating a separate repo for flink-statefun <https://github.com/apache/flink-statefun> (see this <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html> and this <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html> for prior discussion).

Pros:
- A separate repo allows faster development for an early-stage project like the Flink ML pipeline (both API and libs).
- The Flink repo is already very large, and it is good not to bloat its size (and the number of tests).
- Fewer tests to run when we make code changes in each repo.

Cons:
- Code changes in core Flink might break tests or cause performance regressions in flink-ml, since the two are in different repos. So more effort is needed when we bump flink-ml's Flink dependency.

Overall it seems that the pros outweigh the cons. Looking forward to hearing what you think!

Regards,
Dong

Thanks for raising the discussion, Dong. +1 on moving Flink ML to a separate repository.

Machine learning is a big area which deserves a separate project so that its development can be decoupled from Flink core. At the same time, it gives us the flexibility to evolve Flink without breaking existing ML users.

Thanks,

Jiangjie (Becket) Qin

On Fri, Mar 12, 2021 at 6:16 PM Dong Lin <[hidden email]> wrote:
> Hi everyone,
>
> I am opening this thread to discuss the idea of moving Flink ML pipeline
> API and library code to a separate repository in Flink (similar to what we
> did for flink-statefun <https://github.com/apache/flink-statefun>).
> [...]

+1 for moving Flink ML to a separate repository. Thanks for driving this discussion and effort, Dong!

Cheers,
Till

On Fri, Mar 12, 2021 at 1:19 PM Becket Qin <[hidden email]> wrote:
> Thanks for raising the discussion, Dong. +1 on moving the Flink ML to a
> separate repository.
> [...]

Thank you, Becket and Till, for your comments!

Since the discussion has been open for about 1 week and no concerns have been raised about this proposal, I have started the voting thread. Please vote when you get time.

Cheers,
Dong

On Mon, Mar 15, 2021 at 6:00 PM Till Rohrmann <[hidden email]> wrote:
> +1 for moving Flink ML to a separate repository. Thanks for driving this
> discussion and effort Dong!
>
> Cheers,
> Till