ML Pipeline is an idea introduced by Scikit-learn <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this idea and made their own implementations [Spark ML Pipeline <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].

NOTE: although I am using the term "ML", ML Pipeline applies to both ML and DL pipelines.

ML Pipeline is quite helpful for model composition (e.g. using models for feature engineering). It also enables logic reuse across the training and inference phases (via pipeline persistence and loading), which is essential for AI engineering. ML Pipeline can also be a good base for a Flink-based AI engineering platform if we give it good tooling support (e.g. human-readable metadata).

As the Table API will be the unified high-level API for both stream and batch processing, I want to initiate the design discussion of a new Table-based Flink ML Pipeline.

I drafted a design document [1] for this discussion. The design tries to create a new ML Pipeline implementation so that concrete ML/DL algorithms can fit this new API and achieve interoperability.

Any feedback is highly appreciated.

Thanks
Weihua

[1] https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing
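For readers who have not used a scikit-learn style pipeline, the two properties mentioned above (model composition, and reuse of the same logic for training and inference through persistence) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the API proposed in the design doc; `MeanCenterer`, `Scaler`, `Pipeline`, `to_json` and `from_json` are hypothetical names chosen for this sketch, and JSON is only assumed here as one possible human-readable metadata format.

```python
import json


class MeanCenterer:
    """Hypothetical estimator: learns the mean in fit(), subtracts it in transform()."""

    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]

    def params(self):
        return {"mean": self.mean}


class Scaler:
    """Hypothetical stateless transformer: multiplies every value by a fixed factor."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, values):
        return self  # nothing to learn

    def transform(self, values):
        return [v * self.factor for v in values]

    def params(self):
        return {"factor": self.factor}


# Registry used to rebuild stages from persisted metadata (an assumption of this sketch).
STAGE_TYPES = {"MeanCenterer": MeanCenterer, "Scaler": Scaler}


class Pipeline:
    """Chains stages; each stage is fitted on the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

    def to_json(self):
        # Persist stage types and learned parameters as readable text.
        return json.dumps(
            [{"type": type(s).__name__, "params": s.params()} for s in self.stages],
            indent=2,
        )

    @classmethod
    def from_json(cls, text):
        return cls([STAGE_TYPES[d["type"]](**d["params"]) for d in json.loads(text)])


# Training side: fit once and persist.
trained = Pipeline([MeanCenterer(), Scaler(factor=2.0)]).fit([1.0, 2.0, 3.0])
saved = trained.to_json()

# Inference side: load the persisted pipeline and apply the identical logic.
restored = Pipeline.from_json(saved)
result = restored.transform([4.0])
```

The persisted text lists each stage type with its parameters, which is the kind of metadata a tool could inspect without running the pipeline.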
Hi Weihua,
Thanks for bringing up this discussion!

I quickly read the Google doc, and I fully agree that ML can be well supported on the Table API (at some stage in the future). In fact, Xiaowei and I have already started a discussion on enhancing the Table API. In the first phase, we will add support for map/flatmap/agg/flatagg in the Table API. So I am very happy to be involved in this discussion and will leave comments in the Google doc later.

I think it would be great if you could add a phased implementation plan to the Google doc. What do you think?

Thanks,
Jincheng

(In reply to Weihua Jiang <[hidden email]>, Tue, Nov 20, 2018 at 8:53 PM)
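As a rough illustration of why row-level operators such as map/flatmap/agg/flatagg matter for ML code, the sketch below models them over plain Python lists of dicts. The semantics are an assumption made for this example (one output row per input row, zero or more rows per input row, one row per group, zero or more rows per group); the function names `table_map`, `table_flat_map`, `table_agg` and `table_flat_agg` are hypothetical and are not Flink Table API calls.

```python
from collections import defaultdict


def table_map(rows, fn):
    """map: exactly one output row per input row."""
    return [fn(row) for row in rows]


def table_flat_map(rows, fn):
    """flatMap: zero or more output rows per input row."""
    return [out for row in rows for out in fn(row)]


def _group_by(rows, key):
    # Groups keep the order in which their keys first appear.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def table_agg(rows, key, fn):
    """agg: exactly one output row per group."""
    return [fn(k, group) for k, group in _group_by(rows, key).items()]


def table_flat_agg(rows, key, fn):
    """flatAgg: zero or more output rows per group."""
    return [out for k, group in _group_by(rows, key).items() for out in fn(k, group)]


rows = [
    {"label": "a", "text": "x y", "score": 1.0},
    {"label": "a", "text": "z", "score": 3.0},
    {"label": "b", "text": "", "score": 5.0},
]

# Feature scaling as a map.
doubled = table_map(rows, lambda r: {"label": r["label"], "score": r["score"] * 2})

# Tokenization as a flatMap (the empty text yields no rows).
tokens = table_flat_map(
    rows, lambda r: [{"label": r["label"], "token": t} for t in r["text"].split()]
)

# Per-label mean as an agg.
means = table_agg(
    rows, "label", lambda k, g: {"label": k, "mean": sum(r["score"] for r in g) / len(g)}
)

# Top-1 row per label as a flatAgg.
top1 = table_flat_agg(rows, "label", lambda k, g: sorted(g, key=lambda r: -r["score"])[:1])
```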
Hi Weihua,
Thanks for the well-written design doc!

The ML pipeline abstraction is pretty handy for AI engineers. As Jincheng mentioned, there is an ongoing effort to enhance the Table API for ML. But it would still be helpful to understand what is missing in the Table API to fully support the ML pipeline. Given that there are quite a few proposed APIs and related items to discuss, do you think having some examples of how the pipeline works would facilitate the discussion?

Again, thanks for kicking off the discussion.

Jiangjie (Becket) Qin

(In reply to jincheng sun <[hidden email]>, Tue, Nov 20, 2018 at 9:17 PM)
In reply to this post by Weihua Jiang
Hi Weihua,
Thanks for the proposal. I have quickly read through it. It looks great.

A quick question: are you also considering moving the ML lib (the implementations of Estimator/Predictor/Transformer) on top of the Table API? I would be very happy if this were included in the scope. It is not easy and needs a lot of new Table API functionality, which is exactly one of the reasons that motivated us to "enhance the Table API", as discussed in other threads.

The entire scope of your proposal is so big that I would suggest we complete it step by step. I think you have mainly proposed 3 things:
1. Redesign the ML pipeline based on the Table API
2. Take the streaming ML pipeline into account
3. Enhance the ML pipeline with some new features for a better user experience

Maybe we should first replace the ML pipeline interface with a Table API based one, then move on to #2 and #3. In the meantime, we can also explore the possibility of moving the ML lib on top of the Table API. What do you think?

BTW, we should not break the current ML pipeline interface (which is based on DataSet) when we introduce the new one. Let us keep it for a while until the new interface is completed and well adopted. Then we can deprecate the old one.

I will take a more thorough look at your proposal and leave comments directly on the doc.

Regards,
Shaoxuan

--
*Rome was not built in one day*
In reply to this post by Weihua Jiang
Hi Weihua,
Thanks for the exciting proposal!

I have quickly read through it, and I really appreciate the idea of providing an ML Pipeline API similar to the commonly used scikit-learn library, since it greatly reduces the learning cost for AI engineers moving to the Flink platform.

Currently we are also working on a related issue, namely enhancing the stream iteration of Flink to support both SGD and online learning; it also supports batch training as a special case. We have a rough design and will start a new discussion in the next few days. I think the enhanced stream iteration will help to implement Estimators directly in Flink, and it may help to simplify the online learning pipeline by eliminating the requirement to load models from external file systems.

I will read the design doc more carefully. Thanks again for sharing the design doc!

Yours sincerely
Yun Gao

------------------------------------------------------------------
From: Weihua Jiang <[hidden email]>
Sent: Tuesday, November 20, 2018, 20:53
To: dev <[hidden email]>
Subject: [DISCUSS] Embracing Table API in Flink ML
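The online learning idea described above (a model updated record by record and used for prediction from memory, with batch training as the special case of replaying a finite data set) might look roughly like the sketch below. It is a toy one-dimensional linear model trained with plain SGD, not Flink's stream iteration; `OnlineLinearModel` and `run_stream` are hypothetical names, and the learning rate and data are made up for illustration.

```python
class OnlineLinearModel:
    """Toy model y = w * x + b trained with per-record SGD on squared error."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # Gradient step for the loss 0.5 * (prediction - y) ** 2.
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err


def run_stream(model, stream):
    """Predict with the in-memory model first, then learn from the same record."""
    for x, y in stream:
        yield model.predict(x)
        model.update(x, y)


# Noise-free samples of y = 2x + 1 (illustrative data only).
samples = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

model = OnlineLinearModel(lr=0.1)

# Batch training as a special case: replay the finite data set as a stream.
for _ in range(200):
    for _prediction in run_stream(model, samples):
        pass
```

Because the model lives inside the loop that consumes the stream, predictions always use the latest weights and no model file needs to be reloaded between updates.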
In reply to this post by jincheng sun
Hi Jincheng,
Thanks a lot for the warm feedback. I've already read your Table API enhancement Google doc. Those enhancements are essential for implementing any ML/DL algorithm on the Table API. Our two designs complement each other perfectly. :)

I will add a section on the phased implementation plan to my Google doc.

Thanks
Weihua
In reply to this post by Becket Qin
Hi Becket,
Thanks a lot for the Table API enhancement design doc. I am implementing some simple ML algorithms using this new ML pipeline and will let you know if any Table API enhancement is needed.

Thanks
Weihua

(In reply to Becket Qin <[hidden email]>, Tue, Nov 20, 2018 at 10:43 PM)
In reply to this post by Shaoxuan Wang
Hi Shaoxuan,
You are perfectly right. What I want to achieve is a combination of all 3 of your points. Let me rephrase them here:
1. Define a Table-based ML Pipeline interface with the same functionality as the current DataSet-based implementation.
2. Support new features like online learning and streaming inference.
3. Provide a base for Flink AI tooling (i.e. an AI platform) and ML/DL SQL support.

This will definitely be done step by step and will need a lot of help from the Table API enhancements. I am currently working on #1.

Thanks
Weihua

(In reply to Shaoxuan Wang <[hidden email]>, Tue, Nov 20, 2018 at 11:11 PM)
In reply to this post by Yun Gao
Hi Yun,
Can't wait to see your design.

Thanks
Weihua

(In reply to Yun Gao <[hidden email]>, Wed, Nov 21, 2018 at 12:43 AM)
Hi Yun,
Very excited to see Flink ML moving forward! Your document touches on many points. I couldn't agree more about the value that a (unified) Table API could bring to the Flink ecosystem for running ML workloads.

Most ML pipelines we have observed start from single-box Python scripts or ad hoc tools that researchers run to train models on a powerful machine. When that proves successful, they need to hook up with a data warehouse and extract features (this is where SQL kicks in). In the training phase, the landscape is very segmented. Small to medium-sized models can be trained on the JVM, while large/deep models need optimized operators and a per-iteration random data shuffle (SGD-based DL), which often ends up in JNI/C++/CUDA and in specialized task scheduling (gang scheduling instead of hacking around map-reduce).

Hope it makes sense. BTW, XGBoost (the most popular framework in ML competitions) has very primitive Flink support; it might be worth checking out.
https://github.com/dmlc/xgboost

Chen

(In reply to Weihua Jiang <[hidden email]>, Tue, Nov 20, 2018 at 6:13 PM)
It has been a while, and I think we can move the discussion forward to JIRA.
I will try to split the design into smaller pieces to make it easier to understand. I have already implemented an initial version and ported some flink.ml algorithms to this new API, so we will have a better base for the design discussion.

Thanks
Weihua

(In reply to Chen Qin <[hidden email]>, Wed, Nov 21, 2018 at 1:36 PM)