Hello all,
Following our previous discussion, started by Stavros, we decided to start a planning document [1] to figure out possible next steps for ML on Flink. Our concerns were mainly ensuring active development while satisfying the needs of the community.

We have listed a number of proposals for future work in the document. In short they are:

- Offline learning with the batch API
- Online learning
- Offline learning with the streaming API
- Low-latency prediction serving

I saw there are a number of people willing to work on ML for Flink, but the truth is that we cannot cover all of these suggestions without fragmenting the development too much.

So my recommendation is to pick two of these options, create design documents, and build prototypes for each library. We can then assess their viability and, together with the community, decide whether we should try to include one (or both) of them in the main Flink distribution.

I therefore invite people to express their opinion about which task they would be willing to contribute to, and hopefully we can settle on two of these options.

Once that is done we can decide how we do the actual work. Since this is highly experimental, I would suggest we work on repositories where we have complete control. For that purpose I have created an organization [2] on GitHub which we can use to create repositories and teams that work on them in an organized manner. Once enough work has accumulated we can start discussing contributing the code to the main distribution.

Regards,
Theodore

[1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
[2] https://github.com/flinkml
Thank you, Theodore.
In short, I vote for:
1) Online learning
2) Low-latency prediction serving -> Offline learning with the batch API

In detail:

1) If streaming is Flink's strong side, let's use it and try to support some online learning or lightweight in-memory learning algorithms, and build a pipeline for them.

2) I think Flink should be part of the production ecosystem, and since production deployments now require ML support, deployment of multiple models, and so on, we should serve that need. In my opinion we shouldn't compete with projects like PredictionIO, but rather serve them as an execution core. That implies a lot, though:

a. Offline training should be supported, because most ML algorithms are designed for offline training.
b. The model lifecycle should be supported: ETL + transformation + training + scoring + monitoring of model quality in production.

I understand that the batch world is full of competitors, but that doesn't mean batch should be ignored. Separate streaming and batch applications cause additional deployment and operational overhead, which people typically try to avoid. So in my opinion we should draw the community's attention to this problem.

--
Yours faithfully,
Kate Eri.
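As a quick illustration of the offline part of that lifecycle: FlinkML's existing batch API already expresses the transformation + training + scoring chain as a pipeline. A minimal sketch in Scala, where the tiny inline dataset and the choice of StandardScaler plus MultipleLinearRegression are purely illustrative:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler
import org.apache.flink.ml.regression.MultipleLinearRegression

object BatchPipelineSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy training set: label first, then the feature vector.
    val training = env.fromElements(
      LabeledVector(1.0, DenseVector(0.2, 3.1)),
      LabeledVector(0.0, DenseVector(1.4, 0.7)),
      LabeledVector(1.0, DenseVector(0.1, 2.8)))

    // Transformation + training chained into one pipeline.
    val pipeline = StandardScaler().chainPredictor(MultipleLinearRegression())
    pipeline.fit(training)

    // Scoring: predict values for unlabeled feature vectors.
    val test = env.fromElements(DenseVector(0.5, 1.2), DenseVector(1.0, 0.3))
    val predictions = pipeline.predict(test)
    predictions.print()
  }
}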
Great points to start:

- Online learning
- Offline learning with the streaming API

Thanks + have a great weekend.
Hi All,
I'd like to start working on:

- Offline learning with the Streaming API
- Online learning

I also think that using a new organisation on GitHub, as Theodore proposed, to keep an initial independence and speed up the prototyping and development phases is really interesting.

I totally agree with Katherin that we need offline learning, but my opinion is that it will be more straightforward to fix the streaming issues than the batch issues, because we will have more support on that from the Flink community.

Thanks and have a nice weekend,
Roberto

--
Roberto Bentivoglio
e. [hidden email]
radicalbit.io
Hey all,
Sorry for the slightly late response.

I'd like to work on:

- Offline learning with the Streaming API
- Low-latency prediction serving

I would drop the batch API ML because of past experience with lack of support, and online learning because of the lack of use cases.

I completely agree with Kate that offline learning should be supported, but given Flink's resources I prefer using the streaming API, as Roberto suggested. Also, the full model lifecycle (or end-to-end ML) could be more easily supported in one system (one API). Connecting Flink Batch with Flink Streaming is currently cumbersome (although side inputs [1] might help). In my opinion, a crucial part of end-to-end ML is low-latency prediction.

As another direction, we could integrate the Flink Streaming API with other projects (such as PredictionIO). However, I believe it's better to first evaluate the capabilities and drawbacks of the streaming API with a prototype that uses Flink Streaming for some ML task. Otherwise we could run into critical issues, just as the SystemML integration did with e.g. caching. Such issues make the integration of the Batch API with other ML projects practically infeasible.

I've already been experimenting with offline learning on the Streaming API. Hopefully I can share some initial performance results on matrix factorization next week. Naturally, I've run into issues. For example, I could only mark the end of the input with some hacks, because this is not needed in a streaming job that consumes input forever. AFAIK this would be resolved by side inputs [1].

@Theodore:
+1 for doing the prototype project(s) separately from the main Flink repository, although I would strongly suggest following the Flink development guidelines as closely as possible. As another note, there is already a GitHub organization for Flink-related projects [2], but it seems it has not been used much.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2] https://github.com/project-flink
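The end-of-input hack Gábor mentions can be approximated today by carrying an explicit sentinel in the element type. A rough sketch with the Scala DataStream API, where the Rating type, the inline data, and the constant key (used only to get access to keyed state) are all made-up assumptions:

import org.apache.flink.streaming.api.scala._

case class Rating(user: Int, item: Int, value: Double)

object EndOfInputSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Bounded "offline" input wrapped in Option; None is the end-of-input sentinel.
    val ratings: DataStream[Option[Rating]] = env.fromElements(
      Some(Rating(1, 10, 4.0)),
      Some(Rating(2, 10, 3.5)),
      None)

    // Accumulate state until the sentinel arrives, then emit the final result
    // (here just a count of the training records that were seen).
    val counts: DataStream[Long] = ratings
      .keyBy(_ => 0) // single logical partition, only to get keyed state
      .flatMapWithState[Long, Long] {
        case (Some(_), count) => (Seq.empty[Long], Some(count.getOrElse(0L) + 1))
        case (None, count)    => (Seq(count.getOrElse(0L)), None)
      }

    counts.print()
    env.execute("end-of-input sentinel sketch")
  }
}

Side inputs (FLIP-17) would make such sentinels unnecessary by letting the bounded training data be consumed as an input that is known to be finished.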
Thanks Theo for steering Flink's ML effort here :-)
I'd vote to concentrate on

- Online learning
- Low-latency prediction serving

for the following reasons:

Online learning:
I agree that this topic is quite research-oriented and it's not even clear whether it will ever be of any interest outside of academia. However, the same was true for other things as well. Adoption in industry is usually slow, and sometimes one has to dare to explore something new.

Low-latency prediction serving:
Flink with its streaming engine seems to be the natural fit for such a task, and it is rather low-hanging fruit. Furthermore, I think users would directly benefit from such a feature.

Offline learning with the Streaming API:
I'm not fully convinced yet that the streaming API is powerful enough (mainly due to the lack of proper iteration support and spilling capabilities) to support a wide range of offline ML algorithms. And even then, it will only support rather small problem sizes, because streaming cannot gracefully spill data to disk. There are still too many open issues with the streaming API for it to be applicable to this use case, imo.

Offline learning with the batch API:
For offline learning the batch API is imo still better suited than the streaming API. I think it will only make sense to port the algorithms to the streaming API once batch and streaming are properly unified. The highly efficient implementations for joining and sorting data that can go out of memory alone are important for supporting large ML problems. In general, I think it might make sense to offer a basic set of ML primitives. However, even offering this basic set is a considerable amount of work.

Concerning the independent organization for the development: I think it would be great if the development could still happen under the umbrella of Flink's ML library, because otherwise we might risk some kind of fragmentation. In order for people to collaborate, one can also open PRs against a branch of a forked repo.

I'm currently working on wrapping up the project re-organization discussion. The general position was that it would be best to have an incremental build and keep everything in the same repo. If this is not possible, then we want to look into creating a sub-repository for the libraries (maybe other components will follow later). I hope to make some progress on this front in the next couple of days/weeks. I'll keep you updated.

As a general remark on the discussions in the Google doc: I think it would be great if we could at least mirror the discussions happening in the Google doc back to the mailing list, or ideally conduct the discussions directly on the mailing list. That's at least what the ASF encourages.

Cheers,
Till
Thanks Theodore,
I'd vote for

- Offline learning with the Streaming API
- Low-latency prediction serving

Some comments...

Online learning:
Good to have, but my feeling is that it is not a strong requirement (if a requirement at all) across the industry right now. It may become hot in the future.

Offline learning with the Streaming API:
Although it requires engine changes or extensions (feasibility is an issue here), my understanding is that it reflects common industry practice (train every few minutes at most), and it would be great if that was supported out of the box with a friendly API for the developer.

Offline learning with the batch API:
I would love to have a limited set of algorithms, so that someone does not have to leave Flink for another tool if they want to work on some initial dataset. In other words, let's reach a mature state with some basic algorithms merged. There is a lot of work pending; let's not waste it.

Low-latency prediction serving:
Model serving is a long-standing problem; we could definitely help with that.

Regards,
Stavros
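As a rough sketch of what "train every few minutes" could look like with today's Scala DataStream API, the job below re-fits a toy least-squares model over a tumbling five-minute window. The Sample and Model types, the socket source, and the non-keyed (parallelism 1) window are illustrative assumptions, not a proposed design:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Sample(x: Double, y: Double)
case class Model(slope: Double, intercept: Double)

// Fit a simple least-squares line over all samples that fell into one window.
class FitLine extends AllWindowFunction[Sample, Model, TimeWindow] {
  override def apply(window: TimeWindow, input: Iterable[Sample], out: Collector[Model]): Unit = {
    val n = input.size.toDouble
    val sx = input.map(_.x).sum
    val sy = input.map(_.y).sum
    val sxx = input.map(s => s.x * s.x).sum
    val sxy = input.map(s => s.x * s.y).sum
    val denom = n * sxx - sx * sx
    if (denom != 0.0) {
      val slope = (n * sxy - sx * sy) / denom
      out.collect(Model(slope, (sy - slope * sx) / n))
    }
  }
}

object PeriodicRetrainingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Labeled samples arriving as "x,y" lines, e.g. from a socket or Kafka.
    val samples: DataStream[Sample] = env.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(x, y) = line.split(",").map(_.trim.toDouble)
        Sample(x, y)
      }

    // Re-fit the model every five minutes; parallelism 1 is fine for a small global model.
    val models: DataStream[Model] = samples
      .timeWindowAll(Time.minutes(5))
      .apply(new FitLine)

    models.print()
    env.execute("periodic re-training sketch")
  }
}

Anything beyond such a toy model would need partitioned training or proper iteration support, which is exactly the open question raised for the offline-on-streaming option.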
Hello all,
## Executive summary:

- Offline-on-streaming is the most popular option, then online learning and model serving.
- We need shepherds to lead the development/coordination of each task.
- I can shepherd online learning; we need shepherds for the other two.

So, from the people sharing their opinion, it seems most would like to try out offline learning with the streaming API. I also think this is an interesting option, but probably the riskiest of the bunch. After that, online learning and model serving seem to have around the same amount of interest.

Given that, and the discussions we had in the Gdoc, here's what I recommend as next actions:

- *Offline on streaming:* Start by creating a design document, with an MVP specification of what we imagine such a library to look like and what we think should be possible to do. It should state clear goals and limitations; scoping the amount of work is more important at this point than specific engineering choices.

- *Online learning:* If someone would like instead to work on online learning, I can help out there. I have one student working on such a library right now, and I'm sure people at TU Berlin (Felix?) have similar efforts; ideally we would like to communicate with them. Since this is a much more explored space, we could jump straight into a technical design document (with scoping included, of course), discussing abstractions and comparing with existing frameworks.

- *Model serving:* There will be a presentation at Flink Forward SF on such a framework (Flink Tensorflow) by Eron Wright [1]. My recommendation would be to get in touch with the author and see if he would be interested in working together to generalize and extend the framework. For more research and resources on the topic see [2] or this presentation [3], particularly the Clipper system.

In order to have some activity on each project, I recommend we require a minimum of 2 people willing to contribute to each project. If we "assign" people by top choice, that should be possible, although my original plan was to only work on two of the above to avoid fragmentation. But given that online learning will have work being done by students as well, it should be possible to keep it running.

Next, *I would like us to assign a "shepherd" for each of these tasks.* If you are willing to coordinate the development of one of these options, let us know here, and you can take up the task of coordinating with the rest of the people working on it.

I would like to volunteer to coordinate the *online learning* effort, since I'm already supervising a student working on this and I'm currently developing such algorithms. I plan to contribute to the offline-on-streaming task as well, but not coordinate it.

So if someone would like to take the lead on offline on streaming or model serving, let us know and we can take it from there.

Regards,
Theodore

[1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
[2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html
[3] https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf
Thanks Theo. Just wrote some comments on the other thread, but it looks
like you got it covered already. Let me re-post what I think may help as input: *Concerning Model Evaluation / Serving * - My personal take is that the "model evaluation" over streams will be happening in any case - there is genuine interest in that and various users have built that themselves already. I would be a cool way to do something that has a very high chance of being productionized by users soon. - The model evaluation as one step of a streaming pipeline (classifying events), followed by CEP (pattern detection) or anomaly detection is a valuable use case on top of what pure model serving systems usually do. - A question I have not yet a good intuition on is whether the "model evaluation" and the training part are so different (one a good abstraction for model evaluation has been built) that there is little cross coordination needed, or whether there is potential in integrating them. *Thoughts on the ML training library (DataSet API or DataStream API)* - I honestly don't quite understand what the big difference will be in targeting the batch or streaming API. You can use the DataSet API in a quite low-level fashion (missing async iterations). - There seems especially now to be a big trend towards deep learning (is it just temporary or will this be the future?) and in that space, little works without GPU acceleration. - It is always easier to do something new than to be the n-th version of something existing (sorry for the generic true-ism). The later admittedly gives the "all in one integrated framework" advantage (which can be a very strong argument indeed), but the former attracts completely new communities and can often make more impact with less effort. - The "new" is not required to be "online learning", where Theo has described some concerns well. It can also be traditional ML re-imagined for "continuous applications", as "continuous / incremental re-training" or so. Even on the "model evaluation side", there is a lot of interesting stuff as mentioned already, like ensembles, multi-armed bandits, ... - It may be well worth tapping into the work of an existing library (like tensorflow) for an easy fix to some hard problems (pre-existing hardware integration, pre-existing optimized linear algebra solvers, etc) and think about how such use cases would look like in the context of typical Flink applications. *A bit of engine background information that may help in the planning:* - The DataStream API will in the future also support bounded data computations explicitly (I say this not as a fact, but as a strong believer that this is the right direction). - Batch runtime execution has seen less focus recently, but seems to get a bit more community focus, because some organizations that contribute a lot want to use the batch side as well. For example the effort on file-grained recovery will strengthen batch a lot already. Stephan On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis < [hidden email]> wrote: > Hello all, > > ## Executive summary: > > - Offline-on-streaming most popular, then online and model serving. > - Need shepherds to lead development/coordination of each task. > - I can shepherd online learning, need shepherds for the other two. > > > so from the people sharing their opinion it seems most people would like to > try out offline learning with the streaming API. > I also think this is an interesting option, but probably the most risky of > the bunch. > > After that online learning and model serving seem to have around the same > amount of interest. 
@Theodore: thanks for bringing the discussion together.
I think it's reasonable to go in all three directions, just as you suggested. I agree we should concentrate our efforts, but we can do a low-effort evaluation of all three.

I would like to volunteer for shepherding *Offline learning on Streaming*. I am already working on related issues, and I believe I have a fairly good overview of the streaming API and its limitations. However, we need to find a good use case to aim for, and I don't have one in mind yet, so please help with that if you can. I absolutely agree with Theodore that setting the scope is the most important thing here.

We should find a simple use case for incremental learning. As Flink is really strong in low-latency data processing, the best would be a use case where rapidly adapting the model to new data provides value. We should also consider low-latency serving for such a use case, as there is not much use in fast model updates if we cannot serve the predictions just as fast. Of course, it's okay to simply implement offline algorithms, but showcasing would be easier if we could add prediction serving for the model in the same system.

What should be the way of working here? We could have sketches for the separate projects in Gdocs, and then the shepherds could make a proposal out of them. Would that be feasible?

@Stephan:
Thanks for all your insights. I also like the approach of aiming for new and somewhat unexplored areas. I guess we can do that with both the serving/evaluation and the incremental training (the latter should be in scope of the offline ML on streaming). I agree GPU acceleration is an important issue, however it might be out of scope for the prototypes of these new ML directions. What do you think?

Regarding your comments on the other thread, I'm really glad the PMC is working towards growing the community. That is crucial for getting anything merged into Flink while keeping up the code quality. However, for the prototypes I'd prefer Theodore's suggestion to do them in a separate repository, to make initial development faster. After the prototypes have proven their usefulness we could merge them and continue working on them inside the Flink repository. But we can decide that later.

Cheers,
Gabor
 |
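As a rough illustration of the incremental-learning direction (all names, the linear model and the learning rate are assumptions, not an agreed design), a keyed streaming operator could refine a per-key model with one SGD step per sample and emit predictions from the same operator, hinting at training and serving living in one job:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Sample(modelKey: String, features: Array[Double], label: Double)
case class Prediction(modelKey: String, predicted: Double, label: Double)

// Keeps a linear model per key in managed state and updates it with one SGD
// step for every sample, emitting the prediction made before the update.
class OnlineSgd(learningRate: Double) extends RichFlatMapFunction[Sample, Prediction] {
  private var weights: ValueState[Array[Double]] = _

  override def open(parameters: Configuration): Unit = {
    weights = getRuntimeContext.getState(
      new ValueStateDescriptor[Array[Double]]("weights", classOf[Array[Double]]))
  }

  override def flatMap(s: Sample, out: Collector[Prediction]): Unit = {
    val w = Option(weights.value()).getOrElse(Array.fill(s.features.length)(0.0))
    val predicted = s.features.zip(w).map { case (x, wi) => x * wi }.sum
    out.collect(Prediction(s.modelKey, predicted, s.label))

    // One SGD step on the squared error, so the model adapts as data arrives.
    val error = predicted - s.label
    weights.update(w.zip(s.features).map { case (wi, x) => wi - learningRate * error * x })
  }
}

// Usage sketch: env.addSource(samples).keyBy(_.modelKey).flatMap(new OnlineSgd(0.01)).print()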
> What should be the way of working here? We could have sketches for the
> separate projects in Gdocs, and then the shepherds could make a proposal
> out of them. Would that be feasible?

That's what I was thinking as well. It's the responsibility of the shepherd to engage the people motivated to work on a project, starting with a rough Gdocs document and gradually transitioning it into a proper design doc.

As an example use case (for both online and "fast-batch" learning) I would recommend an ad-click scenario: predicting CTR. There are multiple reasons I like this application:

- it's a very popular application,
- it's directly tied to revenue, so even small improvements are relevant,
- it can often be a very large-scale problem in both data and model size,
- there are good systems out there already to benchmark against, like Vowpal Wabbit,
- at least one large-scale dataset exists [1],
- we could even place a pre-processing pipeline in front to emulate a real application, and show the full benefit of using Flink as a one-stop shop for an integrated prediction pipeline (up until model serving, for now).

We are still missing someone to take the lead on the model serving project; if somebody is interested in coordinating that, let us know.

Regards,
Theodore

[1] Criteo click-through data (1TB): http://www.criteo.com/news/press-releases/2015/06/criteo-releases-industrys-largest-ever-dataset/
 |
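For a concrete (and purely illustrative) picture of the kind of model the CTR use case calls for, here is a tiny sketch of online logistic regression over hashed features, the style of model Vowpal Wabbit popularized; the feature-space size, hashing scheme and learning rate are assumptions of the sketch, not a recommendation.

object CtrSketch {
  val numWeights = 1 << 18                     // hashed feature space
  val weights = new Array[Double](numWeights)
  val learningRate = 0.05

  def index(feature: String): Int = (feature.hashCode & Int.MaxValue) % numWeights

  // Probability of a click for one impression, described by string features.
  def predict(features: Seq[String]): Double = {
    val margin = features.map(f => weights(index(f))).sum
    1.0 / (1.0 + math.exp(-margin))
  }

  // One SGD step on the logistic loss per observed impression.
  def update(features: Seq[String], clicked: Boolean): Unit = {
    val gradient = predict(features) - (if (clicked) 1.0 else 0.0)
    features.foreach { f =>
      val i = index(f)
      weights(i) = weights(i) - learningRate * gradient
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy impressions: (features, clicked)
    val impressions = Seq(
      (Seq("site=news", "ad=42", "hour=12"), true),
      (Seq("site=blog", "ad=42", "hour=03"), false))
    impressions.foreach { case (fs, clicked) => update(fs, clicked) }
    println(f"p(click) = ${predict(Seq("site=news", "ad=42", "hour=12"))}%.3f")
  }
}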
Hi there,
I am not a machine learning expert :) But recently I have seen more and more adoption of, and momentum behind, TensorFlow [1], which is backed by Google and other big vendors. If Flink were somehow compatible and could run TensorFlow pipelines (with some modifications being fine), I think adoption would be faster.

Thanks,
Chen

[1] https://github.com/tensorflow/tensorflow
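As an exploratory sketch of that idea (the model path and the "input"/"output" node names are placeholders, error handling is omitted, and it assumes TensorFlow's JVM API is on the classpath), a pre-trained SavedModel could be loaded once per Flink task and used to score records inside a map function:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.tensorflow.{SavedModelBundle, Tensor}

// Loads an exported TensorFlow model once per task and scores feature vectors.
class TensorFlowScorer(modelPath: String) extends RichMapFunction[Array[Float], Float] {
  @transient private var bundle: SavedModelBundle = _

  override def open(parameters: Configuration): Unit = {
    bundle = SavedModelBundle.load(modelPath, "serve")
  }

  override def map(features: Array[Float]): Float = {
    val input = Tensor.create(Array(features))        // shape [1, n]
    val output = bundle.session().runner()
      .feed("input", input)                           // placeholder node name
      .fetch("output")                                // placeholder node name
      .run()
      .get(0)
    val result = Array.ofDim[Float](1, 1)
    output.copyTo(result)
    input.close()
    output.close()
    result(0)(0)
  }

  override def close(): Unit = if (bundle != null) bundle.close()
}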
I also like the approach of aiming for new > > and somewhat unexplored areas. I guess we can do that with both the > > serving/evaluation and incremental training (that should be in scope of > the > > offline ML on streaming). > > > > I agree GPU acceleration is an important issue, however it might be > > out-of-scope for the prototypes of these new ML directions. What do you > > think? > > > > Regarding your comments on the other thread, I'm really glad PMC is > > working towards growing the community. This is crucial to have anything > > merged in Flink while keeping the code quality. However, for the > > prototypes, I'd prefer Theodore's suggestion, to do it in a separate > > repository, to make initial development faster. After the prototypes have > > proven their usability we could merge them, and continue working on them > > inside the Flink repository. But we can decide that later. > > > > Cheers, > > Gabor > > > > > > > > On 2017-03-14 21:04, Stephan Ewen wrote: > > > >> Thanks Theo. Just wrote some comments on the other thread, but it looks > >> like you got it covered already. > >> > >> Let me re-post what I think may help as input: > >> > >> *Concerning Model Evaluation / Serving * > >> > >> - My personal take is that the "model evaluation" over streams will > be > >> happening in any case - there > >> is genuine interest in that and various users have built that > >> themselves already. > >> I would be a cool way to do something that has a very high chance > of > >> being productionized by users soon. > >> > >> - The model evaluation as one step of a streaming pipeline > >> (classifying > >> events), followed by CEP (pattern detection) > >> or anomaly detection is a valuable use case on top of what pure > >> model > >> serving systems usually do. > >> > >> - A question I have not yet a good intuition on is whether the > "model > >> evaluation" and the training part are so > >> different (one a good abstraction for model evaluation has been > >> built) > >> that there is little cross coordination needed, > >> or whether there is potential in integrating them. > >> > >> > >> *Thoughts on the ML training library (DataSet API or DataStream API)* > >> > >> - I honestly don't quite understand what the big difference will be > in > >> targeting the batch or streaming API. You can use the > >> DataSet API in a quite low-level fashion (missing async > iterations). > >> > >> - There seems especially now to be a big trend towards deep learning > >> (is > >> it just temporary or will this be the future?) and in > >> that space, little works without GPU acceleration. > >> > >> - It is always easier to do something new than to be the n-th version > >> of > >> something existing (sorry for the generic true-ism). > >> The later admittedly gives the "all in one integrated framework" > >> advantage (which can be a very strong argument indeed), > >> but the former attracts completely new communities and can often > make > >> more impact with less effort. > >> > >> - The "new" is not required to be "online learning", where Theo has > >> described some concerns well. > >> It can also be traditional ML re-imagined for "continuous > >> applications", as "continuous / incremental re-training" or so. > >> Even on the "model evaluation side", there is a lot of interesting > >> stuff as mentioned already, like ensembles, multi-armed bandits, ... 
> >> > >> - It may be well worth tapping into the work of an existing library > >> (like > >> tensorflow) for an easy fix to some hard problems (pre-existing > >> hardware integration, pre-existing optimized linear algebra > solvers, > >> etc) and think about how such use cases would look like in > >> the context of typical Flink applications. > >> > >> > >> *A bit of engine background information that may help in the planning:* > >> > >> - The DataStream API will in the future also support bounded data > >> computations explicitly (I say this not as a fact, but as > >> a strong believer that this is the right direction). > >> > >> - Batch runtime execution has seen less focus recently, but seems to > >> get > >> a bit more community focus, because some organizations > >> that contribute a lot want to use the batch side as well. For > example > >> the effort on file-grained recovery will strengthen batch a lot already. > >> > >> > >> Stephan > >> > >> > >> > >> On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis < > >> [hidden email]> wrote: > >> > >> Hello all, > >>> > >>> ## Executive summary: > >>> > >>> - Offline-on-streaming most popular, then online and model serving. > >>> - Need shepherds to lead development/coordination of each task. > >>> - I can shepherd online learning, need shepherds for the other two. > >>> > >>> > >>> so from the people sharing their opinion it seems most people would > like > >>> to > >>> try out offline learning with the streaming API. > >>> I also think this is an interesting option, but probably the most risky > >>> of > >>> the bunch. > >>> > >>> After that online learning and model serving seem to have around the > same > >>> amount of interest. > >>> > >>> Given that, and the discussions we had in the Gdoc, here's what I > >>> recommend > >>> as next actions: > >>> > >>> - > >>> *Offline on streaming: *Start by creating a design document, with an > MVP > >>> specification about what we > >>> imagine such a library to look like and what we think should be > >>> possible > >>> to do. > >>> It should state clear goals and limitations; scoping the amount of > >>> work > >>> is > >>> more important at this point than specific engineering choices. > >>> - > >>> *Online learning: *If someone would like instead to work on online > >>> learning > >>> I can help out there, > >>> I have one student working on such a library right now, and I'm > sure > >>> people > >>> at TU Berlin (Felix?) have similar efforts. Ideally we would like > to > >>> communicate with > >>> them. Since this is a much more explored space, we could jump > >>> straight > >>> into a technical > >>> design document, (with scoping included of course) discussing > >>> abstractions, and comparing > >>> with existing frameworks. > >>> - > >>> *Model serving: *There will be a presentation at Flink Forward SF on > >>> such a > >>> framework (Flink Tensorflow) > >>> by Eron Wright [1]. My recommendation would be to communicate with > >>> the > >>> author and see > >>> if he would be interested in working together to generalize and > >>> extend > >>> the framework. > >>> For more research and resources on the topic see [2] or this > >>> presentation [3], particularly the Clipper system. > >>> > >>> In order to have some activity on each project I recommend we set a > >>> minimum > >>> of 2 people willing to > >>> contribute to each project. 
> >>> > >>> If we "assign" people by top choice, that should be possible to do, > >>> although my original plan was > >>> to only work on two of the above, to avoid fragmentation. But given > that > >>> online learning will have work > >>> being done by students as well, it should be possible to keep it > running. > >>> > >>> Next *I would like us to assign a "shepherd" for each of these tasks.* > If > >>> you are willing to coordinate the development > >>> on one of these options, let us know here and you can take up the task > of > >>> coordinating with the rest of > >>> of the people working on the task. > >>> > >>> I would like to volunteer to coordinate the *Online learning *effort, > >>> since > >>> I'm already supervising a student > >>> working on this, and I'm currently developing such algorithms. I plan > to > >>> contribute to the offline on streaming > >>> task as well, but not coordinate it. > >>> > >>> So if someone would like to take the lead on Offline on streaming or > >>> Model > >>> serving, let us know and > >>> we can take it from there. > >>> > >>> Regards, > >>> Theodore > >>> > >>> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-te > >>> nsorflow/ > >>> > >>> [2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html > >>> > >>> [3] > >>> https://ucbrise.github.io/cs294-rise-fa16/assets/slides/ > >>> prediction-serving-systems-cs294-RISE_seminar.pdf > >>> > >>> On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos < > >>> [hidden email]> wrote: > >>> > >>> Thanks Theodore, > >>>> > >>>> I'd vote for > >>>> > >>>> - Offline learning with Streaming API > >>>> > >>>> - Low-latency prediction serving > >>>> > >>>> Some comments... > >>>> > >>>> Online learning > >>>> > >>>> Good to have but my feeling is that it is not a strong requirement > (if a > >>>> requirement at all) across the industry right now. May become hot in > the > >>>> future. > >>>> > >>>> Offline learning with Streaming API: > >>>> > >>>> Although it requires engine changes or extensions (feasibility is an > >>>> > >>> issue > >>> > >>>> here), my understanding is that it reflects the industry common > practice > >>>> (train every few minutes at most) and it would be great if that was > >>>> supported out of the box providing a friendly API for the developer. > >>>> > >>>> Offline learning with the batch API: > >>>> > >>>> I would love to have a limited set of algorithms so someone does not > >>>> > >>> leave > >>> > >>>> Flink to work with another tool > >>>> for some initial dataset if he wants to. In other words, let's reach a > >>>> mature state with some basic algos merged. > >>>> There is a lot of work pending let's not waste it. > >>>> > >>>> Low-latency prediction serving > >>>> > >>>> Model serving is a long standing problem, we could definitely help > with > >>>> that. > >>>> > >>>> Regards, > >>>> Stavros > >>>> > >>>> > >>>> > >>>> On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <[hidden email]> > >>>> wrote: > >>>> > >>>> Thanks Theo for steering Flink's ML effort here :-) > >>>>> > >>>>> I'd vote to concentrate on > >>>>> > >>>>> - Online learning > >>>>> - Low-latency prediction serving > >>>>> > >>>>> because of the following reasons: > >>>>> > >>>>> Online learning: > >>>>> > >>>>> I agree that this topic is highly researchy and it's not even clear > >>>>> > >>>> whether > >>>> > >>>>> it will ever be of any interest outside of academia. However, it was > >>>>> > >>>> the > >>> > >>>> same for other things as well. 
Adoption in industry is usually slow > and > >>>>> sometimes one has to dare to explore something new. > >>>>> > >>>>> Low-latency prediction serving: > >>>>> > >>>>> Flink with its streaming engine seems to be the natural fit for such > a > >>>>> > >>>> task > >>>> > >>>>> and it is a rather low hanging fruit. Furthermore, I think that users > >>>>> > >>>> would > >>>> > >>>>> directly benefit from such a feature. > >>>>> > >>>>> Offline learning with Streaming API: > >>>>> > >>>>> I'm not fully convinced yet that the streaming API is powerful enough > >>>>> (mainly due to lack of proper iteration support and spilling > >>>>> > >>>> capabilities) > >>>> > >>>>> to support a wide range of offline ML algorithms. And if then it will > >>>>> > >>>> only > >>>> > >>>>> support rather small problem sizes because streaming cannot > gracefully > >>>>> spill the data to disk. There are still to many open issues with the > >>>>> streaming API to be applicable for this use case imo. > >>>>> > >>>>> Offline learning with the batch API: > >>>>> > >>>>> For offline learning the batch API is imo still better suited than > the > >>>>> streaming API. I think it will only make sense to port the algorithms > >>>>> > >>>> to > >>> > >>>> the streaming API once batch and streaming are properly unified. Alone > >>>>> > >>>> the > >>>> > >>>>> highly efficient implementations for joining and sorting of data > which > >>>>> > >>>> can > >>>> > >>>>> go out of memory are important to support big sized ML problems. In > >>>>> general, I think it might make sense to offer a basic set of ML > >>>>> > >>>> primitives. > >>>> > >>>>> However, already offering this basic set is a considerable amount of > >>>>> > >>>> work. > >>>> > >>>>> Concering the independent organization for the development: I think > it > >>>>> would be great if the development could still happen under the > umbrella > >>>>> > >>>> of > >>>> > >>>>> Flink's ML library because otherwise we might risk some kind of > >>>>> fragmentation. In order for people to collaborate, one can also open > >>>>> > >>>> PRs > >>> > >>>> against a branch of a forked repo. > >>>>> > >>>>> I'm currently working on wrapping the project re-organization > >>>>> > >>>> discussion > >>> > >>>> up. The general position was that it would be best to have an > >>>>> > >>>> incremental > >>> > >>>> build and keep everything in the same repo. If this is not possible > >>>>> > >>>> then > >>> > >>>> we > >>>> > >>>>> want to look into creating a sub repository for the libraries (maybe > >>>>> > >>>> other > >>>> > >>>>> components will follow later). I hope to make some progress on this > >>>>> > >>>> front > >>> > >>>> in the next couple of days/week. I'll keep you updated. > >>>>> > >>>>> As a general remark for the discussions on the google doc. I think it > >>>>> > >>>> would > >>>> > >>>>> be great if we could at least mirror the discussions happening in the > >>>>> google doc back on the mailing list or ideally conduct the > discussions > >>>>> directly on the mailing list. That's at least what the ASF > encourages. > >>>>> > >>>>> Cheers, > >>>>> Till > >>>>> > >>>>> On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann < > [hidden email] > >>>>> wrote: > >>>>> > >>>>> Hey all, > >>>>>> > >>>>>> Sorry for the bit late response. 
> >>>>>> > >>>>>> I'd like to work on > >>>>>> - Offline learning with Streaming API > >>>>>> - Low-latency prediction serving > >>>>>> > >>>>>> I would drop the batch API ML because of past experience with lack > of > >>>>>> support, and online learning because the lack of use-cases. > >>>>>> > >>>>>> I completely agree with Kate that offline learning should be > >>>>>> > >>>>> supported, > >>> > >>>> but given Flink's resources I prefer using the streaming API as > >>>>>> > >>>>> Roberto > >>> > >>>> suggested. Also, full model lifecycle (or end-to-end ML) could be > >>>>>> > >>>>> more > >>> > >>>> easily supported in one system (one API). Connecting Flink Batch with > >>>>>> > >>>>> Flink > >>>>> > >>>>>> Streaming is currently cumbersome (although side inputs [1] might > >>>>>> > >>>>> help). > >>>> > >>>>> In > >>>>> > >>>>>> my opinion, a crucial part of end-to-end ML is low-latency > >>>>>> > >>>>> predictions. > >>> > >>>> As another direction, we could integrate Flink Streaming API with > >>>>>> > >>>>> other > >>> > >>>> projects (such as Prediction IO). However, I believe it's better to > >>>>>> > >>>>> first > >>>> > >>>>> evaluate the capabilities and drawbacks of the streaming API with > >>>>>> > >>>>> some > >>> > >>>> prototype of using Flink Streaming for some ML task. Otherwise we > >>>>>> > >>>>> could > >>> > >>>> run > >>>>> > >>>>>> into critical issues just as the System ML integration with e.g. > >>>>>> > >>>>> caching. > >>>> > >>>>> These issues makes the integration of Batch API with other ML > >>>>>> > >>>>> projects > >>> > >>>> practically infeasible. > >>>>>> > >>>>>> I've already been experimenting with offline learning with the > >>>>>> > >>>>> Streaming > >>>> > >>>>> API. Hopefully, I can share some initial performance results next > >>>>>> > >>>>> week > >>> > >>>> on > >>>> > >>>>> matrix factorization. Naturally, I've run into issues. E.g. I could > >>>>>> > >>>>> only > >>>> > >>>>> mark the end of input with some hacks, because this is not needed at > >>>>>> > >>>>> a > >>> > >>>> streaming job consuming input forever. AFAIK, this would be resolved > >>>>>> > >>>>> by > >>> > >>>> side inputs [1]. > >>>>>> > >>>>>> @Theodore: > >>>>>> +1 for doing the prototype project(s) separately the main Flink > >>>>>> repository. Although, I would strongly suggest to follow Flink > >>>>>> > >>>>> development > >>>>> > >>>>>> guidelines as closely as possible. As another note, there is already > >>>>>> > >>>>> a > >>> > >>>> GitHub organization for Flink related projects [2], but it seems like > >>>>>> > >>>>> it > >>>> > >>>>> has not been used much. > >>>>>> > >>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+ > >>>>>> Side+Inputs+for+DataStream+API > >>>>>> [2] https://github.com/project-flink > >>>>>> > >>>>>> > >>>>>> On 2017-03-04 08:44, Roberto Bentivoglio wrote: > >>>>>> > >>>>>> Hi All, > >>>>>> > >>>>>>> I'd like to start working on: > >>>>>>> - Offline learning with Streaming API > >>>>>>> - Online learning > >>>>>>> > >>>>>>> I think also that using a new organisation on github, as Theodore > >>>>>>> > >>>>>> propsed, > >>>>> > >>>>>> to keep an initial indipendency to speed up the prototyping and > >>>>>>> development > >>>>>>> phases it's really interesting. 
> >>>>>>> > >>>>>>> I totally agree with Katherin, we need offline learning, but my > >>>>>>> > >>>>>> opinion > >>>> > >>>>> is > >>>>> > >>>>>> that it will be more straightforward to fix the streaming issues > >>>>>>> > >>>>>> than > >>> > >>>> batch > >>>>>>> issues because we will have more support on that by the Flink > >>>>>>> > >>>>>> community. > >>>> > >>>>> Thanks and have a nice weekend, > >>>>>>> Roberto > >>>>>>> > >>>>>>> On 3 March 2017 at 20:20, amir bahmanyari > >>>>>>> > >>>>>> <[hidden email] > >>> > >>>> wrote: > >>>>>>> > >>>>>>> Great points to start: - Online learning > >>>>>>> > >>>>>>>> - Offline learning with the streaming API > >>>>>>>> > >>>>>>>> Thanks + have a great weekend. > >>>>>>>> > >>>>>>>> From: Katherin Eri <[hidden email]> > >>>>>>>> To: [hidden email] > >>>>>>>> Sent: Friday, March 3, 2017 7:41 AM > >>>>>>>> Subject: Re: Machine Learning on Flink - Next steps > >>>>>>>> > >>>>>>>> Thank you, Theodore. > >>>>>>>> > >>>>>>>> Shortly speaking I vote for: > >>>>>>>> 1) Online learning > >>>>>>>> 2) Low-latency prediction serving -> Offline learning with the > >>>>>>>> > >>>>>>> batch > >>> > >>>> API > >>>>> > >>>>>> In details: > >>>>>>>> 1) If streaming is strong side of Flink lets use it, and try to > >>>>>>>> > >>>>>>> support > >>>> > >>>>> some online learning or light weight inmemory learning algorithms. > >>>>>>>> > >>>>>>> Try > >>>> > >>>>> to > >>>>> > >>>>>> build pipeline for them. > >>>>>>>> > >>>>>>>> 2) I think that Flink should be part of production ecosystem, and > >>>>>>>> > >>>>>>> if > >>> > >>>> now > >>>>> > >>>>>> productions require ML support, multiple models deployment and so > >>>>>>>> > >>>>>>> on, > >>> > >>>> we > >>>>> > >>>>>> should serve this. But in my opinion we shouldn’t compete with such > >>>>>>>> projects like PredictionIO, but serve them, to be an execution > >>>>>>>> > >>>>>>> core. > >>> > >>>> But > >>>>> > >>>>>> that means a lot: > >>>>>>>> > >>>>>>>> a. Offline training should be supported, because typically most of > >>>>>>>> > >>>>>>> ML > >>> > >>>> algs > >>>>>>>> are for offline training. > >>>>>>>> b. Model lifecycle should be supported: > >>>>>>>> ETL+transformation+training+scoring+exploitation quality > >>>>>>>> > >>>>>>> monitoring > >>> > >>>> I understand that batch world is full of competitors, but for me > >>>>>>>> > >>>>>>> that > >>> > >>>> doesn’t mean that batch should be ignored. I think that separated > >>>>>>>> streaming/batching applications causes additional deployment and > >>>>>>>> exploitation overhead which typically tried to be avoided. That > >>>>>>>> > >>>>>>> means > >>> > >>>> that > >>>>>>>> we should attract community to this problem in my opinion. > >>>>>>>> > >>>>>>>> > >>>>>>>> пт, 3 мар. 2017 г. в 15:34, Theodore Vasiloudis < > >>>>>>>> [hidden email]>: > >>>>>>>> > >>>>>>>> Hello all, > >>>>>>>> > >>>>>>>> From our previous discussion started by Stavros, we decided to > >>>>>>>> > >>>>>>> start a > >>>> > >>>>> planning document [1] > >>>>>>>> to figure out possible next steps for ML on Flink. > >>>>>>>> > >>>>>>>> Our concerns where mainly ensuring active development while > >>>>>>>> > >>>>>>> satisfying > >>>> > >>>>> the > >>>>>>>> needs of > >>>>>>>> the community. > >>>>>>>> > >>>>>>>> We have listed a number of proposals for future work in the > >>>>>>>> > >>>>>>> document. 
> >>> > >>>> In > >>>>> > >>>>>> short they are: > >>>>>>>> > >>>>>>>> - Offline learning with the batch API > >>>>>>>> - Online learning > >>>>>>>> - Offline learning with the streaming API > >>>>>>>> - Low-latency prediction serving > >>>>>>>> > >>>>>>>> I saw there is a number of people willing to work on ML for Flink, > >>>>>>>> > >>>>>>> but > >>>> > >>>>> the > >>>>>>>> truth is that we cannot > >>>>>>>> cover all of these suggestions without fragmenting the development > >>>>>>>> > >>>>>>> too > >>>> > >>>>> much. > >>>>>>>> > >>>>>>>> So my recommendation is to pick out 2 of these options, create > >>>>>>>> > >>>>>>> design > >>> > >>>> documents and build prototypes for each library. > >>>>>>>> We can then assess their viability and together with the community > >>>>>>>> > >>>>>>> decide > >>>>> > >>>>>> if we should try > >>>>>>>> to include one (or both) of them in the main Flink distribution. > >>>>>>>> > >>>>>>>> So I invite people to express their opinion about which task they > >>>>>>>> > >>>>>>> would > >>>> > >>>>> be > >>>>>>>> willing to contribute > >>>>>>>> and hopefully we can settle on two of these options. > >>>>>>>> > >>>>>>>> Once that is done we can decide how we do the actual work. Since > >>>>>>>> > >>>>>>> this > >>> > >>>> is > >>>>> > >>>>>> highly experimental > >>>>>>>> I would suggest we work on repositories where we have complete > >>>>>>>> > >>>>>>> control. > >>>> > >>>>> For that purpose I have created an organization [2] on Github which > >>>>>>>> > >>>>>>> we > >>>> > >>>>> can > >>>>>>>> use to create repositories and teams that work on them in an > >>>>>>>> > >>>>>>> organized > >>>> > >>>>> manner. > >>>>>>>> Once enough work has accumulated we can start discussing > >>>>>>>> > >>>>>>> contributing > >>> > >>>> the > >>>>> > >>>>>> code > >>>>>>>> to the main distribution. > >>>>>>>> > >>>>>>>> Regards, > >>>>>>>> Theodore > >>>>>>>> > >>>>>>>> [1] > >>>>>>>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3U > >>>>>>>> d06MIRhahtJ6dw/ > >>>>>>>> [2] https://github.com/flinkml > >>>>>>>> > >>>>>>>> -- > >>>>>>>> > >>>>>>>> *Yours faithfully, * > >>>>>>>> > >>>>>>>> *Kate Eri.* > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > > > |
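The CTR use case quoted above maps naturally onto a very small first prototype: online logistic regression over hashed categorical features, in the spirit of Vowpal Wabbit. The plain-Scala sketch below only illustrates that update rule; the class name, feature-space size, and learning rate are arbitrary assumptions, and wiring it into a Flink operator and benchmarking it on the Criteo data is exactly what a design document would have to settle.

```scala
import scala.util.hashing.MurmurHash3

/**
 * Minimal online logistic regression with the hashing trick, in the spirit of
 * Vowpal Wabbit. Feature-space size and learning rate are illustrative choices.
 */
class OnlineLogisticRegression(numWeights: Int = 1 << 18, learningRate: Double = 0.05) {

  private val weights = new Array[Double](numWeights)

  // Hash a categorical feature (e.g. "site=news") into a weight index.
  private def indexOf(feature: String): Int =
    math.abs(MurmurHash3.stringHash(feature) % numWeights)

  /** Predicted click probability for one impression. */
  def predict(features: Seq[String]): Double = {
    val margin = features.iterator.map(f => weights(indexOf(f))).sum
    1.0 / (1.0 + math.exp(-margin))
  }

  /** One SGD step on a labeled impression (label 1.0 = click, 0.0 = no click). */
  def learn(features: Seq[String], label: Double): Double = {
    val p = predict(features)
    val gradient = p - label // derivative of the log loss w.r.t. the margin
    features.foreach { f =>
      val i = indexOf(f)
      weights(i) -= learningRate * gradient
    }
    p
  }
}

object OnlineLogisticRegression {
  def main(args: Array[String]): Unit = {
    val model = new OnlineLogisticRegression()
    // Tiny made-up impressions: features are categorical key=value strings.
    val impressions = Seq(
      (Seq("site=news", "device=mobile"), 1.0),
      (Seq("site=shop", "device=desktop"), 0.0))
    impressions.foreach { case (fs, label) => model.learn(fs, label) }
    println(model.predict(Seq("site=news", "device=mobile")))
  }
}
```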
Hi Chen,
Thanks for the input! :) There is already a project [1] for using TensorFlow models in Flink, and Theodore has suggested contacting the author, Eron Wright, for the model serving direction.

[1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/

Cheers,
Gabor

On 2017-03-17 19:41, Chen Qin wrote:
> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/ |
Hi all...
I agree about the TensorFlow integration; it seems to be important from what I hear. Should we sign up somewhere for the working groups (Gdocs)? I would like to start helping with the model serving feature.

Best Regards,
Stavros

On Fri, Mar 17, 2017 at 10:34 PM, Gábor Hermann <[hidden email]> wrote:

> Hi Chen,
>
> Thanks for the input! :)
>
> There is already a project [1] for using TensorFlow models in Flink, and
> Theodore has suggested contacting the author, Eron Wright, for the model
> serving direction.
>
> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
>
> Cheers,
> Gabor
>
> On 2017-03-17 19:41, Chen Qin wrote:
>
>> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> |
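For the model serving direction, here is a minimal sketch of what low-latency scoring inside a Flink pipeline could look like: a linear model is loaded once per parallel task in open() and applied to every event in map(). The model format, file path, and class names are illustrative assumptions, not part of any existing Flink or flink-tensorflow API.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

// Scores each incoming feature vector with a pre-trained linear model.
// Assumption: the model is a plain text file with one weight per line.
class LinearModelScorer(modelPath: String)
    extends RichMapFunction[Array[Double], Double] {

  @transient private var weights: Array[Double] = _

  override def open(parameters: Configuration): Unit = {
    // Load the (externally trained) model once per parallel task instance.
    weights = scala.io.Source.fromFile(modelPath).getLines()
      .map(_.trim.toDouble).toArray
  }

  override def map(features: Array[Double]): Double =
    features.zip(weights).map { case (x, w) => x * w }.sum
}

object ScoringJobSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // In a real job the feature vectors would come from Kafka or another source.
    val features: DataStream[Array[Double]] =
      env.fromElements(Array(1.0, 0.5), Array(0.2, 0.8))

    features
      .map(new LinearModelScorer("/tmp/model-weights.txt")) // hypothetical path
      .print()

    env.execute("model scoring sketch")
  }
}
```

In a production design the model would presumably be refreshed periodically or pushed through a second stream, which is one of the questions the model serving shepherd would need to pin down.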
Hello Stavros,
The way I thought we'd do it is that each shepherd would be responsible for organizing the project: that includes setting up a Google doc, sending an email to the dev list to inform the wider community, and if possible, personally contacting the people who expressed interest in the project. Would you be willing to lead that effort for the model serving project? Regards, Theodore -- Sent from a mobile device. May contain autocorrect errors. On Mar 19, 2017 3:49 AM, "Stavros Kontopoulos" <[hidden email]> wrote: > Hi all... > > I agree about the tensorflow integration it seems to be important from what > I hear. > Should we sign up somewhere for the working groups (gdcos)? > I would like to start helping with the model serving feature. > > Best Regards, > Stavros > > On Fri, Mar 17, 2017 at 10:34 PM, Gábor Hermann <[hidden email]> > wrote: > > > Hi Chen, > > > > Thanks for the input! :) > > > > There is already a project [1] for using TensorFlow models in Flink, and > > Theodore has suggested > > to contact the author, Eron Wright for the model serving direction. > > > > > > [1] http://sf.flink-forward.org/kb_sessions/introducing-flink- > tensorflow/ > > > > Cheers, > > Gabor > > > > > > On 2017-03-17 19:41, Chen Qin wrote: > > > >> [1]http://sf.flink-forward.org/kb_sessions/introducing-flink-te > >> nsorflow/ > >> > > > > > |
Hello, Theodore
Could you please move the development directions and their priorities from the *## Executive summary* to the Google doc?

Could you please also create a table in the Google doc representing the selected directions and the people who would like to drive or participate in each topic, in order to make this process transparent for the community and to sum up the current state of contributor commitment? There we could simply sign ourselves up for a topic.

And +1 for the CTR prediction case.

On Sun, Mar 19, 2017 at 16:49, Theodore Vasiloudis < [hidden email]> wrote:

> Hello Stavros,
>
> The way I thought we'd do it is that each shepherd would be responsible for
> organizing the project: that includes setting up a Google doc, sending an
> email to the dev list to inform the wider community, and if possible,
> personally contacting the people who expressed interest in the project.
>
> Would you be willing to lead that effort for the model serving project?
>
> Regards,
> Theodore
>
> --
> Sent from a mobile device. May contain autocorrect errors.

*Yours faithfully, *

*Kate Eri.* |
Hi All,
Sorry for joining this discussion late. My graduation thesis is about an online learning system; I will build it on Flink in the next three months.

I'd like to contribute to:
- Online learning

On Mon, Mar 20, 2017 at 6:51 PM Katherin Eri <[hidden email]> wrote:

Hello, Theodore

Could you please move the development directions and their priorities from the *## Executive summary* to the Google doc?

Could you please also create a table in the Google doc representing the selected directions and the people who would like to drive or participate in each topic, in order to make this process transparent for the community and to sum up the current state of contributor commitment? There we could simply sign ourselves up for a topic.

And +1 for the CTR prediction case. |
Hello everyone,
Here at DFKI, we are currently working on a project that involves developing open-source online machine learning algorithms on top of Flink. So far, we have simple moments, sampling (e.g., simple reservoir sampling), and sketches (e.g., Frequent Directions) built on top of scikit-like abstractions and Flink's DataStream/KeyedStream. Moreover, we have a few industrial use cases, and we are going to validate our implementation using real industrial data. We plan to implement more advanced algorithms in the future, as well as to share our results with you and contribute, in case you are interested.

Best,
Ventura

This message, for the D. Lgs n. 196/2003 (Privacy Code), may contain confidential and/or privileged information. If you are not the addressee or authorized to receive this for the addressee, you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by reply e-mail and delete this message. Thank you for your cooperation.

On Mon, Mar 20, 2017 at 12:26 PM, Tao Meng <[hidden email]> wrote:

> Hi All,
>
> Sorry for joining this discussion late. My graduation thesis is about an
> online learning system; I will build it on Flink in the next three months.
>
> I'd like to contribute to:
> - Online learning |
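Since reservoir sampling is mentioned above as one of the first primitives, a plain-Scala sketch of the textbook Algorithm R may help anchor the discussion; this is an illustration of the invariant (after n items, every item has probability k/n of being in the sample), not DFKI's actual implementation.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

/** Algorithm R: keeps a uniform random sample of at most k items from a stream. */
class ReservoirSample[T](k: Int, rng: Random = new Random(42)) {

  private val reservoir = new ArrayBuffer[T](k)
  private var seen = 0L // number of items observed so far

  def add(item: T): Unit = {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item // fill the reservoir first
    } else {
      // Replace a random slot with probability k / seen.
      val j = (rng.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item
    }
  }

  def sample: Seq[T] = reservoir.toList
}

object ReservoirSampleExample {
  def main(args: Array[String]): Unit = {
    val sampler = new ReservoirSample[Int](k = 10)
    (1 to 1000000).foreach(sampler.add)
    println(sampler.sample) // approximately uniform over 1..1000000
  }
}
```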
Hello Theodore,
I could lead the effort on the model serving project; I think I have the cycles for it. I would also contact Eron and see if he wants to help in that direction.

Best,
Stavros

On Sun, Mar 19, 2017 at 3:43 PM, Theodore Vasiloudis < [hidden email]> wrote:

> Hello Stavros,
>
> The way I thought we'd do it is that each shepherd would be responsible for
> organizing the project: that includes setting up a Google doc, sending an
> email to the dev list to inform the wider community, and if possible,
> personally contacting the people who expressed interest in the project.
>
> Would you be willing to lead that effort for the model serving project?
>
> Regards,
> Theodore
>
> --
> Sent from a mobile device. May contain autocorrect errors. |
Hi all,
@Theodore: +1 for the CTR use case. Thanks for the suggestion!
@Katherin: +1 for reflecting the choices made here and the contributor commitments in the Gdoc.
@Tao, @Ventura: It's great to hear you have been working on ML on Flink :) I hope we can all aggregate our efforts somehow. It would be best if you could contribute some of your work.

I've started putting together a Gdoc specifically for *Offline/incremental learning on Streaming API*:
https://docs.google.com/document/d/18BqoFTQ0dPkbyO-PWBMMpW5Nl0pjobSubnWpW0_r8yA/
Right now you can comment/give suggestions there. I'd like to start a separate mailing list discussion as soon as there are enough contributors volunteering for this direction. For now, I'm trying to reflect the relevant parts of the discussion here and of the initial Gdoc [1].

[1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/

Cheers,
Gabor

On 2017-03-20 14:27, Ventura Del Monte wrote:
> Hello everyone,
>
> Here at DFKI, we are currently working on a project that involves developing
> open-source online machine learning algorithms on top of Flink.
> So far, we have simple moments, sampling (e.g., simple reservoir sampling),
> and sketches (e.g., Frequent Directions) built on top of scikit-like
> abstractions and Flink's DataStream/KeyedStream.
> Moreover, we have a few industrial use cases, and we are going to validate
> our implementation using real industrial data.
> We plan to implement more advanced algorithms in the future, as well as to
> share our results with you and contribute, in case you are interested.
>
> Best,
> Ventura |
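To make the offline/incremental-learning-on-streaming direction a bit more concrete, here is a hedged sketch of the kind of state-backed per-key update such a library could build on, using the Scala mapWithState shortcut over keyed state. The job, the keys, and the statistic itself (a running mean) are illustrative assumptions; a real design would substitute actual model updates and pin down the target Flink version and state API.

```scala
import org.apache.flink.streaming.api.scala._

/**
 * A per-key incremental mean kept in keyed state. Sketch only: a real library
 * would replace the mean with proper model updates (SGD steps, factor updates, ...).
 */
object IncrementalMeanSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Hypothetical (key, value) events; in practice these would come from Kafka etc.
    val events: DataStream[(String, Double)] = env.fromElements(
      ("model-a", 1.0), ("model-b", 4.0), ("model-a", 3.0), ("model-a", 5.0))

    // State per key: (count, running mean), updated one element at a time.
    val runningMeans: DataStream[(String, Double)] = events
      .keyBy(_._1)
      .mapWithState((in: (String, Double), state: Option[(Long, Double)]) =>
        state match {
          case None =>
            ((in._1, in._2), Some((1L, in._2)))
          case Some((n, mean)) =>
            val newMean = mean + (in._2 - mean) / (n + 1) // incremental mean update
            ((in._1, newMean), Some((n + 1, newMean)))
        })

    runningMeans.print()
    env.execute("incremental per-key mean (sketch)")
  }
}
```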