Hi y'all,
I think this is an oft-requested feature [0] and there are many graph algorithms for which intermediate output is the desired result. I'd like to take Stephan up on his offer [1] for pointers.

I have yet to get in deep, but I see that iteration tasks are treated specially as IterationIntermediateTask for synchronization between supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are walking the program DAG an iteration must be first reached through the tail.

Greg

[0] http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset
[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html

Hey,
it would be great to add this feature indeed! Thanks for bringing it up Greg :)
Would the best way be to extend the iteration operators to support intermediate outputs or revisit the idea of caching intermediate results and thus allow efficient for-loop iterations?

-Vasia.

In reply to Greg Hogan's post of 26 May 2016 at 22:41.

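To make the request concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use any Flink API; the function names and the breadth-first-search example are assumptions chosen for illustration. It contrasts an iteration that only returns the state after the last superstep with one that also emits an output per superstep, which is the behaviour being asked for.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[int]]


def bfs_step(graph: Graph, visited: Set[int], frontier: Set[int]) -> Tuple[Set[int], Set[int]]:
    """One superstep: expand the current frontier by one hop."""
    next_frontier = {n for v in frontier for n in graph.get(v, []) if n not in visited}
    return visited | next_frontier, next_frontier


def iterate_final_only(graph: Graph, source: int, max_supersteps: int) -> Set[int]:
    """Models a closed iteration: only the final state leaves the loop."""
    visited, frontier = {source}, {source}
    for _ in range(max_supersteps):
        if not frontier:
            break
        visited, frontier = bfs_step(graph, visited, frontier)
    return visited


def iterate_with_intermediate_output(
    graph: Graph, source: int, max_supersteps: int
) -> Tuple[Set[int], List[Set[int]]]:
    """Models the requested feature: every superstep also emits its frontier."""
    visited, frontier = {source}, {source}
    per_superstep: List[Set[int]] = []
    for _ in range(max_supersteps):
        if not frontier:
            break
        visited, frontier = bfs_step(graph, visited, frontier)
        if frontier:
            per_superstep.append(frontier)  # the intermediate output
    return visited, per_superstep
```

For a breadth-first search the per-superstep frontiers are the hop levels, so the intermediate output is itself the result of interest, while the final-only variant can report reachability but not distance.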
Hello,
> Would the best way be to extend the iteration operators to support
> intermediate outputs or revisit the idea of caching intermediate results
> and thus allow efficient for-loop iterations?

Caching intermediate results would also be a big help to projects that target Flink as a backend, like Emma [1] and SystemML [2]. The issue here is that these languages allow writing more general iterations (general control flow (nested loops, ifs in the loop body), multiple "solution sets", doing something else with the intermediate results, etc.) that can't be translated to Flink's iteration constructs. So these systems currently don't have much better options than just writing intermediate results to files, which is not so nice.

Best,
Gabor

[1] http://www.user.tu-berlin.de/asteriosk/assets/publications/emma-sigmod2015.pdf
[2] https://systemml.apache.org/

In reply to Vasiliki Kalavri's post of 2016-05-28 13:48 GMT+02:00.

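As a rough illustration of the "more general iterations" described here, the following plain-Python sketch shows a driver-side loop with a data-dependent branch and a nested loop in its body. It is a toy model under stated assumptions: the function name, the arithmetic, and the `cache` dictionary (standing in for an in-memory cache of intermediate results) are all hypothetical, and nothing here is Flink, Emma, or SystemML code.

```python
from typing import Dict, List, Tuple


def run_driver_loop(values: List[int], outer_steps: int) -> Tuple[List[int], Dict[int, List[int]]]:
    """Driver-side loop whose body is not one fixed step function."""
    cache: Dict[int, List[int]] = {}  # stand-in for cached intermediate results
    current = values
    for i in range(outer_steps):
        if sum(current) % 2 == 0:
            # Branch taken depends on the data produced so far.
            current = [v + 1 for v in current]
        else:
            # Nested loop inside the outer loop body.
            for _ in range(2):
                current = [v * 2 for v in current]
        cache[i] = current  # keep this step's result for later reuse
    return current, cache
```

Because the branch taken depends on the data, the loop body cannot be expressed as a single step function fixed ahead of time, and every pass needs the previous result to be available again, which is where a cache (or, today, a file) comes in.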
This is a feature that the Mahout project requested a few months ago, for the very same reasons mentioned in the previous emails on this thread, but we were snubbed by the Flink folks, who called it a '*WAY too specific*' request for Flink to deal with and said 'it's got to be done the way Flink has it', etc.

While delta iterations are real cool, it's not real trivial to have them as part of language-specific DSLs handling more general iterations. It's good to see that this limitation has started to bite others, and hopefully Data Artisans now sees this as a much-needed feature.

In reply to Gábor Gévay's post of Mon, May 30, 2016 at 8:31 AM.

Thanks Greg for opening this discussion!
I really really don't want to derail the discussion here, just a quick clarification regarding Suneel's last email: folks who work at Data Artisans participate in this community as individuals, not as a corporation, and the dev list is not a support forum to "request" features from some company, but an open forum for the Flink community. I would hope that we keep the discussion technical (I know that I broke this with this email, but really felt I had to clarify this).

I think all of us agree that this is a very useful feature, and I'm very happy to see more work on this!

Kostas

In reply to Suneel Marthi's post of Mon, May 30, 2016 at 2:49 PM.

In reply to this post by Greg Hogan
Greg,
We ran into this issue when implementing the Mahout bindings for Flink [1]. It ended up being the major bottleneck for Mahout on Flink, and it makes iterative algorithms basically unreasonable.

While it is understood that Flink's delta iterations are intended for use when iterating over a Flink DataSet, they are not always suitable for the task, e.g. FlinkML's ALS.scala [2][3], which must flush intermediate results to the file system. In-memory caching is a must-have for any type of declarative ML DSL riding on top of Flink, e.g. SystemML, Mahout, etc., or an internal Flink DSL.

The Mahout bindings are linear-algebra based. In close collaboration with the Flink community [4], and based on contributions from an intern hired by Data Artisans to complete this task and from other members of the community, we recently finished up work on the Mahout Distributed Linear Algebra bindings. Unfortunately the lack of an in-memory cache made the outcome of this effort very sub-optimal. After the unfinished hand-off of the bindings from the Flink community to the Mahout community, we were forced to find a workaround for the caching. We used the template of executing and flushing results to the file system [5], and had to release the bindings as "Experimental".

Anyone building a declarative ML interface using the Flink DataSet API as a backend will run into similar issues, as has been reported on the Stack Overflow thread you refer to, and it would be great to have this feature. It's great to see this being talked about.

Andy

[1] http://mahout.apache.org/users/flinkbindings/flink-internals.html
[2] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala#L481
[3] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L84
[4] https://issues.apache.org/jira/browse/MAHOUT-1570
[5] https://github.com/apache/mahout/blob/master/flink/src/main/scala/org/apache/mahout/flinkbindings/drm/CheckpointedFlinkDrm.scala#L139

In reply to Greg Hogan's post "Iteration Intermediate Output", sent Thursday, May 26, 2016 4:41:53 PM.

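The flush-to-file-system workaround mentioned in this message can be pictured with the plain-Python sketch below. It only models the pattern and is not the Mahout or FlinkML code linked above: the function names, the JSON file format, and the `step-N.json` naming are assumptions made for the example. After every step the result is written out and read back, so the next step starts from materialized data instead of from an in-memory cache.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def checkpoint_to_disk(rows: List[int], directory: str, step: int) -> Tuple[List[int], str]:
    """Write one step's result to a file and read it back (the costly part)."""
    path = os.path.join(directory, f"step-{step}.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    with open(path) as f:
        return json.load(f), path


def iterate_with_file_checkpoints(
    rows: List[int],
    step_fn: Callable[[List[int]], List[int]],
    steps: int,
    directory: str,
) -> Tuple[List[int], List[str]]:
    """Driver loop that materializes every intermediate result on disk."""
    paths: List[str] = []
    for i in range(steps):
        rows = step_fn(rows)
        rows, path = checkpoint_to_disk(rows, directory, i)
        paths.append(path)
    return rows, paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        final, files = iterate_with_file_checkpoints([1, 2, 3], lambda r: [x * 2 for x in r], 3, tmp)
        print(final, len(files))
```

Each pass pays for a full write and a full read of the data set, which is the overhead an in-memory cache would avoid; on the other hand, every intermediate result stays available as a file.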