Iteration Intermediate Output

Iteration Intermediate Output

Greg Hogan
Hi y'all,

I think this is an oft-requested feature [0] and there are many graph
algorithms for which intermediate output is the desired result. I'd like to
take Stephan up on his offer [1] for pointers.

I have yet to get in deep, but I see that iteration tasks are treated
specially as IterationIntermediateTask for synchronization between
supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are
walking the program DAG an iteration must be first reached through the tail.

Greg

[0]
http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset
[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html
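
To make the request above concrete, here is a minimal sketch in plain Python. It is not Flink code: `iterate_with_intermediate`, `bulk_iterate`, and the toy edge list are hypothetical names invented for illustration. It contrasts an iteration that hands back only its final state with one that also exposes the state after every superstep, which is the output some graph algorithms actually want (here, the set of vertices reachable within k hops).

```python
from typing import Callable, FrozenSet, Iterator, Tuple

State = FrozenSet[int]

# Toy directed graph (an assumption for this sketch): 1 -> 2 -> 3 -> 4 -> 5
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5)]


def expand_one_hop(reached: State) -> State:
    """Step function: add every vertex one edge away from the reached set."""
    neighbours = {dst for src, dst in EDGES if src in reached}
    return reached | frozenset(neighbours)


def iterate_with_intermediate(
    initial: State, step: Callable[[State], State], max_supersteps: int
) -> Iterator[Tuple[int, State]]:
    """Yield (superstep, state) after each superstep instead of hiding it."""
    state = initial
    for superstep in range(1, max_supersteps + 1):
        state = step(state)
        yield superstep, state


def bulk_iterate(
    initial: State, step: Callable[[State], State], max_supersteps: int
) -> State:
    """Return only the final state, discarding every intermediate one."""
    result = initial
    for _, result in iterate_with_intermediate(initial, step, max_supersteps):
        pass
    return result


snapshots = list(iterate_with_intermediate(frozenset({1}), expand_one_hop, 3))
final = bulk_iterate(frozenset({1}), expand_one_hop, 3)

for superstep, reached in snapshots:
    print(superstep, sorted(reached))
```

The last snapshot equals the final result, so the intermediate output is strictly additional information that the closed form of the iteration throws away.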

Re: Iteration Intermediate Output

Vasiliki Kalavri
Hey,

it would be great to add this feature indeed! Thanks for bringing it up
Greg :)
Would the best way be to extend the iteration operators to support
intermediate outputs or revisit the idea of caching intermediate results
and thus allow efficient for-loop iterations?

-Vasia.
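
As a rough illustration of why caching matters for the for-loop alternative mentioned above, the following plain-Python sketch models a lazily evaluated plan. The `Node` class, its `cache()` method, and `run_loop` are hypothetical and are not part of any Flink API. The assumption is that, without caching, inspecting the result of round i re-executes every operator from the source up to round i.

```python
class Node:
    """One operator in a lazily evaluated plan (toy model, not a Flink class)."""

    def __init__(self, stats, source=None, parent=None, fn=None):
        self.stats = stats      # shared counter of map-operator executions
        self.source = source    # input rows, set only on the source node
        self.parent = parent    # upstream node, set only on map nodes
        self.fn = fn            # per-row function applied by a map node
        self.keep = False       # whether to retain the result once computed
        self.cached = None

    def map(self, fn):
        return Node(self.stats, parent=self, fn=fn)

    def cache(self):
        self.keep = True
        return self

    def collect(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            rows = list(self.source)
        else:
            # Pulling from the parent re-runs its whole lineage unless cached.
            rows = [self.fn(r) for r in self.parent.collect()]
            self.stats["map_runs"] += 1
        if self.keep:
            self.cached = rows
        return rows


def run_loop(iterations, use_cache):
    """Driver-side for-loop that looks at the result of every round."""
    stats = {"map_runs": 0}
    data = Node(stats, source=[1, 2, 3])
    snapshots = []
    for _ in range(iterations):
        data = data.map(lambda x: x * 2)
        if use_cache:
            data.cache()
        snapshots.append(data.collect())
    return snapshots, stats["map_runs"]


plain_snapshots, plain_runs = run_loop(4, use_cache=False)
cached_snapshots, cached_runs = run_loop(4, use_cache=True)
print(plain_runs, cached_runs)
```

In this model both variants produce the same per-round results, but the uncached loop executes a number of operators that grows quadratically with the number of rounds, while the cached loop executes one new operator per round.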

On 26 May 2016 at 22:41, Greg Hogan <[hidden email]> wrote:

> Hi y'all,
>
> I think this is an oft-requested feature [0] and there are many graph
> algorithms for which intermediate output is the desired result. I'd like to
> take Stephan up on his offer [1] for pointers.
>
> I have yet to get in deep, but I see that iteration tasks are treated
> specially as IterationIntermediateTask for synchronization between
> supersteps. Also, when OperatorTranslation and GraphCreatingVisitor are
> walking the program DAG an iteration must be first reached through the
> tail.
>
> Greg
>
> [0]
>
> http://stackoverflow.com/questions/37224140/possibility-of-saving-partial-outputs-from-bulk-iteration-in-flink-dataset
> [1]
>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Intermediate-output-during-delta-iterations-td436.html
>

Re: Iteration Intermediate Output

Gábor Gévay
Hello,

> Would the best way be to extend the iteration operators to support
> intermediate outputs or revisit the idea of caching intermediate results
> and thus allow efficient for-loop iterations?

Caching intermediate results would also be a big help to projects that
target Flink as a backend, like Emma [1] and SystemML [2]. The issue
here is that these languages allow writing more general iterations
(general control flow such as nested loops and ifs in the loop body,
multiple "solution sets", doing something else with the intermediate
results, etc.) that can't be translated to Flink's iteration
constructs. So these systems currently don't have much better options
than just writing intermediate results to files, which is not so nice.

Best,
Gabor

[1] http://www.user.tu-berlin.de/asteriosk/assets/publications/emma-sigmod2015.pdf
[2] https://systemml.apache.org/



2016-05-28 13:48 GMT+02:00 Vasiliki Kalavri <[hidden email]>:

> Hey,
>
> it would be great to add this feature indeed! Thanks for bringing it up
> Greg :)
> Would the best way be to extend the iteration operators to support
> intermediate outputs or revisit the idea of caching intermediate results
> and thus allow efficient for-loop iterations?
>
> -Vasia.

Re: Iteration Intermediate Output

Suneel Marthi
The Mahout project requested this feature a few months ago, for the very
same reasons mentioned in the previous emails on this thread, but we were
snubbed by the Flink folks, who called it a '*WAY too specific*' request
for Flink to deal with and said 'it's got to be done the way Flink has it',
etc.

While delta iterations are really cool, it's not trivial to use them from
language-specific DSLs that handle more general iterations.  It's good
to see that this limitation has started to bite others, and hopefully Data
Artisans now sees this as a much-needed feature.



On Mon, May 30, 2016 at 8:31 AM, Gábor Gévay <[hidden email]> wrote:

> Hello,
>
> > Would the best way be to extend the iteration operators to support
> > intermediate outputs or revisit the idea of caching intermediate results
> > and thus allow efficient for-loop iterations?
>
> Caching intermediate results would also be a big help to projects that
> target Flink as a backend, like Emma [1] and SystemML [2]. The issue
> here is that these languages allow writing more general iterations
> (general control flow such as nested loops and ifs in the loop body,
> multiple "solution sets", doing something else with the intermediate
> results, etc.) that can't be translated to Flink's iteration
> constructs. So these systems currently don't have much better options
> than just writing intermediate results to files, which is not so nice.
>
> Best,
> Gabor
>
> [1]
> http://www.user.tu-berlin.de/asteriosk/assets/publications/emma-sigmod2015.pdf
> [2] https://systemml.apache.org/

Re: Iteration Intermediate Output

Kostas Tzoumas
Thanks Greg for opening this discussion!

I really really don't want to derail the discussion here, just a quick
clarification regarding Suneel's last email: folks that are working at data
Artisans are participating in this community as individuals, not as a
corporation, and the dev list is not a support forum to "request" features
from some company, but an open forum for the Flink community. I would hope
that we keep the discussion technical (I know that I broke this with this
email, but really felt I had to clarify this).

I think all of us agree that this is a very useful feature, and I'm very
happy to see more work on this!

Kostas


On Mon, May 30, 2016 at 2:49 PM, Suneel Marthi <[hidden email]> wrote:

> The Mahout project requested this feature a few months ago, for the very
> same reasons mentioned in the previous emails on this thread, but we were
> snubbed by the Flink folks, who called it a '*WAY too specific*' request
> for Flink to deal with and said 'it's got to be done the way Flink has it',
> etc.
>
> While delta iterations are really cool, it's not trivial to use them from
> language-specific DSLs that handle more general iterations.  It's good
> to see that this limitation has started to bite others, and hopefully Data
> Artisans now sees this as a much-needed feature.

Re: Iteration Intermediate Output

Andrew Palumbo
In reply to this post by Greg Hogan
Greg,

We ran into this issue when implementing the Mahout bindings for Flink [1].  It ended up being the major bottleneck for Mahout on Flink, and it makes iterative algorithms basically unreasonable.  While it is understood that Flink's delta iterations are intended for iterating over a Flink DataSet, they are not always suited to the task.  E.g., FlinkML's ALS.scala [2][3] must flush intermediate results to the file system.

In-memory caching is a must-have for any type of declarative ML DSL riding on top of Flink, e.g. SystemML, Mahout, etc., or an internal Flink DSL.  The Mahout bindings are linear-algebra based.  In close collaboration with the Flink community [4], and based on contributions from an intern hired by Data Artisans to complete this task and from other members of the community, we recently finished work on the Mahout Distributed Linear Algebra bindings.

Unfortunately, the lack of an in-memory cache made the outcome of this effort very sub-optimal.  After the unfinished hand-off of the bindings from the Flink community to the Mahout community, we were forced to find a workaround for the caching.  We followed the pattern of executing the plan and flushing the results to the file system [5], and had to release the bindings as "Experimental".

Anyone building a declarative ML interface with the Flink DataSet API as a backend will run into similar issues, as reported in the Stack Overflow thread you refer to, and it would be great to have this feature.


It's great to see this being talked about.

Andy

[1] http://mahout.apache.org/users/flinkbindings/flink-internals.html
[2] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/recommendation/ALS.scala#L481
[3] https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/common/FlinkMLTools.scala#L84
[4] https://issues.apache.org/jira/browse/MAHOUT-1570
[5] https://github.com/apache/mahout/blob/master/flink/src/main/scala/org/apache/mahout/flinkbindings/drm/CheckpointedFlinkDrm.scala#L139
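
The flush-to-file workaround described above can be sketched in plain Python. This is not the Mahout or FlinkML code behind links [3] and [5]: `checkpoint`, `iterate_with_checkpoints`, and the JSON file layout are hypothetical choices made only to show the shape of the pattern, which is to materialize each round's result on disk and read it back before the next round starts.

```python
import json
import os
import tempfile


def checkpoint(rows, directory, name):
    """Write rows to a file, then read them back as the next round's input."""
    path = os.path.join(directory, name + ".json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iterate_with_checkpoints(rows, step, iterations):
    """Run `step` repeatedly, forcing a disk round trip after every round."""
    files_written = 0
    with tempfile.TemporaryDirectory() as directory:
        for i in range(iterations):
            rows = checkpoint(step(rows), directory, f"iter-{i}")
            files_written += 1
    return rows, files_written


def add_one(rows):
    """Toy step function standing in for one round of an ML algorithm."""
    return [x + 1 for x in rows]


result, files_written = iterate_with_checkpoints([1, 2, 3], add_one, 3)
print(result, files_written)
```

Every round pays for a full serialize, write, read, and deserialize cycle, which is the cost an in-memory cache would avoid.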
