Question about limitations of iterative algorithms


Question about limitations of iterative algorithms

André Petermann
Sorry for the redundant post ...

Hi all,

We are working on an implementation of Frequent Subgraph Mining using
Flink. Here, the "too few memory segments" error prevents the most
promising solution. The problem is not specific to graphs, but applies
to all iterative problems where

- working and solution sets contain data of different types
- the working set may grow, shrink, or be replaced in each iteration
- the solution set grows in each iteration
- the termination criterion is based on data set metrics,
   e.g. "while the working set is not empty"

An illustration of our problem workflow, generalized to graph-unspecific
frequent pattern mining, can be found here:
https://github.com/dbs-leipzig/gradoop/blob/master/dev-support/loopWithIntermediateResultMemory.pdf

Page 1 shows the most promising solution. We started implementing it
using a for loop. However, the "too few memory segments" error makes it
untestable. As the iteration body itself is a complex workflow and the
number of iterations is arbitrary, unrolling the loop while reserving
operator memory for every pass will remain a permanent limitation.
Increasing the limits or the physical memory would only delay the
problem.
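
To make the pattern concrete, here is a stripped-down, self-contained
sketch of such an unrolled loop (toy data and a made-up candidate-growing
step, not our actual mining code). Every pass appends new operators to
the plan, and memory-consuming ones like the groupBy/sum each reserve
their own share of managed memory:

  import org.apache.flink.api.common.functions.FilterFunction;
  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import org.apache.flink.api.java.tuple.Tuple2;

  public class UnrolledForLoop {

    public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // toy working set: (pattern, support count) candidates
      DataSet<Tuple2<String, Integer>> workingSet = env.fromElements(
          Tuple2.of("a", 1), Tuple2.of("a", 1), Tuple2.of("b", 1));

      // empty but correctly typed solution set
      DataSet<String> solutionSet = env.fromElements("")
          .filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) {
              return false;
            }
          });

      for (int i = 0; i < 10; i++) {
        // count candidates and keep the "frequent" ones; the groupBy/sum
        // of every unrolled pass reserves its own operator memory
        DataSet<Tuple2<String, Integer>> frequent = workingSet
            .groupBy(0).sum(1)
            .filter(new FilterFunction<Tuple2<String, Integer>>() {
              @Override
              public boolean filter(Tuple2<String, Integer> t) {
                return t.f1 >= 2; // toy minimum support
              }
            });

        // the solution set grows in each iteration
        solutionSet = solutionSet.union(frequent
            .map(new MapFunction<Tuple2<String, Integer>, String>() {
              @Override
              public String map(Tuple2<String, Integer> t) {
                return t.f0;
              }
            }));

        // the working set is completely replaced by grown candidates
        workingSet = frequent
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
              @Override
              public Tuple2<String, Integer> map(Tuple2<String, Integer> t) {
                return Tuple2.of(t.f0 + "a", 1);
              }
            });
      }

      solutionSet.print();
    }
  }

The resulting question is: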

Would it be possible to implement a while-not-empty loop, or at least a
for loop, that isn't unrolled and can be executed more
memory-efficiently?

Page 2 shows an alternative solution to our problem using the concept of
delta iteration. However, Flink's delta iteration supports neither
broadcasting nor working-set-independent intermediate results. Page 3
shows our working solution, which uses two workarounds for those
restrictions. However, these workarounds lead to unnecessary memory
consumption and redundant expensive computations. So, in case the answer
to the first question is no, a second question:

Would it be possible to extend the delta iteration to support rich map
functions with broadcast sets and to keep working-set-independent
intermediate results across iterations?

We think that a while-not-empty loop might be useful for other
algorithms too, e.g. variable-length path search in graphs. Did we miss
any Flink features that meet our requirements? Do you think it's worth
creating an improvement issue? If so, we'd of course be willing to
contribute in the form of development.

Best Regards
Andre

--
-------------------------------------------
PhD Student
University of Leipzig
Department of Computer Science
Database Research Group

email: [hidden email]
web:   dbs.uni-leipzig.de
-------------------------------------------

Re: Question about limitations of iterative algorithms

Fabian Hueske-2
Hi Andre,

Thanks for reaching out to the Flink community!

I am not sure your analysis is based on correct assumptions about
Flink's delta iterations. Flink's delta iterations do support
- working and solution sets of different types
- worksets that grow and shrink or are completely replaced in each
iteration
- solution sets that grow (up to the point of the "too few memory
segments" error)
- termination via a custom aggregator, so the choices are a maximum
number of iterations, an empty workset, or a custom termination
criterion (see the sketch below)
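
As a hedged sketch of that last point (assuming a delta iteration object
called iteration; the aggregator classes live in
org.apache.flink.api.common.aggregators, LongValue in
org.apache.flink.types):

  // register an aggregator together with a convergence check; the check
  // runs at every superstep boundary and ends the iteration once it
  // returns true
  iteration.getAggregators().registerAggregationConvergenceCriterion(
      "updates", new LongSumAggregator(),
      new ConvergenceCriterion<LongValue>() {
        @Override
        public boolean isConverged(int superstep, LongValue count) {
          return count.getValue() == 0; // e.g. no more updates happened
        }
      });
  // a rich function inside the iteration body contributes to the
  // aggregate, e.g.:
  //   LongSumAggregator agg = getIterationRuntimeContext()
  //       .getIterationAggregator("updates");
  //   agg.aggregate(1L);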

The "too few memory segments" problem occurs because the solution set is
held in an in-memory hash table.
Spilling that table would result in a significant iteration slowdown.
One solution is to throw more resources at the problem (more nodes, more
RAM). Another workaround is to reduce the amount of managed memory and
switch the solution set hash table to "unmanaged". This keeps the
solution set in a regular Java HashMap, but it might cause an
OutOfMemoryError and kill the JVM if the set grows too large.
AFAIK, there is no way to shrink the solution set at the moment. It
might be worth investigating in that direction.
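
For the second workaround, a hedged sketch (initialSolutionSet,
initialWorkset, and maxIterations are placeholders, not a complete
program):

  // 1) reduce the managed memory share, e.g. in flink-conf.yaml:
  //      taskmanager.memory.fraction: 0.5
  // 2) take the solution set hash table out of managed memory:
  DeltaIteration<Tuple2<Long, String>, Tuple2<Long, String>> iteration =
      initialSolutionSet.iterateDelta(initialWorkset, maxIterations, 0);
  iteration.setSolutionSetUnManaged(true); // kept in a Java HashMap;
                                           // may OOM if it grows too large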

Regarding your question about the while-not-empty loop: what exactly is
the difference from the default delta iteration? It does stop when the
workset is empty (in addition to the custom termination criterion).
I am not sure I got all your points, so please let me know if I got
something wrong.

Thanks,
Fabian


Re: Question about limitations of iterative algorithms

André Petermann
Hi Fabian,

Thanks for your fast reply!

I created a gist to explain the while-not-empty loop in more detail:
https://gist.github.com/p3et/9f6e56cf0b68213e3e2b

It is an attempt at a minimal example of the kind of algorithm
corresponding to page 1 of the PDF, in this case frequent substrings.
Please don't mind the algorithm itself; the actual algorithm is much
more complex. However, even this simple case runs for only 7 iterations
on my machine before "too few memory segments" is raised.

If I understood delta iteration correctly, it is necessary to provide a
1:1 mapping between working and solution set items. The empty case you
referred to is the no-more-updates case, not the no-more-items case,
right?

In my example, the working set is completely replaced in each iteration
(line 52), with only a parent-child mapping. The solution set is
initially empty (line 33) and stores the results of all iterations (line
48). I hope this shows the difference between the delta iteration and
while-not-empty. Furthermore, you can see the different data types of
the working and solution sets.

I will provide a further gist for the "second choice" solution using
delta iteration. However, what we would actually prefer is to replace
the for loop by something like while(embeddings.isNotEmpty()), including
a truly iterative execution.

But please let me know if I missed Flink features that already enable
such loops.

Thanks,
Andre


Re: Question about limitations of iterative algorithms

Vasiliki Kalavri
Hi Andre,

I think it's possible to do what you want with a delta iteration.

The solution set and workset *don't* need to be of the same type. When
you define the delta iteration, e.g. DeltaIteration<ST, WT> iteration =
..., ST is the type of the solution set and WT is the type of the
workset.

The default convergence criterion for a delta iteration is an empty
workset. Thus, if you set embeddings as the workset, you have your
"while(embeddings.isNotEmpty())" logic. Also, as far as I know, there is
no problem with appending new elements to the solution set. So, using
allFrequentPatterns as the solution set should be fine.
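
To make this concrete, here is a minimal self-contained sketch of that
pattern (toy data, toy types, and a made-up "grow" step, not your actual
mining logic): ST != WT, the solution set starts empty, the workset is
completely replaced in every superstep, and the iteration ends as soon
as the workset runs empty.

  import org.apache.flink.api.common.functions.FilterFunction;
  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;
  import org.apache.flink.api.java.operators.DeltaIteration;
  import org.apache.flink.api.java.tuple.Tuple1;
  import org.apache.flink.api.java.tuple.Tuple2;

  public class WhileNotEmptyExample {

    public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // workset WT: (pattern, length) candidates, e.g. your embeddings
      DataSet<Tuple2<String, Integer>> initialWorkset =
          env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 1));

      // solution set ST: Tuple1<String> patterns, initially empty
      DataSet<Tuple1<String>> initialSolutionSet = env
          .fromElements(Tuple1.of(""))
          .filter(new FilterFunction<Tuple1<String>>() {
            @Override
            public boolean filter(Tuple1<String> t) {
              return false; // trick to get an empty, correctly typed set
            }
          });

      // key position 0 refers to the pattern field of the solution set;
      // 100 is just an upper bound, the loop usually ends much earlier
      DeltaIteration<Tuple1<String>, Tuple2<String, Integer>> iteration =
          initialSolutionSet.iterateDelta(initialWorkset, 100, 0);

      // solution set delta: append all patterns of the current workset
      DataSet<Tuple1<String>> delta = iteration.getWorkset()
          .map(new MapFunction<Tuple2<String, Integer>, Tuple1<String>>() {
            @Override
            public Tuple1<String> map(Tuple2<String, Integer> t) {
              return Tuple1.of(t.f0);
            }
          });

      // next workset: completely replaced by grown candidates; the toy
      // filter empties it after five rounds, which stops the iteration
      DataSet<Tuple2<String, Integer>> nextWorkset = iteration.getWorkset()
          .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(Tuple2<String, Integer> t) {
              return Tuple2.of(t.f0 + "a", t.f1 + 1);
            }
          })
          .filter(new FilterFunction<Tuple2<String, Integer>>() {
            @Override
            public boolean filter(Tuple2<String, Integer> t) {
              return t.f1 <= 5;
            }
          });

      // an empty nextWorkset terminates -> while-not-empty semantics
      DataSet<Tuple1<String>> result = iteration.closeWith(delta, nextWorkset);
      result.print();
    }
  }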



Does this clear things up a bit? Let me know if I misunderstood what you
want to do.

Cheers,
-Vasia.

Re: Question about limitations of iterative algorithms

André Petermann
Hi Vasia!

Many thanks for your reply! Your hints finally enabled me to implement
the problem's first-choice solution using delta iteration:
https://gist.github.com/p3et/12deb7d6321b48e9efab

Do you think this could be worth contributing as an example to the Flink
documentation? The examples I found so far could not enlighten me on how
to use delta iteration for this kind of loop (ST != WT, starting from an
empty solution set, ...).

Cheers,
Andre


Re: Question about limitations of iterative algorithms

Vasiliki Kalavri
Hi Andre,

I'm happy you were able to solve your problem :)

Improvements to the documentation are always welcome!
To me, ST != WT is straightforward from the Javadocs, but I guess it
wouldn't hurt to stress it in the docs.
Do you think you could simplify your implementation a bit to make for a
nice example? It might be a bit too complicated to follow as it is right
now.

In any case, if you would like to improve the delta iteration docs, please
go ahead and open a JIRA. We can discuss the details of what improvements
to make over there.

Thanks!
-Vasia.

