(DEPRECATED) Apache Flink Mailing List archive.

Re: iterative processing

Posted by Stephan Ewen on Jun 17, 2014; 1:27pm
URL: http://deprecated-apache-flink-mailing-list-archive.368.s1.nabble.com/iterative-processing-tp334p335.html

Hi Yingjun!

Thanks for pointing this out. I would like to learn a bit more about the
problem you have, so let me ask you a few questions to make sure I
understand the matter in more detail...

I assume you are thinking about programs that run a single reduce function
at the end, an "all-reduce" rather than a "by-key reduce"? And the data in
that "all-reduce" is too large to fit in main memory?

In many cases, you can do a "combine" step (like a pre-reduce) before the
final reduce function. Because there are many parallel instances of the
"combine", that step as a whole has much more memory available, and it is
often able to shrink the data enough that the final reduce function works
on a data set small enough to fit in memory. In Flink, you get that
"combine" step automatically when you use the ReduceFunction, or when you
annotate a GroupReduceFunction with "@Combinable".
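
To make the idea concrete, here is a minimal sketch in plain Java (it does
not use the Flink API, and the class and method names are made up for
illustration). It assumes the reduce is a simple sum: each partition is
pre-reduced to one partial value, and the final reduce only sees one value
per partition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CombineSketch {

    // "Combine": pre-reduce one partition locally to a single partial sum.
    static long combine(List<Long> partition) {
        long partial = 0L;
        for (long value : partition) {
            partial += value;
        }
        return partial;
    }

    // Final "all-reduce": works on one partial value per partition only.
    static long finalReduce(List<Long> partials) {
        long total = 0L;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Runs the combine step on all partitions in parallel, then reduces
    // the (much smaller) list of partial results in a single place.
    static long allReduce(List<List<Long>> partitions) {
        List<Long> partials = partitions.parallelStream()
                .map(CombineSketch::combine)
                .collect(Collectors.toList());
        return finalReduce(partials);
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(1L, 2L, 3L),
                Arrays.asList(4L, 5L),
                Arrays.asList(6L));
        System.out.println(allReduce(partitions)); // prints 21
    }
}
```

This only works because a sum can be computed from partial sums; the final
step never has to hold the full data set.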

Sometimes, such a "combine" behavior cannot be applied to a reduce
function, because the operation cannot be executed on a sub-part of the
data. Is the case you are mentioning such a case?
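
As an illustration of what I mean (my own example, not necessarily your
case): a median is such an operation. The sketch below is plain Java with
made-up names, not Flink code, and assumes odd-sized lists so that the
median is simply the middle element. It shows that taking the median of
per-partition medians gives a different answer than the median of all the
data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MedianSketch {

    // Middle element of an odd-sized list; sorts a copy, input is untouched.
    static int median(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }

    // Naive "combine": median per partition, then median of those medians.
    static int medianOfPartialMedians(List<List<Integer>> partitions) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            partials.add(median(partition));
        }
        return median(partials);
    }

    // Correct result: needs all values in one place.
    static int trueMedian(List<List<Integer>> partitions) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            all.addAll(partition);
        }
        return median(all);
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 100),
                Arrays.asList(3, 4, 5),
                Arrays.asList(6, 7, 8));
        System.out.println(medianOfPartialMedians(partitions)); // prints 4
        System.out.println(trueMedian(partitions));             // prints 5
    }
}
```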

We have not looked into maintaining distributed state that can be
accessed across machines. Sometimes that may be helpful, I agree. It would
mean that some data accesses have considerable latency when they cross the
network, but that may be acceptable for certain applications. What would
you imagine such a feature looking like? Could you give us an example of
how you would design it?

Greetings,
Stephan