(DEPRECATED) Apache Flink Mailing List archive.

Re: iterative processing

Posted by Sebastian Schelter-2 on Jun 17, 2014; 5:05pm
URL: http://deprecated-apache-flink-mailing-list-archive.368.s1.nabble.com/iterative-processing-tp334p346.html

The internal state of pagerank is just the rank vector which would be 100k
times 1.

--sebastian
Am 17.06.2014 17:56 schrieb "Yingjun Wu" <[hidden email]>:

> Hi Stephan,
>
> Thanks for your quick reply.
>
> Let's just consider a simple iterative processing algorithm, say, pagerank.
> If we wanna compute the pagerank value for 100k webpages, then the internal
> state, if represent the graph as matrix, should be at least 100k * 100k * 8
> bytes=74GB, which obviously exceeds the memory size of a single commodity
> machine. GraphLab does not suffer from this problem as it stores the graph
> in distributed memory. So in this case, do you think it is necessary to
> distributed the internal state to multiple machines?
>
> Regards,
> Yingjun
>
>
> On Tue, Jun 17, 2014 at 9:27 PM, Stephan Ewen <[hidden email]> wrote:
>
> > Hi Yingjun!
> >
> > Thanks for pointing this out. I would like to learn a bit more about the
> > problem you have, so let me ask you a few questions to make sure I
> > understand the matter in more detail...
> >
> > I assume you are thinking abut programs that run a single reduce function
> > in the end, "all reduce" , rather than a "by-key-reduce"? And the size of
> > the data in that "all-reduce" is too large for the main memory?
> >
> > In many cases, you can do a "combine" step (like a pre-reduce) before the
> > final reduce function. Because there are many parallel instances of the
> > "combine", that one has much more memory available, and is often able to
> > shrink the size of the data such that the final reduce function works on
> a
> > data set small enough to fit in memory. In flink, you get that "combine"
> > step automatically when you use the ReduceFunction, or when you annotate
> a
> > GroupReduceFunction with "@Combinable".
> >
> > Sometimes, such a "combine" behavior cannot be applied to a reduce
> > function, when an operation cannot be executed on a sub-part of the data.
> > The case you are mentioning is such a case?
> >
> > We have not looked into maintaining a distributed state that can be
> > accessed across machines. Sometimes that may be helpful, I agree. It
> would
> > mean that some data accesses have quite a latency, if they cross the
> > network, but that may be acceptable for certain applications. How would
> you
> > imagine such a feature to look like? Could you give us an example of what
> > you would design it like?
> >
> > Greetings,
> > Stephan
> >
>