(DEPRECATED) Apache Flink Mailing List archive.

Multiple control flows in a program

Classic

List

Threaded

6 messages Options

Sachin Goel

Multiple control flows in a program

I'm writing a utility to split a data set randomly into several parts and
return an Array of data sets. However, whenever I operate on any of
these *subsets,
*the program basically start from the original data set, and the split is
performed again.

To ensure that these subsets are mutually exclusive, we need to generate
the exact same sequence of random numbers, but also to ensure that the
elements arrive in a filter job in exactly the same order. How do I achieve
this?
Here's the code I've written: https://github.com/apache/flink/pull/921/files

Regards
Sachin

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

Till Rohrmann

Re: Multiple control flows in a program

At the moment, Flink does not support the calculation of intermediate
results from which you can continue your computation. When you execute jobs
which share parts of its job graph, then they are recomputed. When your job
contains operators with non-deterministic output, then there is no
guarantee that the shared job graph parts produce the same results.

What you can do is to execute the two jobs in parallel so that they share
the input of the non-deterministic operator. Alternatively, you can persist
the data set after your non-deterministic operator by writing it manually
to disc and reading it from there.

Cheers,
Till

On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <[hidden email]>
wrote:

> I'm writing a utility to split a data set randomly into several parts and
> return an Array of data sets. However, whenever I operate on any of
> these *subsets,
> *the program basically start from the original data set, and the split is
> performed again.
>
> To ensure that these subsets are mutually exclusive, we need to generate
> the exact same sequence of random numbers, but also to ensure that the
> elements arrive in a filter job in exactly the same order. How do I achieve
> this?
> Here's the code I've written:
> https://github.com/apache/flink/pull/921/files
>
> Regards
> Sachin
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>

Sachin Goel

Re: Multiple control flows in a program

Hi Till
Thanks for the reply.
If you think about it however, having several diverging computational paths
from an intermediate point will probably require re-computation anyway, in
case the number of these paths is even higher than the slots available.
Could that be an argument against a possible implementation?
Making the output of the non-deterministic step persistent seems costly
however. Is there any way to ensure that the data source is partitioned
across the different slots in exactly the same way every-time?
For example, I am using a {{generateSequence}} call, and the internal
iterator, namely the NumberSequenceIterator seems deterministic in its
operation, at least as far as how the elements are grouped together. But
surprisingly, I observed different splits now and then.

Regards
Sachin

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <[hidden email]>
wrote:

> At the moment, Flink does not support the calculation of intermediate
> results from which you can continue your computation. When you execute jobs
> which share parts of its job graph, then they are recomputed. When your job
> contains operators with non-deterministic output, then there is no
> guarantee that the shared job graph parts produce the same results.
>
> What you can do is to execute the two jobs in parallel so that they share
> the input of the non-deterministic operator. Alternatively, you can persist
> the data set after your non-deterministic operator by writing it manually
> to disc and reading it from there.
>
> Cheers,
> Till
>
> On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <[hidden email]>
> wrote:
>
> > I'm writing a utility to split a data set randomly into several parts and
> > return an Array of data sets. However, whenever I operate on any of
> > these *subsets,
> > *the program basically start from the original data set, and the split is
> > performed again.
> >
> > To ensure that these subsets are mutually exclusive, we need to generate
> > the exact same sequence of random numbers, but also to ensure that the
> > elements arrive in a filter job in exactly the same order. How do I
> achieve
> > this?
> > Here's the code I've written:
> > https://github.com/apache/flink/pull/921/files
> >
> > Regards
> > Sachin
> >
> > -- Sachin Goel
> > Computer Science, IIT Delhi
> > m. +91-9871457685
> >
>

till.rohrmann

Re: Multiple control flows in a program

One branch does not occupy a single slot. A slot is usually shared by
operators from multiple branches. Only subtasks of the same operator cannot
be placed into the same slot. Thus, it's not an argument against it.

Most if not all input formats assign the input splits on a first comes
first serve basis with precedence for local computations. That is the
reason for the non-deterministic behaviour you're observing. If one of the
TMs is slower than in a previous run, then it might happen that he gets
different splits assigned. I guess, though, that you can write your own
input format which assigns the input splits deterministically (e.g. based
on the subtask index). But then you will probably sacrifice some of the
local computations.

Cheers,
Till

On Wed, Aug 12, 2015 at 9:54 AM, Sachin Goel <[hidden email]>
wrote:

> Hi Till
> Thanks for the reply.
> If you think about it however, having several diverging computational paths
> from an intermediate point will probably require re-computation anyway, in
> case the number of these paths is even higher than the slots available.
> Could that be an argument against a possible implementation?
> Making the output of the non-deterministic step persistent seems costly
> however. Is there any way to ensure that the data source is partitioned
> across the different slots in exactly the same way every-time?
> For example, I am using a {{generateSequence}} call, and the internal
> iterator, namely the NumberSequenceIterator seems deterministic in its
> operation, at least as far as how the elements are grouped together. But
> surprisingly, I observed different splits now and then.
>
> Regards
> Sachin
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <[hidden email]>
> wrote:
>
> > At the moment, Flink does not support the calculation of intermediate
> > results from which you can continue your computation. When you execute
> jobs
> > which share parts of its job graph, then they are recomputed. When your
> job
> > contains operators with non-deterministic output, then there is no
> > guarantee that the shared job graph parts produce the same results.
> >
> > What you can do is to execute the two jobs in parallel so that they share
> > the input of the non-deterministic operator. Alternatively, you can
> persist
> > the data set after your non-deterministic operator by writing it manually
> > to disc and reading it from there.
> >
> > Cheers,
> > Till
> >
> > On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <[hidden email]>
> > wrote:
> >
> > > I'm writing a utility to split a data set randomly into several parts
> and
> > > return an Array of data sets. However, whenever I operate on any of
> > > these *subsets,
> > > *the program basically start from the original data set, and the split
> is
> > > performed again.
> > >
> > > To ensure that these subsets are mutually exclusive, we need to
> generate
> > > the exact same sequence of random numbers, but also to ensure that the
> > > elements arrive in a filter job in exactly the same order. How do I
> > achieve
> > > this?
> > > Here's the code I've written:
> > > https://github.com/apache/flink/pull/921/files
> > >
> > > Regards
> > > Sachin
> > >
> > > -- Sachin Goel
> > > Computer Science, IIT Delhi
> > > m. +91-9871457685
> > >
> >
>

Sachin Goel

Re: Multiple control flows in a program

Since the random splits need to be done on any data set a user provides, I
think making a persistent source would be the best solution then.

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Aug 12, 2015 at 1:37 PM, Till Rohrmann <[hidden email]>
wrote:

> One branch does not occupy a single slot. A slot is usually shared by
> operators from multiple branches. Only subtasks of the same operator cannot
> be placed into the same slot. Thus, it's not an argument against it.
>
> Most if not all input formats assign the input splits on a first comes
> first serve basis with precedence for local computations. That is the
> reason for the non-deterministic behaviour you're observing. If one of the
> TMs is slower than in a previous run, then it might happen that he gets
> different splits assigned. I guess, though, that you can write your own
> input format which assigns the input splits deterministically (e.g. based
> on the subtask index). But then you will probably sacrifice some of the
> local computations.
>
> Cheers,
> Till
>
> On Wed, Aug 12, 2015 at 9:54 AM, Sachin Goel <[hidden email]>
> wrote:
>
> > Hi Till
> > Thanks for the reply.
> > If you think about it however, having several diverging computational
> paths
> > from an intermediate point will probably require re-computation anyway,
> in
> > case the number of these paths is even higher than the slots available.
> > Could that be an argument against a possible implementation?
> > Making the output of the non-deterministic step persistent seems costly
> > however. Is there any way to ensure that the data source is partitioned
> > across the different slots in exactly the same way every-time?
> > For example, I am using a {{generateSequence}} call, and the internal
> > iterator, namely the NumberSequenceIterator seems deterministic in its
> > operation, at least as far as how the elements are grouped together. But
> > surprisingly, I observed different splits now and then.
> >
> > Regards
> > Sachin
> >
> > -- Sachin Goel
> > Computer Science, IIT Delhi
> > m. +91-9871457685
> >
> > On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > At the moment, Flink does not support the calculation of intermediate
> > > results from which you can continue your computation. When you execute
> > jobs
> > > which share parts of its job graph, then they are recomputed. When your
> > job
> > > contains operators with non-deterministic output, then there is no
> > > guarantee that the shared job graph parts produce the same results.
> > >
> > > What you can do is to execute the two jobs in parallel so that they
> share
> > > the input of the non-deterministic operator. Alternatively, you can
> > persist
> > > the data set after your non-deterministic operator by writing it
> manually
> > > to disc and reading it from there.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <[hidden email]
> >
> > > wrote:
> > >
> > > > I'm writing a utility to split a data set randomly into several parts
> > and
> > > > return an Array of data sets. However, whenever I operate on any of
> > > > these *subsets,
> > > > *the program basically start from the original data set, and the
> split
> > is
> > > > performed again.
> > > >
> > > > To ensure that these subsets are mutually exclusive, we need to
> > generate
> > > > the exact same sequence of random numbers, but also to ensure that
> the
> > > > elements arrive in a filter job in exactly the same order. How do I
> > > achieve
> > > > this?
> > > > Here's the code I've written:
> > > > https://github.com/apache/flink/pull/921/files
> > > >
> > > > Regards
> > > > Sachin
> > > >
> > > > -- Sachin Goel
> > > > Computer Science, IIT Delhi
> > > > m. +91-9871457685
> > > >
> > >
> >
>

Till Rohrmann

Re: Multiple control flows in a program

You can take a look at the ALS implementation. There I did something
similar.

On Wed, Aug 12, 2015 at 10:27 AM, Sachin Goel <[hidden email]>
wrote:

> Since the random splits need to be done on any data set a user provides, I
> think making a persistent source would be the best solution then.
>
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Aug 12, 2015 at 1:37 PM, Till Rohrmann <[hidden email]>
> wrote:
>
> > One branch does not occupy a single slot. A slot is usually shared by
> > operators from multiple branches. Only subtasks of the same operator
> cannot
> > be placed into the same slot. Thus, it's not an argument against it.
> >
> > Most if not all input formats assign the input splits on a first comes
> > first serve basis with precedence for local computations. That is the
> > reason for the non-deterministic behaviour you're observing. If one of
> the
> > TMs is slower than in a previous run, then it might happen that he gets
> > different splits assigned. I guess, though, that you can write your own
> > input format which assigns the input splits deterministically (e.g. based
> > on the subtask index). But then you will probably sacrifice some of the
> > local computations.
> >
> > Cheers,
> > Till
> >
> > On Wed, Aug 12, 2015 at 9:54 AM, Sachin Goel <[hidden email]>
> > wrote:
> >
> > > Hi Till
> > > Thanks for the reply.
> > > If you think about it however, having several diverging computational
> > paths
> > > from an intermediate point will probably require re-computation anyway,
> > in
> > > case the number of these paths is even higher than the slots available.
> > > Could that be an argument against a possible implementation?
> > > Making the output of the non-deterministic step persistent seems costly
> > > however. Is there any way to ensure that the data source is partitioned
> > > across the different slots in exactly the same way every-time?
> > > For example, I am using a {{generateSequence}} call, and the internal
> > > iterator, namely the NumberSequenceIterator seems deterministic in its
> > > operation, at least as far as how the elements are grouped together.
> But
> > > surprisingly, I observed different splits now and then.
> > >
> > > Regards
> > > Sachin
> > >
> > > -- Sachin Goel
> > > Computer Science, IIT Delhi
> > > m. +91-9871457685
> > >
> > > On Wed, Aug 12, 2015 at 12:58 PM, Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > At the moment, Flink does not support the calculation of intermediate
> > > > results from which you can continue your computation. When you
> execute
> > > jobs
> > > > which share parts of its job graph, then they are recomputed. When
> your
> > > job
> > > > contains operators with non-deterministic output, then there is no
> > > > guarantee that the shared job graph parts produce the same results.
> > > >
> > > > What you can do is to execute the two jobs in parallel so that they
> > share
> > > > the input of the non-deterministic operator. Alternatively, you can
> > > persist
> > > > the data set after your non-deterministic operator by writing it
> > manually
> > > > to disc and reading it from there.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Wed, Aug 12, 2015 at 1:34 AM, Sachin Goel <
> [hidden email]
> > >
> > > > wrote:
> > > >
> > > > > I'm writing a utility to split a data set randomly into several
> parts
> > > and
> > > > > return an Array of data sets. However, whenever I operate on any of
> > > > > these *subsets,
> > > > > *the program basically start from the original data set, and the
> > split
> > > is
> > > > > performed again.
> > > > >
> > > > > To ensure that these subsets are mutually exclusive, we need to
> > > generate
> > > > > the exact same sequence of random numbers, but also to ensure that
> > the
> > > > > elements arrive in a filter job in exactly the same order. How do I
> > > > achieve
> > > > > this?
> > > > > Here's the code I've written:
> > > > > https://github.com/apache/flink/pull/921/files
> > > > >
> > > > > Regards
> > > > > Sachin
> > > > >
> > > > > -- Sachin Goel
> > > > > Computer Science, IIT Delhi
> > > > > m. +91-9871457685
> > > > >
> > > >
> > >
> >
>