(DEPRECATED) Apache Flink Mailing List archive.

Does DataSet job also use Barriers to ensure "exactly once."?

Classic

List

Threaded

10 messages Options

Ma GuoWei

Does DataSet job also use Barriers to ensure "exactly once."?

hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure "exactly once."Does the DataSet job use the same mechanism to ensue "exactly once" if a map task is failed?thanks

Kostas Tzoumas-2

Re: Does DataSet job also use Barriers to ensure "exactly once."?

No, it doesn't; periodic snapshots are not needed in DataSet programs, as
DataSets are of finite size and failed partitions can be replayed
completely.

On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:

> hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> "exactly once."Does the DataSet job use the same mechanism to ensue
> "exactly once" if a map task is failed?thanks
>

Márton Balassi

Re: Does DataSet job also use Barriers to ensure "exactly once."?

In reply to this post by Ma GuoWei

As Kostas mentioned the failure mechanisms for streaming and batch
processing are different, but you can expect exactly once processing
guarantees from both of them.

On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:

> hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> "exactly once."Does the DataSet job use the same mechanism to ensue
> "exactly once" if a map task is failed?thanks
>

Ma GuoWei

RE: Does DataSet job also use Barriers to ensure "exactly once."?

In reply to this post by Kostas Tzoumas-2

DataExchangeMode is Piped
If Two operators use Piped Mode to exchange the data , Failed partitions have already send some data to the receiver before it failed.So Does Replaying all the failed partitions cause some duplicate records ?

> Date: Thu, 9 Jul 2015 14:47:29 +0200
> Subject: Re: Does DataSet job also use Barriers to ensure "exactly once."?
> From: [hidden email]
> To: [hidden email]
>
> No, it doesn't; periodic snapshots are not needed in DataSet programs, as
> DataSets are of finite size and failed partitions can be replayed
> completely.
>
>
> On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
>
> > hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> > "exactly once."Does the DataSet job use the same mechanism to ensue
> > "exactly once" if a map task is failed?thanks
> >

Stephan Ewen

Re: Does DataSet job also use Barriers to ensure "exactly once."?

Currently, Flink restarts the entire job upon failure.

There is WIP that restricts this to all tasks involved in the pipeline of
the failed task.

Let's say we have pipelined MapReduce. If a mapper fails, the reducers that
have received some data already have to be restarted as well.

In that case, pipelined exchange works like "speculatively" starting the
reducers early. It helps when no failure occurs.
When a failure occurs, the reducers do still not start later than in a
batch exchange mode, where they are started only once the mappers are done
(and no failure can occur any more).

On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <[hidden email]> wrote:

> DataExchangeMode is Piped
> If Two operators use Piped Mode to exchange the data , Failed partitions
> have already send some data to the receiver before it failed.So Does
> Replaying all the failed partitions cause some duplicate records ?
>
>
> > Date: Thu, 9 Jul 2015 14:47:29 +0200
> > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> once."?
> > From: [hidden email]
> > To: [hidden email]
> >
> > No, it doesn't; periodic snapshots are not needed in DataSet programs, as
> > DataSets are of finite size and failed partitions can be replayed
> > completely.
> >
> >
> > On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
> >
> > > hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> > > "exactly once."Does the DataSet job use the same mechanism to ensue
> > > "exactly once" if a map task is failed?thanks
> > >
>
>

Ma GuoWei

RE: Does DataSet job also use Barriers to ensure "exactly once."?

DataSet<String> result = in.rebalance()
.map(new Mapper());In the case does the 'map' receive all the data then begin to worker?Will rebalance operator failed cause some duplicate record if the above answer is false ?

> Date: Thu, 9 Jul 2015 15:40:18 +0200
> Subject: Re: Does DataSet job also use Barriers to ensure "exactly once."?
> From: [hidden email]
> To: [hidden email]
>
> Currently, Flink restarts the entire job upon failure.
>
> There is WIP that restricts this to all tasks involved in the pipeline of
> the failed task.
>
> Let's say we have pipelined MapReduce. If a mapper fails, the reducers that
> have received some data already have to be restarted as well.
>
> In that case, pipelined exchange works like "speculatively" starting the
> reducers early. It helps when no failure occurs.
> When a failure occurs, the reducers do still not start later than in a
> batch exchange mode, where they are started only once the mappers are done
> (and no failure can occur any more).
>
>
> On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <[hidden email]> wrote:
>
> > DataExchangeMode is Piped
> > If Two operators use Piped Mode to exchange the data , Failed partitions
> > have already send some data to the receiver before it failed.So Does
> > Replaying all the failed partitions cause some duplicate records ?
> >
> >
> > > Date: Thu, 9 Jul 2015 14:47:29 +0200
> > > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> > once."?
> > > From: [hidden email]
> > > To: [hidden email]
> > >
> > > No, it doesn't; periodic snapshots are not needed in DataSet programs, as
> > > DataSets are of finite size and failed partitions can be replayed
> > > completely.
> > >
> > >
> > > On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
> > >
> > > > hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> > > > "exactly once."Does the DataSet job use the same mechanism to ensue
> > > > "exactly once" if a map task is failed?thanks
> > > >
> >
> >

mxm

Re: Does DataSet job also use Barriers to ensure "exactly once."?

In pipelined execution, the mapper will start once it receives data. In
batch-only execution, the mapper will start once it received all data. In
either case, there won't be any duplicate records. If an error occurs, the
entire job will be restarted.

As Stephan mentioned, we will soon have a per-task fault tolerance for
non-pipelined tasks. Eventually, we also want to support pipelined jobs
with persisted intermediate results.

On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <[hidden email]> wrote:

> DataSet<String> result = in.rebalance()
> .map(new Mapper());In the case does the 'map'
> receive all the data then begin to worker?Will rebalance operator failed
> cause some duplicate record if the above answer is false ?
> > Date: Thu, 9 Jul 2015 15:40:18 +0200
> > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> once."?
> > From: [hidden email]
> > To: [hidden email]
> >
> > Currently, Flink restarts the entire job upon failure.
> >
> > There is WIP that restricts this to all tasks involved in the pipeline of
> > the failed task.
> >
> > Let's say we have pipelined MapReduce. If a mapper fails, the reducers
> that
> > have received some data already have to be restarted as well.
> >
> > In that case, pipelined exchange works like "speculatively" starting the
> > reducers early. It helps when no failure occurs.
> > When a failure occurs, the reducers do still not start later than in a
> > batch exchange mode, where they are started only once the mappers are
> done
> > (and no failure can occur any more).
> >
> >
> > On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <[hidden email]> wrote:
> >
> > > DataExchangeMode is Piped
> > > If Two operators use Piped Mode to exchange the data , Failed
> partitions
> > > have already send some data to the receiver before it failed.So Does
> > > Replaying all the failed partitions cause some duplicate records ?
> > >
> > >
> > > > Date: Thu, 9 Jul 2015 14:47:29 +0200
> > > > Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> > > once."?
> > > > From: [hidden email]
> > > > To: [hidden email]
> > > >
> > > > No, it doesn't; periodic snapshots are not needed in DataSet
> programs, as
> > > > DataSets are of finite size and failed partitions can be replayed
> > > > completely.
> > > >
> > > >
> > > > On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
> > > >
> > > > > hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> > > > > "exactly once."Does the DataSet job use the same mechanism to ensue
> > > > > "exactly once" if a map task is failed?thanks
> > > > >
> > >
> > >
>
>

Stephan Ewen

Re: Does DataSet job also use Barriers to ensure "exactly once."?

In reply to this post by Ma GuoWei

Any operator in a batch job will receive all of its elements in one
complete successful run.

The mapper starts its work immediately. On a failure, a fresh mapper is
used, and all of the data is replayed. You can think of it as if there was
only a single checkpoint at the very beginning (before any data was sent)
that they fall back to. For mapper-internal state, there can be no
duplicates.

For the interaction with the outside world, there can always be duplicates,
for example if the mapper inserts data into a database. The database would
have data from the initial run (that failed or was canceled) and the
recovery run.

On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <[hidden email]> wrote:

Ma GuoWei

Re: Does DataSet job also use Barriers to ensure "exactly once."?

thank you！I see.
thank all of you.

发自我的 iPhone

> 在 2015年7月9日，下午10:48，Stephan Ewen <[hidden email]> 写道：
>
> Any operator in a batch job will receive all of its elements in one
> complete successful run.
>
> The mapper starts its work immediately. On a failure, a fresh mapper is
> used, and all of the data is replayed. You can think of it as if there was
> only a single checkpoint at the very beginning (before any data was sent)
> that they fall back to. For mapper-internal state, there can be no
> duplicates.
>
> For the interaction with the outside world, there can always be duplicates,
> for example if the mapper inserts data into a database. The database would
> have data from the initial run (that failed or was canceled) and the
> recovery run.
>
>
>
>
>> On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <[hidden email]> wrote:
>>
>> DataSet<String> result = in.rebalance()
>> .map(new Mapper());In the case does the 'map'
>> receive all the data then begin to worker?Will rebalance operator failed
>> cause some duplicate record if the above answer is false ?
>>> Date: Thu, 9 Jul 2015 15:40:18 +0200
>>> Subject: Re: Does DataSet job also use Barriers to ensure "exactly
>> once."?
>>> From: [hidden email]
>>> To: [hidden email]
>>>
>>> Currently, Flink restarts the entire job upon failure.
>>>
>>> There is WIP that restricts this to all tasks involved in the pipeline of
>>> the failed task.
>>>
>>> Let's say we have pipelined MapReduce. If a mapper fails, the reducers
>> that
>>> have received some data already have to be restarted as well.
>>>
>>> In that case, pipelined exchange works like "speculatively" starting the
>>> reducers early. It helps when no failure occurs.
>>> When a failure occurs, the reducers do still not start later than in a
>>> batch exchange mode, where they are started only once the mappers are
>> done
>>> (and no failure can occur any more).
>>>
>>>
>>>> On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <[hidden email]> wrote:
>>>>
>>>> DataExchangeMode is Piped
>>>> If Two operators use Piped Mode to exchange the data , Failed
>> partitions
>>>> have already send some data to the receiver before it failed.So Does
>>>> Replaying all the failed partitions cause some duplicate records ?
>>>>
>>>>
>>>>> Date: Thu, 9 Jul 2015 14:47:29 +0200
>>>>> Subject: Re: Does DataSet job also use Barriers to ensure "exactly
>>>> once."?
>>>>> From: [hidden email]
>>>>> To: [hidden email]
>>>>>
>>>>> No, it doesn't; periodic snapshots are not needed in DataSet
>> programs, as
>>>>> DataSets are of finite size and failed partitions can be replayed
>>>>> completely.
>>>>>
>>>>>
>>>>>> On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
>>>>>>
>>>>>> hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
>>>>>> "exactly once."Does the DataSet job use the same mechanism to ensue
>>>>>> "exactly once" if a map task is failed?thanks
>>
>>

Stephan Ewen

Re: Does DataSet job also use Barriers to ensure "exactly once."?

BTW: Duplicates on interaction with the outside world cannot be avoided in
the general case. If any program (batch or streaming) inserts data into
some outside system (for example a database), then that outside system
needs to cooperate to prevent duplicates.

- Either, the system could eliminate duplicated based on a key

- Or, the system must be part of the checkpoint transaction. We are
working on some extension to make this very easy for outside systems that
support transactions.

Keep following the Flink announcements, if you want updates on this topic!

On Thu, Jul 9, 2015 at 4:50 PM, 马国维 <[hidden email]> wrote:

> thank you！I see.
> thank all of you.
>
> 发自我的 iPhone
>
> > 在 2015年7月9日，下午10:48，Stephan Ewen <[hidden email]> 写道：
> >
> > Any operator in a batch job will receive all of its elements in one
> > complete successful run.
> >
> > The mapper starts its work immediately. On a failure, a fresh mapper is
> > used, and all of the data is replayed. You can think of it as if there
> was
> > only a single checkpoint at the very beginning (before any data was sent)
> > that they fall back to. For mapper-internal state, there can be no
> > duplicates.
> >
> > For the interaction with the outside world, there can always be
> duplicates,
> > for example if the mapper inserts data into a database. The database
> would
> > have data from the initial run (that failed or was canceled) and the
> > recovery run.
> >
> >
> >
> >
> >> On Thu, Jul 9, 2015 at 4:13 PM, 马国维 <[hidden email]> wrote:
> >>
> >> DataSet<String> result = in.rebalance()
> >> .map(new Mapper());In the case does the 'map'
> >> receive all the data then begin to worker?Will rebalance operator failed
> >> cause some duplicate record if the above answer is false ?
> >>> Date: Thu, 9 Jul 2015 15:40:18 +0200
> >>> Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> >> once."?
> >>> From: [hidden email]
> >>> To: [hidden email]
> >>>
> >>> Currently, Flink restarts the entire job upon failure.
> >>>
> >>> There is WIP that restricts this to all tasks involved in the pipeline
> of
> >>> the failed task.
> >>>
> >>> Let's say we have pipelined MapReduce. If a mapper fails, the reducers
> >> that
> >>> have received some data already have to be restarted as well.
> >>>
> >>> In that case, pipelined exchange works like "speculatively" starting
> the
> >>> reducers early. It helps when no failure occurs.
> >>> When a failure occurs, the reducers do still not start later than in a
> >>> batch exchange mode, where they are started only once the mappers are
> >> done
> >>> (and no failure can occur any more).
> >>>
> >>>
> >>>> On Thu, Jul 9, 2015 at 3:34 PM, 马国维 <[hidden email]> wrote:
> >>>>
> >>>> DataExchangeMode is Piped
> >>>> If Two operators use Piped Mode to exchange the data , Failed
> >> partitions
> >>>> have already send some data to the receiver before it failed.So Does
> >>>> Replaying all the failed partitions cause some duplicate records ?
> >>>>
> >>>>
> >>>>> Date: Thu, 9 Jul 2015 14:47:29 +0200
> >>>>> Subject: Re: Does DataSet job also use Barriers to ensure "exactly
> >>>> once."?
> >>>>> From: [hidden email]
> >>>>> To: [hidden email]
> >>>>>
> >>>>> No, it doesn't; periodic snapshots are not needed in DataSet
> >> programs, as
> >>>>> DataSets are of finite size and failed partitions can be replayed
> >>>>> completely.
> >>>>>
> >>>>>
> >>>>>> On Thu, Jul 9, 2015 at 2:43 PM, 马国维 <[hidden email]> wrote:
> >>>>>>
> >>>>>> hi, everyoneThe doc say Flink Streaming use "Barriers" to ensure
> >>>>>> "exactly once."Does the DataSet job use the same mechanism to ensue
> >>>>>> "exactly once" if a map task is failed?thanks
> >>
> >>
>