(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Backtracking for failover regions

Chesnay Schepler-3

[DISCUSS] Backtracking for failover regions

Hello everyone,

Till, Zhu Zhu and I have prepared a Design Document
<https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8>
for introducing backtracking for failover regions. This is an
optimization of the failure handling logic for jobs with blocking result
partitions (which primarily exist in batch jobs), where only part of the
job has to be restarted.
This is a continuation of the FLIP-1
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures>
efforts to introduce fine-grained recovery from task failures.
The associated JIRA can be found here
<https://issues.apache.org/jira/browse/FLINK-12068>.

Any feedback is highly appreciated.

Regards,
Chesnay

Till Rohrmann

Re: [DISCUSS] Backtracking for failover regions

Thanks for summarizing the current state of FLIP-1 and outlining the way to
move forward with it, Chesnay.

I think we should implement the first version of the backtracking logic
using the DataConsumptionException (FLINK-6227) to signal if an
intermediate result partition has been lost.
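
To illustrate the idea, here is a minimal sketch of how backtracking could
pick the regions to restart. It is only an illustration under assumptions:
the class `BacktrackingSketch`, the method `regionsToRestart` and the plain
string ids for regions and partitions are hypothetical and are not Flink
APIs. A partition reported as lost (for example via the exception mentioned
above) is assumed to have been removed from `available` before the call.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class BacktrackingSketch {

    /**
     * Returns the regions that have to be restarted when failedRegion fails.
     *
     * @param inputs    for each region, the blocking partitions it consumes
     * @param producer  for each partition, the region that produces it
     * @param available partitions whose data can still be read
     */
    static Set<String> regionsToRestart(
            String failedRegion,
            Map<String, List<String>> inputs,
            Map<String, String> producer,
            Set<String> available) {
        Set<String> restart = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(failedRegion);
        while (!pending.isEmpty()) {
            String region = pending.poll();
            if (!restart.add(region)) {
                continue; // already scheduled for restart
            }
            for (String partition : inputs.getOrDefault(region, List.of())) {
                // Backtrack only through lost partitions; an available
                // blocking partition can simply be read again.
                if (!available.contains(partition)) {
                    String upstream = producer.get(partition);
                    if (upstream != null) {
                        pending.add(upstream);
                    }
                }
            }
        }
        return restart;
    }
}
```

For a chain of regions A -> p1 -> B -> p2 -> C where C fails because p2 was
lost while p1 is still available, this sketch selects only C and B for
restart and leaves A untouched.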

Moreover, I think it would be best to base the new implementation on the
refined FailoverStrategy interface proposed by the scheduler refactorings
[1]. We could have an adaptor to make it work with the existing code for
testing purposes and until the scheduler interfaces have been introduced.

Apart from that, +1 for completing Flink's first improvement proposal :-)

[1]
https://docs.google.com/document/d/1fstkML72YBO1tGD_dmG2rwvd9bklhRVauh4FSsDDwXU/edit?usp=sharing

Cheers,
Till

On Sun, Apr 14, 2019 at 8:20 PM Chesnay Schepler <[hidden email]> wrote:

> Hello everyone,
>
> Till, Zhu Zhu and I have prepared a Design Document
> <https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8>
> for introducing backtracking for failover regions. This is an
> optimization of the failure handling logic for jobs with blocking result
> partitions (which primarily exist in batch jobs), where only part of the
> job has to be restarted.
> This is a continuation of the FLIP-1
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures>
> efforts to introduce fine-grained recovery from task failures.
> The associated JIRA can be found here
> <https://issues.apache.org/jira/browse/FLINK-12068>.
>
> Any feedback is highly appreciated.
>
> Regards,
> Chesnay
>

Zhu Zhu

Re: [DISCUSS] Backtracking for failover regions

Thanks to Chesnay for bringing up this proposal.
It's good news that we can soon have a usable fine-grained recovery for
batch jobs.
+1 for this proposal.

Regards,
Zhu

On Mon, Apr 15, 2019 at 5:57 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for summarizing the current state of FLIP-1 and outlining the way to
> move forward with it, Chesnay.
>
> I think we should implement the first version of the backtracking logic
> using the DataConsumptionException (FLINK-6227) to signal if an
> intermediate result partition has been lost.
>
> Moreover, I think it would be best to base the new implementation on the
> refined FailoverStrategy interface proposed by the scheduler refactorings
> [1]. We could have an adaptor to make it work with the existing code for
> testing purposes and until the scheduler interfaces have been introduced.
>
> Apart from that, +1 for completing Flink's first improvement proposal :-)
>
> [1]
> https://docs.google.com/document/d/1fstkML72YBO1tGD_dmG2rwvd9bklhRVauh4FSsDDwXU/edit?usp=sharing
>
> Cheers,
> Till