Hello everyone,

Till, Zhu Zhu and I have prepared a Design Document
<https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8>
for introducing backtracking for failover regions. This is an optimization of the failure handling logic for jobs with blocking result partitions (which primarily exist in batch jobs), where only part of the job has to be restarted.

This is a continuation of the FLIP-1
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures>
efforts to introduce fine-grained recovery from task failures.
The associated JIRA can be found here
<https://issues.apache.org/jira/browse/FLINK-12068>.

Any feedback is highly appreciated.

Regards,
Chesnay

Thanks for summarizing the current state of FLIP-1 and outlining the way to move forward with it, Chesnay.

I think we should implement the first version of the backtracking logic using the DataConsumptionException (FLINK-6227) to signal that an intermediate result partition has been lost.

Moreover, I think it would be best to base the new implementation on the refined FailoverStrategy interface proposed by the scheduler refactorings [1]. We could have an adaptor to make it work with the existing code for testing purposes and until the scheduler interfaces have been introduced.

Apart from that, +1 for completing Flink's first improvement proposal :-)

[1] https://docs.google.com/document/d/1fstkML72YBO1tGD_dmG2rwvd9bklhRVauh4FSsDDwXU/edit?usp=sharing

Cheers,
Till

On Sun, Apr 14, 2019 at 8:20 PM Chesnay Schepler <[hidden email]> wrote:
> Hello everyone,
> [...]

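To make the backtracking idea discussed above more concrete, here is a minimal sketch in Java. It is not taken from the design document or from Flink's code base: the class `BacktrackingSketch`, the method `regionsToRestart`, and the use of plain strings as region identifiers are all hypothetical. It assumes that each failover region knows which regions produce the blocking results it consumes, and that something (for example a consumer failing with a DataConsumptionException) has already marked lost results as unavailable.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; none of these names are Flink APIs. */
final class BacktrackingSketch {

    /**
     * Computes the regions that have to be restarted after a failure.
     *
     * @param failedRegion     region containing the failed task
     * @param inputs           maps a region to the regions that produce the
     *                         blocking results it consumes
     * @param availableResults regions whose blocking results can still be read
     */
    static Set<String> regionsToRestart(
            String failedRegion,
            Map<String, List<String>> inputs,
            Set<String> availableResults) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(failedRegion);
        while (!pending.isEmpty()) {
            String region = pending.pop();
            if (!toRestart.add(region)) {
                continue; // region was already scheduled for restart
            }
            // Backtrack only into producers whose results were lost;
            // producers with available results do not need to run again.
            for (String producer : inputs.getOrDefault(region, List.of())) {
                if (!availableResults.contains(producer)) {
                    pending.push(producer);
                }
            }
        }
        return toRestart;
    }
}
```

For a pipeline of three regions A, B and C connected by blocking results (A feeds B, B feeds C), a failure in C with B's result lost and A's result still available would restart only C and B under these assumptions.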
Thanks to Chesnay for bringing up this proposal.
It's good news that we can have an applicable fine-grained recovery for batch jobs soon.

+1 for this proposal.

Regards,
Zhu

On Mon, Apr 15, 2019 at 5:57 PM Till Rohrmann <[hidden email]> wrote:
> Thanks for summarizing the current state of FLIP-1 and outlining the way to
> move forward with it, Chesnay.
> [...]