[jira] [Created] (FLINK-14439) RestartPipelinedRegionStrategy leverage tracked partition availability for better failover experience in DefaultScheduler


Shang Yuanchun (Jira)
Zhu Zhu created FLINK-14439:
-------------------------------

             Summary: RestartPipelinedRegionStrategy leverage tracked partition availability for better failover experience in DefaultScheduler
                 Key: FLINK-14439
                 URL: https://issues.apache.org/jira/browse/FLINK-14439
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0


In the current region failover with DefaultScheduler, the states of most input result partitions are unknown. Even when the failure cause is a PartitionException, only one unhealthy partition can be identified per failure.

This may lead to multiple unsuccessful failovers before all the unhealthy but needed partitions are identified and their producers are included in the failover as well. (An unsuccessful failover here means that the recovered tasks fail again soon afterwards because some of their input partitions are missing.)

Using the partition states tracked on the JM side to let the region failover identify unhealthy (missing) partitions earlier can help with this case.
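
To illustrate the idea, here is a minimal sketch of a region failover computation that consults tracked partition availability. It is not Flink's implementation: the class RegionFailoverSketch, the Region model, the producerOf map, and the isAvailable predicate are all hypothetical names made up for this example, and regions and partitions are reduced to plain strings.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative sketch only: none of these types are Flink classes.
final class RegionFailoverSketch {

    /** Hypothetical region model: an id plus the blocking partitions it reads. */
    static final class Region {
        final String id;
        final List<String> consumedPartitions;

        Region(String id, List<String> consumedPartitions) {
            this.id = id;
            this.consumedPartitions = consumedPartitions;
        }
    }

    /**
     * Returns the regions to restart when failedRegion fails.
     * producerOf maps a partition id to the id of the region producing it;
     * isAvailable stands in for the JM-side tracked partition state (assumption).
     */
    static Set<String> regionsToRestart(
            String failedRegion,
            Map<String, Region> regions,
            Map<String, String> producerOf,
            Predicate<String> isAvailable) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        toRestart.add(failedRegion);
        queue.add(failedRegion);

        while (!queue.isEmpty()) {
            Region current = regions.get(queue.poll());

            // Upstream: restart producers of consumed partitions known to be missing.
            for (String partition : current.consumedPartitions) {
                if (!isAvailable.test(partition)) {
                    String producer = producerOf.get(partition);
                    if (toRestart.add(producer)) {
                        queue.add(producer);
                    }
                }
            }

            // Downstream: restart consumers of partitions the current region produces.
            for (Region other : regions.values()) {
                for (String partition : other.consumedPartitions) {
                    if (current.id.equals(producerOf.get(partition))
                            && toRestart.add(other.id)) {
                        queue.add(other.id);
                    }
                }
            }
        }
        return toRestart;
    }

    /** Example topology: A produces p1, B produces p2, C consumes both; C fails. */
    static Set<String> demo(Set<String> missingPartitions) {
        Map<String, Region> regions = new HashMap<>();
        regions.put("A", new Region("A", Collections.<String>emptyList()));
        regions.put("B", new Region("B", Collections.<String>emptyList()));
        regions.put("C", new Region("C", Arrays.asList("p1", "p2")));

        Map<String, String> producerOf = new HashMap<>();
        producerOf.put("p1", "A");
        producerOf.put("p2", "B");

        return regionsToRestart(
                "C", regions, producerOf, p -> !missingPartitions.contains(p));
    }
}
```

In this toy topology, if both p1 and p2 are lost but no partition is known to be missing, only C is restarted and it fails again on its inputs. If each failure reveals just one missing partition, it takes further rounds before A and B are both included. With both partitions reported missing up front, A, B and C are restarted together in a single failover.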

This issue also causes BatchFineGrainedRecoveryITCase to fail because of unexpected failover counts: the legacy scheduler already has a similar optimization, added in FLINK-13055.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)