Yuan Mei created FLINK-18112:
--------------------------------
Summary: Single Task Failure Recovery Prototype
Key: FLINK-18112
URL: https://issues.apache.org/jira/browse/FLINK-18112
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing, Runtime / Coordination, Runtime / Network
Affects Versions: 1.12.0
Environment: Build a prototype of single task failure recovery to address and answer the following questions:
Step 1: Scheduling part, restart a single node without restarting the upstream or downstream nodes.
Step 2: Checkpointing part. Based on my understanding of how regional failover works, this part might not need modification.
Step 3: Network part
- how the recovered node is able to link to the upstream ResultPartitions and continue getting data
- how the downstream node is able to link to the recovered node and continue getting data
- how the different Netty transit modes affect the results
- what happens if the failed node's buffered data pool is full
Step 4: Failover process verification
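To make the goal of Step 1 concrete, here is a minimal sketch that contrasts region-style failover with the single-task behaviour the prototype is meant to explore. It is an illustration under stated assumptions, not Flink code: FailoverSketch, connect, tasksToRestartRegional and tasksToRestartSingle are hypothetical names, tasks are plain strings, and "region" is simplified to "everything reachable over pipelined connections".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names are Flink classes. */
final class FailoverSketch {
    // Undirected adjacency over pipelined connections: task -> directly connected tasks.
    private final Map<String, Set<String>> neighbors = new HashMap<>();

    /** Records a pipelined connection between an upstream and a downstream task. */
    void connect(String upstream, String downstream) {
        neighbors.computeIfAbsent(upstream, k -> new HashSet<>()).add(downstream);
        neighbors.computeIfAbsent(downstream, k -> new HashSet<>()).add(upstream);
    }

    /** Region-style failover (simplified): restart every task reachable from the failed one. */
    Set<String> tasksToRestartRegional(String failed) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(failed);
        while (!queue.isEmpty()) {
            String task = queue.poll();
            if (seen.add(task)) {
                // First visit: enqueue all directly connected tasks.
                queue.addAll(neighbors.getOrDefault(task, Set.of()));
            }
        }
        return seen;
    }

    /** Single-task failover (the goal of Step 1): restart only the failed task. */
    Set<String> tasksToRestartSingle(String failed) {
        return new TreeSet<>(Set.of(failed));
    }
}
```

For a chain source -> map -> sink in which map fails, the regional method returns all three tasks, while the single-task method returns only map. The upstream and downstream tasks then keep running, which is why Step 3 has to answer how their network connections are re-established.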
Reporter: Yuan Mei
Fix For: 1.12.0