[jira] [Created] (FLINK-23041) Change local alignment timeout back to the global time out

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-23041) Change local alignment timeout back to the global time out

Shang Yuanchun (Jira)
Anton Kalashnikov created FLINK-23041:
-----------------------------------------

             Summary: Change local alignment timeout back to the global time out
                 Key: FLINK-23041
                 URL: https://issues.apache.org/jira/browse/FLINK-23041
             Project: Flink
          Issue Type: Bug
            Reporter: Anton Kalashnikov
            Assignee: Anton Kalashnikov


Local alignment timeouts are very confusing and especially without timeout on the outputs, they can significantly delay timeouting to UC.

Problematic case is when all CBs are received with long delay because of the back pressure, but they arrive at the same time. Alignment time can be low (milliseconds), while start delay is ~1 minute. In that case checkpoint doesn't timeout to UC and is passing the responsibility to timeout down the stream.

 

So it is not so transparant for the user why and when AC switches to UC. As mentioned before, the start delay is not correlated with the alignment timeout because it doesn't take into account time in output buffer. the alignment time is not fully correlated with the alignment timeout because the alignment time doesn't take into account the barrier announcement.

 

Based on this, there is the proposal to change the semantic of alignmentTimeout configuration to such meaning:

*The time between the starting of checkpoint(on the checkpont coordinator) and the time when the checkpoint barrier will be received by task.*

By this definition, we will have kind of global timeout which says that if the AC isn't finished for alignmentTimeout time it will be switched to UC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)