Local alignment timeouts are very confusing and especially without timeout on the outputs, they can significantly delay timeouting to UC.
Problematic case is when all CBs are received with long delay because of the back pressure, but they arrive at the same time. Alignment time can be low (milliseconds), while start delay is ~1 minute. In that case checkpoint doesn't timeout to UC and is passing the responsibility to timeout down the stream.
So it is not so transparant for the user why and when AC switches to UC. As mentioned before, the start delay is not correlated with the alignment timeout because it doesn't take into account time in output buffer. the alignment time is not fully correlated with the alignment timeout because the alignment time doesn't take into account the barrier announcement.
Based on this, there is the proposal to change the semantic of alignmentTimeout configuration to such meaning:
*The time between the starting of checkpoint(on the checkpont coordinator) and the time when the checkpoint barrier will be received by task.*
By this definition, we will have kind of global timeout which says that if the AC isn't finished for alignmentTimeout time it will be switched to UC.
This message was sent by Atlassian Jira