[jira] [Created] (FLINK-17994) Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived


Shang Yuanchun (Jira)
Zhijiang created FLINK-17994:
--------------------------------

             Summary: Fix the race condition between CheckpointBarrierUnaligner#processBarrier and #notifyBarrierReceived
                 Key: FLINK-17994
                 URL: https://issues.apache.org/jira/browse/FLINK-17994
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
            Reporter: Zhijiang
            Assignee: Zhijiang
             Fix For: 1.11.0


The race condition happens as follows:
 * The barrier for ch1 is received from the network by the netty thread, which schedules ch1 into the mailbox via #notifyBarrierReceived.
 * The barrier for ch2 is received from the network by the netty thread, but it is inserted into the channel's data queue before #notifyBarrierReceived is called. The task thread can therefore process ch2 before the netty thread has called #notifyBarrierReceived for it.
 * The task thread executes the checkpoint for ch2 directly because ch2 > ch1.
 * Afterwards, the task thread runs the previously scheduled ch1 from the mailbox, which causes an IllegalArgumentException inside SubtaskCheckpointCoordinatorImpl#checkpointState because it breaks the assumption that checkpoints are executed in increasing order.
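The interleaving above can be replayed deterministically in a small single-threaded sketch. This is not Flink code: the class BarrierRaceSketch, its fields, and the ordering check are hypothetical stand-ins, and the real check inside SubtaskCheckpointCoordinatorImpl#checkpointState may look different.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical replay of the reported interleaving; not the Flink implementation. */
class BarrierRaceSketch {
    // Assumed stand-in for the task mailbox.
    final Queue<Runnable> mailbox = new ArrayDeque<>();
    // Assumed stand-in for the "checkpoints only move forward" assumption.
    private long lastCheckpointId = -1;

    void checkpointState(long checkpointId) {
        if (checkpointId <= lastCheckpointId) {
            throw new IllegalArgumentException(
                "Checkpoint " + checkpointId + " arrived after " + lastCheckpointId);
        }
        lastCheckpointId = checkpointId;
    }

    // Netty thread path: schedule the checkpoint as a mailbox action.
    void notifyBarrierReceived(long checkpointId) {
        mailbox.add(() -> checkpointState(checkpointId));
    }

    // Task thread path: the barrier was polled from the channel's data queue.
    void processBarrier(long checkpointId) {
        if (checkpointId > lastCheckpointId) {
            checkpointState(checkpointId);
        }
    }

    /** Returns true if the replay ends in the IllegalArgumentException. */
    static boolean replayFails() {
        BarrierRaceSketch task = new BarrierRaceSketch();
        task.notifyBarrierReceived(1); // ch1 is scheduled into the mailbox
        task.processBarrier(2);        // ch2 is seen in the data queue first
        try {
            task.mailbox.remove().run(); // the stale ch1 action runs last
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }
}
```

Calling BarrierRaceSketch.replayFails() walks through the four steps in order and reports whether the late ch1 action was rejected.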

One possible solution is to insert the received barrier into the channel's data queue only after calling #notifyBarrierReceived. We can then assume that the checkpoint is always triggered by the netty thread, which simplifies the current situation where the checkpoint might be triggered by either the task thread or the netty thread.

This also lets us avoid calling #notifyBarrierReceived from the task thread while it processes the barrier, which simplifies the logic inside CheckpointBarrierUnaligner.
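A minimal sketch of the proposed ordering, under the same assumptions as the description: the netty thread schedules the checkpoint first and only then makes the barrier visible in the data queue, and the task thread never triggers a checkpoint itself. The class NotifyFirstSketch and all of its members are hypothetical names for illustration, not the actual patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of "notify first, enqueue second"; not the Flink implementation. */
class NotifyFirstSketch {
    final Queue<Runnable> mailbox = new ArrayDeque<>();      // assumed task mailbox
    final Queue<Long> channelDataQueue = new ArrayDeque<>(); // assumed channel data queue
    final List<Long> executed = new ArrayList<>();           // checkpoints in execution order
    private long lastCheckpointId = -1;

    void checkpointState(long checkpointId) {
        if (checkpointId <= lastCheckpointId) {
            throw new IllegalArgumentException(
                "Checkpoint " + checkpointId + " arrived after " + lastCheckpointId);
        }
        lastCheckpointId = checkpointId;
        executed.add(checkpointId);
    }

    // Netty thread: schedule the checkpoint, then expose the barrier to the task thread.
    void onBarrierFromNetwork(long checkpointId) {
        mailbox.add(() -> checkpointState(checkpointId));
        channelDataQueue.add(checkpointId);
    }

    // Task thread: consumes the barrier but does not trigger a checkpoint.
    void processBarrier(long checkpointId) {
        // Intentionally empty in this sketch.
    }

    static List<Long> replay() {
        NotifyFirstSketch task = new NotifyFirstSketch();
        task.onBarrierFromNetwork(1);
        task.onBarrierFromNetwork(2);
        // Even if the task thread drains the data queue before the mailbox,
        // the mailbox actions still run in arrival order.
        while (!task.channelDataQueue.isEmpty()) {
            task.processBarrier(task.channelDataQueue.remove());
        }
        while (!task.mailbox.isEmpty()) {
            task.mailbox.remove().run();
        }
        return task.executed;
    }
}
```

Because only the netty path adds mailbox actions in this sketch, the order of checkpoints depends on a single queue rather than on which thread sees the barrier first.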



--
This message was sent by Atlassian Jira
(v8.3.4#803005)