Dong Lin created FLINK-22488:
--------------------------------
Summary: KafkaSourceLegacyITCase.testOneToOneSources failed due to "OperatorEvent from an OperatorCoordinator to a task was lost"
Key: FLINK-22488
URL:
https://issues.apache.org/jira/browse/FLINK-22488 Project: Flink
Issue Type: Improvement
Reporter: Dong Lin
According to [1], the test KafkaSourceLegacyITCase.testOneToOneSources failed because it runs a streaming job (which uses KafkaSource) with restartAttempts=1. In addition to the failover explicitly triggered by the FailingIdentityMapper, the job additionally failed due to "org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency", which is unexpected by the test.
Note that SubtaskGatewayImpl was updated by [2] on 4/14 which triggers task failover if any OperatorEvent was lost. This could explain why those Kafka tests start to fail due to the exception described above.
In order to make this test stable, let's try to understand why there is such a high chance of loosing OperatorEvent in the Azure test pipeline. And if we could not avoid loosing OperatorEvent in the test pipeline, we probably need to update the test to allow the pipeline being restarted arbitrary times (and still be able to stop the test on the happy path).
[1]
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17212&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5&l=6960[2]
https://github.com/apache/flink/pull/15605--
This message was sent by Atlassian Jira
(v8.3.4#803005)