Igal Shilman created FLINK-21642:
------------------------------------
Summary: RequestReplyFunction recovery fails with a remote SDK
Key: FLINK-21642
URL: https://issues.apache.org/jira/browse/FLINK-21642
Project: Flink
Issue Type: Bug
Components: Stateful Functions
Reporter: Igal Shilman
While extending our smoke e2e test to use the remote SDKs, I stumbled upon a bug in the RequestReplyFunction: we get an unknown state exception after recovery.
The exact scenario that triggers the bug is:
# There is a request in flight.
# A failure occurs that causes the job to restart.
# On restore, we start with no managed state.
# But we try to re-send exactly the same ToFunction message to the SDK.
# That ToFunction contains state definitions from the previous attempt (before the failure).
# The SDK processes this message normally (the message has all the state definitions that the SDK knows about).
# The SDK responds with a state mutation.
# The PersistedRemoteFunctionValues fails with an unknown state exception.
We need to treat the ToFunction message as a retryBatch instead of re-sending it as-is.
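
The scenario above can be reproduced in a small, self-contained model. This is only a sketch, not the actual Stateful Functions code: ManagedState, Request, Reply, sdk, resendAsIs and resendAsRetry are all hypothetical stand-ins, and the sketch assumes the SDK answers a request that lacks a state it needs by listing the missing state names, so the runtime can register them and try again.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RecoverySketch {

    /** Hypothetical stand-in for the managed state the runtime holds for a function. */
    static final class ManagedState {
        private final Set<String> registered = new HashSet<>();
        private final Map<String, String> values = new HashMap<>();

        void register(String name) { registered.add(name); }
        Set<String> registeredNames() { return new HashSet<>(registered); }
        String get(String name) { return values.get(name); }

        void mutate(String name, String value) {
            if (!registered.contains(name)) {
                throw new IllegalStateException("Unknown state: " + name);
            }
            values.put(name, value);
        }
    }

    /** Hypothetical stand-in for a ToFunction message: the state names it carries and a payload. */
    static final class Request {
        final Set<String> stateNames;
        final String payload;
        Request(Set<String> stateNames, String payload) {
            this.stateNames = stateNames;
            this.payload = payload;
        }
    }

    /** Hypothetical SDK reply: either a list of missing states, or the name of a mutated state. */
    static final class Reply {
        final Set<String> missingStates;
        final String mutatedState;
        Reply(Set<String> missingStates, String mutatedState) {
            this.missingStates = missingStates;
            this.mutatedState = mutatedState;
        }
    }

    /** SDK stand-in that needs a state named "seen" (assumed name) and mutates it. */
    static Reply sdk(Request request) {
        if (!request.stateNames.contains("seen")) {
            return new Reply(Collections.singleton("seen"), null);
        }
        return new Reply(Collections.<String>emptySet(), "seen");
    }

    /** Buggy path: after restore, re-send the in-flight request unchanged. */
    static void resendAsIs(ManagedState restored, Request inFlight) {
        Reply reply = sdk(inFlight);
        restored.mutate(reply.mutatedState, inFlight.payload); // throws: "seen" is not registered
    }

    /** Retry path: rebuild the request from the state that is registered right now. */
    static void resendAsRetry(ManagedState restored, Request inFlight) {
        Reply reply = sdk(new Request(restored.registeredNames(), inFlight.payload));
        if (!reply.missingStates.isEmpty()) {
            reply.missingStates.forEach(restored::register);
            reply = sdk(new Request(restored.registeredNames(), inFlight.payload));
        }
        restored.mutate(reply.mutatedState, inFlight.payload);
    }

    public static void main(String[] args) {
        // The request was built before the failure, so it still carries "seen".
        Request inFlight = new Request(Collections.singleton("seen"), "hello");

        try {
            resendAsIs(new ManagedState(), inFlight);
        } catch (IllegalStateException e) {
            System.out.println("as-is: " + e.getMessage());
        }

        ManagedState restored = new ManagedState();
        resendAsRetry(restored, inFlight);
        System.out.println("retry: " + restored.get("seen"));
    }
}
```

In this model the as-is path fails because the stale request convinces the SDK that the runtime already knows the state, while the retry path rebuilds the request from the restored (empty) state and lets the SDK declare what it needs first.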