[jira] [Created] (FLINK-21642) RequestReplyFunction recovery fails with a remote SDK

Shang Yuanchun (Jira)
Igal Shilman created FLINK-21642:
------------------------------------

             Summary: RequestReplyFunction recovery fails with a remote SDK
                 Key: FLINK-21642
                 URL: https://issues.apache.org/jira/browse/FLINK-21642
             Project: Flink
          Issue Type: Bug
          Components: Stateful Functions
            Reporter: Igal Shilman


While extending our smoke e2e test to use the remote SDKs, I stumbled upon a bug in the RequestReplyFunction: we get an unknown state exception after recovery.

The exact scenario that triggers the bug is:
 # There was a request in flight.
 # A failure occurs that causes the job to restart.
 # On restore, we start with no managed state.
 # However, we re-send exactly the same ToFunction message to the SDK.
 # That ToFunction contains the state definitions from the previous attempt (before the failure).
 # The SDK processes this message normally, because the message carries all the state definitions that the SDK knows.
 # The SDK responds with a state mutation.
 # The PersistedRemoteFunctionValues fails with an unknown state exception.
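
The steps above can be replayed in a small toy model. This is only a sketch: the class and method names (RecoverySketch, ManagedValues, InFlightRequest, sdkHandle) and the state name "seen_count" are hypothetical stand-ins, not Stateful Functions classes, and the real runtime is considerably more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the failure; none of these names are StateFun classes.
final class RecoverySketch {

    /** Stand-in for the runtime-side registry of managed remote state. */
    static final class ManagedValues {
        private final Set<String> registered = new HashSet<>();
        private final Map<String, String> values = new HashMap<>();

        void register(String stateName) {
            registered.add(stateName);
        }

        void applyMutation(String stateName, String newValue) {
            if (!registered.contains(stateName)) {
                throw new IllegalStateException("Unknown state: " + stateName);
            }
            values.put(stateName, newValue);
        }
    }

    /** Stand-in for the in-flight request, including the state names it carried. */
    static final class InFlightRequest {
        final Set<String> stateNames;

        InFlightRequest(Set<String> stateNames) {
            this.stateNames = stateNames;
        }
    }

    /** Stand-in for the SDK: it finds its state in the request and returns the name it mutates. */
    static String sdkHandle(InFlightRequest request) {
        return request.stateNames.iterator().next();
    }

    /** Replays the scenario; returns the error message, or null if nothing failed. */
    static String replayAfterRestore() {
        // Steps 1-2: a request carrying the "seen_count" definition was in flight at failure time.
        InFlightRequest request = new InFlightRequest(Set.of("seen_count"));
        // Step 3: after restore, the runtime starts with nothing registered.
        ManagedValues restored = new ManagedValues();
        // Steps 4-7: the stale request is re-sent as-is and the SDK answers with a mutation.
        String mutatedState = sdkHandle(request);
        try {
            // Step 8: the mutation targets a state the restored runtime does not know.
            restored.applyMutation(mutatedState, "1");
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }
}
```

In this model the mismatch is between what the stale request claims is registered and what the restored runtime actually has registered.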

 

We need to treat the restored ToFunction message as a retryBatch instead of re-sending it as-is.
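
One possible reading of this proposal, sketched under assumptions: the request is rebuilt from the state the runtime currently has registered, the SDK reports any state it is missing, and the runtime registers that state and sends the batch again. All names below (RetryBatchSketch, Response, sdkHandle, retryBatch, "seen_count") are hypothetical, and the actual fix may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a retry loop; not the actual StateFun implementation.
final class RetryBatchSketch {

    /** States that the hypothetical SDK function declares. */
    static final Set<String> SDK_DECLARED = Set.of("seen_count");

    /** One round trip yields either a set of missing states or a mutated state name. */
    static final class Response {
        final Set<String> missingStates;
        final String mutatedState;

        Response(Set<String> missingStates, String mutatedState) {
            this.missingStates = missingStates;
            this.mutatedState = mutatedState;
        }
    }

    /** Stand-in for the SDK: it refuses to invoke the function until all declared states are present. */
    static Response sdkHandle(Set<String> statesInRequest) {
        Set<String> missing = new HashSet<>(SDK_DECLARED);
        missing.removeAll(statesInRequest);
        if (!missing.isEmpty()) {
            return new Response(missing, null);
        }
        return new Response(Set.of(), "seen_count");
    }

    /** Sends the batch until the SDK accepts it; returns the number of round trips used. */
    static int retryBatch(Set<String> registered) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            // The request is rebuilt from the currently registered state, not replayed as stored.
            Response response = sdkHandle(new HashSet<>(registered));
            if (response.missingStates.isEmpty()) {
                if (!registered.contains(response.mutatedState)) {
                    throw new IllegalStateException("Unknown state: " + response.mutatedState);
                }
                return attempt;
            }
            // Register what the SDK asked for, then retry the same batch.
            registered.addAll(response.missingStates);
        }
        throw new IllegalStateException("Batch not accepted after 3 attempts");
    }
}
```

Starting from an empty registry, as after a restore, this model needs two round trips: the first reports the missing state and the second applies the mutation against a registry that now knows it.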

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)