Hi guys,
during our upgrade from 1.8.1 to 1.9.1, one of our jobs fail to start with the following exception. We deploy the job with 'allow-non-restored-state' option and from the latest checkpoint dir of the 1.8.1 version. org.apache.flink.util.StateMigrationException: The new state typeSerializer for operator state must not be incompatible. at org.apache.flink.runtime.state.DefaultOperatorStateBackend .getListState(DefaultOperatorStateBackend.java:323) at org.apache.flink.runtime.state.DefaultOperatorStateBackend .getListState(DefaultOperatorStateBackend.java:214) at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator .initializeState(AsyncWaitOperator.java:272) at org.apache.flink.streaming.api.operators.AbstractStreamOperator .initializeState(AbstractStreamOperator.java:281) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState( StreamTask.java:881) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask .java:395) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530) at java.lang.Thread.run(Thread.java:748) We see from the Web UI that the 'async wait operator' is causing this, which is not changed at all during this upgrade. All other jobs are migrated without problems, only this one is failing. Has anyone else experienced this during migration? Regards, Bekir Oguz |
Hi,
(This question is more appropriate for the user mailing list, not dev - when responding to my e-mail please remove dev mailing list from the recipients, I’ve kept it just FYI that discussion has moved to user mailing list). Could it be, that the problem is caused by changes in chaining strategy of the AsyncWaitOperator in 1.9, as explained in the release notes [1]? > AsyncIO > Due to a bug in the AsyncWaitOperator, in 1.9.0 the default chaining behaviour of the operator is now changed so that it > is never chained after another operator. This should not be problematic for migrating from older version snapshots as > long as an uid was assigned to the operator. If an uid was not assigned to the operator, please see the instructions here [2] > for a possible workaround. > > Related issues: > > • FLINK-13063: AsyncWaitOperator shouldn’t be releasing checkpointingLock [3] Piotrek [1] https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.9.html#asyncio <https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.9.html#asyncio> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/upgrading.html#matching-operator-state <https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/upgrading.html#matching-operator-state> [3] https://issues.apache.org/jira/browse/FLINK-13063 <https://issues.apache.org/jira/browse/FLINK-13063> > On 30 Oct 2019, at 16:52, Bekir Oguz <[hidden email]> wrote: > > Hi guys, > during our upgrade from 1.8.1 to 1.9.1, one of our jobs fail to start with > the following exception. We deploy the job with 'allow-non-restored-state' > option and from the latest checkpoint dir of the 1.8.1 version. > > org.apache.flink.util.StateMigrationException: The new state typeSerializer > for operator state must not be incompatible. > at org.apache.flink.runtime.state.DefaultOperatorStateBackend > .getListState(DefaultOperatorStateBackend.java:323) > at org.apache.flink.runtime.state.DefaultOperatorStateBackend > .getListState(DefaultOperatorStateBackend.java:214) > at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator > .initializeState(AsyncWaitOperator.java:272) > at org.apache.flink.streaming.api.operators.AbstractStreamOperator > .initializeState(AbstractStreamOperator.java:281) > at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState( > StreamTask.java:881) > at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask > .java:395) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530) > at java.lang.Thread.run(Thread.java:748) > > We see from the Web UI that the 'async wait operator' is causing this, > which is not changed at all during this upgrade. > > All other jobs are migrated without problems, only this one is failing. Has > anyone else experienced this during migration? > > Regards, > Bekir Oguz |
Free forum by Nabble | Edit this page |