possible backwards compatibility issue between 1.8->1.9?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

possible backwards compatibility issue between 1.8->1.9?

Bekir Oguz-2
Hi guys,
during our upgrade from 1.8.1 to 1.9.1, one of our jobs fail to start with
the following exception. We deploy the job with 'allow-non-restored-state'
option and from the latest checkpoint dir of the 1.8.1 version.

org.apache.flink.util.StateMigrationException: The new state typeSerializer
for operator state must not be incompatible.
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend
.getListState(DefaultOperatorStateBackend.java:323)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend
.getListState(DefaultOperatorStateBackend.java:214)
    at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator
.initializeState(AsyncWaitOperator.java:272)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator
.initializeState(AbstractStreamOperator.java:281)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(
StreamTask.java:881)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask
.java:395)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)

We see from the Web UI that the 'async wait operator' is causing this,
which is not changed at all during this upgrade.

All other jobs are migrated without problems, only this one is failing. Has
anyone else experienced this during migration?

Regards,
Bekir Oguz
Reply | Threaded
Open this post in threaded view
|

Re: possible backwards compatibility issue between 1.8->1.9?

Piotr Nowojski-3
Hi,

(This question is more appropriate for the user mailing list, not dev - when responding to my e-mail please remove dev mailing list from the recipients, I’ve kept it just FYI that discussion has moved to user mailing list).

Could it be, that the problem is caused by changes in chaining strategy of the AsyncWaitOperator in 1.9, as explained in the release notes [1]?

> AsyncIO
> Due to a bug in the AsyncWaitOperator, in 1.9.0 the default chaining behaviour of the operator is now changed so that it
> is never chained after another operator. This should not be problematic for migrating from older version snapshots as
> long as an uid was assigned to the operator. If an uid was not assigned to the operator, please see the instructions here [2]
> for a possible workaround.
>
> Related issues:
>
> • FLINK-13063: AsyncWaitOperator shouldn’t be releasing checkpointingLock [3]

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.9.html#asyncio <https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.9.html#asyncio>
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/upgrading.html#matching-operator-state <https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/upgrading.html#matching-operator-state>
[3] https://issues.apache.org/jira/browse/FLINK-13063 <https://issues.apache.org/jira/browse/FLINK-13063>

> On 30 Oct 2019, at 16:52, Bekir Oguz <[hidden email]> wrote:
>
> Hi guys,
> during our upgrade from 1.8.1 to 1.9.1, one of our jobs fail to start with
> the following exception. We deploy the job with 'allow-non-restored-state'
> option and from the latest checkpoint dir of the 1.8.1 version.
>
> org.apache.flink.util.StateMigrationException: The new state typeSerializer
> for operator state must not be incompatible.
>    at org.apache.flink.runtime.state.DefaultOperatorStateBackend
> .getListState(DefaultOperatorStateBackend.java:323)
>    at org.apache.flink.runtime.state.DefaultOperatorStateBackend
> .getListState(DefaultOperatorStateBackend.java:214)
>    at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator
> .initializeState(AsyncWaitOperator.java:272)
>    at org.apache.flink.streaming.api.operators.AbstractStreamOperator
> .initializeState(AbstractStreamOperator.java:281)
>    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(
> StreamTask.java:881)
>    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask
> .java:395)
>    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>    at java.lang.Thread.run(Thread.java:748)
>
> We see from the Web UI that the 'async wait operator' is causing this,
> which is not changed at all during this upgrade.
>
> All other jobs are migrated without problems, only this one is failing. Has
> anyone else experienced this during migration?
>
> Regards,
> Bekir Oguz