(DEPRECATED) Apache Flink Mailing List archive.

possible backwards compatibility issue between 1.8->1.9?

Bekir Oguz-2

possible backwards compatibility issue between 1.8->1.9?

Hi guys,
during our upgrade from 1.8.1 to 1.9.1, one of our jobs fails to start with
the following exception. We deploy the job with the 'allow-non-restored-state'
option, restoring from the latest checkpoint dir of the 1.8.1 version.

org.apache.flink.util.StateMigrationException: The new state typeSerializer for operator state must not be incompatible.
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:323)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:214)
    at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.initializeState(AsyncWaitOperator.java:272)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)

We see from the Web UI that the 'async wait operator' is causing this,
even though we did not change that operator at all during this upgrade.

All other jobs migrated without problems; only this one is failing. Has
anyone else experienced this during migration?

Regards,
Bekir Oguz

Piotr Nowojski-3

Re: possible backwards compatibility issue between 1.8->1.9?

Hi,

(This question is more appropriate for the user mailing list than for dev. When responding to my e-mail, please remove the dev mailing list from the recipients; I’ve kept it only as an FYI that the discussion has moved to the user mailing list.)

Could the problem be caused by the change in the chaining strategy of the AsyncWaitOperator in 1.9, as explained in the release notes [1]?

> AsyncIO
> Due to a bug in the AsyncWaitOperator, in 1.9.0 the default chaining behaviour of the operator is now changed so that it
> is never chained after another operator. This should not be problematic for migrating from older version snapshots as
> long as an uid was assigned to the operator. If an uid was not assigned to the operator, please see the instructions here [2]
> for a possible workaround.
>
> Related issues:
>
> • FLINK-13063: AsyncWaitOperator shouldn’t be releasing checkpointingLock [3]

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.9.html#asyncio
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/upgrading.html#matching-operator-state
[3] https://issues.apache.org/jira/browse/FLINK-13063
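
The release note's remark about uids can be illustrated with a toy model. The sketch below is not Flink code and does not reproduce how Flink actually generates operator IDs; it only assumes that an explicit uid gives a stable ID, while an auto-generated ID depends on the structure of the job. The names `StateMatchingSketch`, `Operator`, `position`, `restore` and `migrate` are hypothetical, and the thread does not establish whether this mechanism is the cause of the exception reported above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT Flink code, and not Flink's real operator ID hashing.
class StateMatchingSketch {

    /** Hypothetical operator: an optional explicit uid plus its position in the job graph. */
    static final class Operator {
        final String name;
        final String uid;    // null when no uid was assigned
        final int position;  // crude stand-in for "the structure of the program"

        Operator(String name, String uid, int position) {
            this.name = name;
            this.uid = uid;
            this.position = position;
        }

        /** An explicit uid wins; otherwise the ID is derived from the job structure. */
        String operatorId() {
            return uid != null ? "uid:" + uid : "auto:" + name + "@" + position;
        }
    }

    /** Maps snapshot state (keyed by operator ID) onto the operators of a new job. */
    static Map<String, String> restore(Map<String, String> snapshot,
                                       List<Operator> job,
                                       boolean allowNonRestoredState) {
        Map<String, String> restored = new LinkedHashMap<>();
        Set<String> unmatched = new LinkedHashSet<>(snapshot.keySet());
        for (Operator op : job) {
            String id = op.operatorId();
            if (snapshot.containsKey(id)) {
                restored.put(op.name, snapshot.get(id));
                unmatched.remove(id);
            }
        }
        if (!unmatched.isEmpty() && !allowNonRestoredState) {
            throw new IllegalStateException("State without a matching operator: " + unmatched);
        }
        return restored;
    }

    /** Takes a one-operator snapshot of the old job, then restores it into the new job. */
    static Map<String, String> migrate(Operator oldOp, Operator newOp,
                                       boolean allowNonRestoredState) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        snapshot.put(oldOp.operatorId(), "pending-async-requests");
        return restore(snapshot, Arrays.asList(newOp), allowNonRestoredState);
    }

    public static void main(String[] args) {
        // With a uid, the state follows the operator even though its position changed.
        System.out.println(migrate(new Operator("async", "async-lookup", 1),
                                   new Operator("async", "async-lookup", 0), false));
        // prints {async=pending-async-requests}

        // Without a uid, the generated ID changes and the old state is silently skipped.
        System.out.println(migrate(new Operator("async", null, 1),
                                   new Operator("async", null, 0), true));
        // prints {}
    }
}
```

In this model, the second case without `allowNonRestoredState` throws instead of dropping the state. In an actual Flink job, the uid is assigned by calling `uid(...)` on the stream returned by `AsyncDataStream.orderedWait(...)` or `AsyncDataStream.unorderedWait(...)`.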

> On 30 Oct 2019, at 16:52, Bekir Oguz <[hidden email]> wrote:
>
> Hi guys,
> during our upgrade from 1.8.1 to 1.9.1, one of our jobs fail to start with
> the following exception. We deploy the job with 'allow-non-restored-state'
> option and from the latest checkpoint dir of the 1.8.1 version.
>
> org.apache.flink.util.StateMigrationException: The new state typeSerializer for operator state must not be incompatible.
>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:323)
>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend.getListState(DefaultOperatorStateBackend.java:214)
>     at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.initializeState(AsyncWaitOperator.java:272)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
>
> We see from the Web UI that the 'async wait operator' is causing this,
> which is not changed at all during this upgrade.
>
> All other jobs are migrated without problems, only this one is failing. Has
> anyone else experienced this during migration?
>
> Regards,
> Bekir Oguz