flink checkpoint state data corruption


flink checkpoint state data corruption

Jeffrey Martin
Hi all,

I'm using protobufs as keys of a Flink stream using code copied from this pull request (https://github.com/apache/flink/pull/7598), but deserialization is failing after checkpoint restore due to missing data.

I'm using HDFS and the RocksDB backend. I tried providing the path to a previous retained checkpoint (i.e., .../flink-checkpoints/{jobIdHex}/chk-14/_metadata). The proto deserializer failed on a serialized record that had been truncated and was missing its last 20 bytes out of 77 total.

The same serializers work fine if I don't try restoring from a checkpoint, worked fine for a different job, are fairly well unit-tested, and mostly just delegate to the protobuf serde code, so I'm pretty certain my serializer is not the issue — which means I'm doing something else wrong.
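For reference, here is a self-contained sketch of the failure mode I'm seeing: a length-prefixed record that loses its last 20 of 77 bytes, so the reader hits EOF mid-record. This is plain Python for illustration only — none of it is Flink or protobuf code.

```python
import io
import struct

def write_record(out: io.BytesIO, payload: bytes) -> None:
    # Length-prefix the payload so the reader knows where the record ends.
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)

def read_record(inp: io.BytesIO) -> bytes:
    header = inp.read(4)
    if len(header) < 4:
        raise EOFError("truncated record: missing length header")
    (length,) = struct.unpack(">i", header)
    payload = inp.read(length)
    if len(payload) < length:
        raise EOFError(f"truncated record: expected {length} bytes, got {len(payload)}")
    return payload

buf = io.BytesIO()
write_record(buf, b"x" * 77)
buf.seek(0)
assert read_record(buf) == b"x" * 77  # round-trips fine normally

# Simulate the observed corruption: the last 20 payload bytes are missing.
truncated = io.BytesIO(buf.getvalue()[:-20])
try:
    read_record(truncated)
except EOFError as e:
    print(e)  # truncated record: expected 77 bytes, got 57
```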

Questions:
1. Have others encountered issues like this?
2. How do I know when a checkpoint has been completed and is safe to restore from? (Is checkpoint completion atomic?)

Thanks,

Jeff Martin

[Attachment: corruption-stacktrace.txt (5K)]

Re: flink checkpoint state data corruption

Yu Li
@Yun Tang <[hidden email]> As the author of the referenced PR, it would
be great if you could help take a look here. Thanks.

@Jeffrey:
For your second question, with an officially released Flink, it is safe to
restore from a checkpoint once it has completed successfully [1]. The issue
you encountered was probably caused by a bug in the un-merged PR; let's
wait for the author's answer.

Hope this helps.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/internals/stream_checkpointing.html
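To make "completed" concrete: a retained checkpoint directory is only usable once its _metadata file exists, since that file is written when the checkpoint is finalized. A rough sketch of selecting the newest restorable checkpoint under the chk-N layout Jeffrey mentioned (plain Python, purely illustrative):

```python
import os
import re
from typing import Optional

def latest_complete_checkpoint(job_dir: str) -> Optional[str]:
    """Return the _metadata path of the newest chk-N that has one, else None."""
    best_n, best_path = -1, None
    for name in os.listdir(job_dir):
        m = re.fullmatch(r"chk-(\d+)", name)
        if not m:
            continue
        meta = os.path.join(job_dir, name, "_metadata")
        # An in-progress or discarded checkpoint has no _metadata file yet.
        if os.path.isfile(meta) and int(m.group(1)) > best_n:
            best_n, best_path = int(m.group(1)), meta
    return best_path
```

The resulting _metadata path is what you would then pass to `flink run -s <path>` to restore.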

Best Regards,
Yu


On Thu, 14 Nov 2019 at 06:27, Jeffrey Martin <[hidden email]>
wrote:


Re: flink checkpoint state data corruption

Yun Tang
Hi Jeffrey

Thanks for your willingness to try this un-merged PR. Could you please share your exception stack trace and your proto file? If you could also put together a minimal job that reproduces the bug and share it with me, that would make it even easier to fix.


Best
Yun Tang

From: Yu Li <[hidden email]>
Date: Friday, November 15, 2019 at 1:29 AM
To: dev <[hidden email]>, Yun Tang <[hidden email]>
Subject: Re: flink checkpoint state data corruption
