Hi all,
I'm using protobufs as keys of a Flink stream using code copied fromĀ this pull request, but deserialization is failing after checkpoint restore due to missing data. I'm using HDFS and the RocksDB backend. I tried providing the path to a previous retained checkpoint (i.e., .../flink-checkpoints/{jobIdHex}/chk-14/_metadata). The proto deserializer failed on a serialized record that had been truncated and was missing its last 20 bytes out of 77 total. The same serializers work fine if I don't try restoring from a checkpoint, worked fine for a different job, are fairly well unit-tested, and mostly just delegate to the protobuf serde code so I'm pretty certain my serializer is not the issue. Which means I'm doing something else wrong. Questions: 1. Have others encountered issues like this? 2. How do I know when a checkpoint has been completed and is safe to restore from? (Is checkpoint completion atomic?) Thanks, Jeff Martin corruption-stacktrace.txt (5K) Download Attachment |
@Yun Tang <[hidden email]> As the author of the referenced PR, it would
be great if you could help take a look here. Thanks. @Jeffery: For your second question, with officially released Flink, once checkpoint has been completed successfully it's safe to restore from [1]. The issue you encountered probably was caused by some bug of the un-merged PR and let's wait for the author's answer. Hope this helps. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/internals/stream_checkpointing.html Best Regards, Yu On Thu, 14 Nov 2019 at 06:27, Jeffrey Martin <[hidden email]> wrote: > Hi all, > > I'm using protobufs as keys of a Flink stream using code copied from this > pull request <https://github.com/apache/flink/pull/7598>, but > deserialization is failing after checkpoint restore due to missing data. > > I'm using HDFS and the RocksDB backend. I tried providing the path to a > previous retained checkpoint (i.e., > .../flink-checkpoints/{jobIdHex}/chk-14/_metadata). The proto deserializer > failed on a serialized record that had been truncated and was missing its > last 20 bytes out of 77 total. > > The same serializers work fine if I don't try restoring from a checkpoint, > worked fine for a different job, are fairly well unit-tested, and mostly > just delegate to the protobuf serde code so I'm pretty certain my > serializer is not the issue. Which means I'm doing something else wrong. > > Questions: > 1. Have others encountered issues like this? > 2. How do I know when a checkpoint has been completed and is safe to > restore from? (Is checkpoint completion atomic?) > > Thanks, > > Jeff Martin > |
Hi Jeffrey
Thanks for your willingness to try this un-merged PR. Would you please share your exception stack and your proto file? If you could just summed up a job which could be able to reproduce this bug and share this to me, that would be better to fix this problem. Best Yun Tang From: Yu Li <[hidden email]> Date: Friday, November 15, 2019 at 1:29 AM To: dev <[hidden email]>, Yun Tang <[hidden email]> Subject: Re: flink checkpoint state data corruption @Yun Tang<mailto:[hidden email]> As the author of the referenced PR, it would be great if you could help take a look here. Thanks. @Jeffery: For your second question, with officially released Flink, once checkpoint has been completed successfully it's safe to restore from [1]. The issue you encountered probably was caused by some bug of the un-merged PR and let's wait for the author's answer. Hope this helps. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/internals/stream_checkpointing.html<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.9%2Finternals%2Fstream_checkpointing.html&data=02%7C01%7C%7C20ad43d960484165b4ab08d7692845e0%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637093493994961183&sdata=otC1ekbvD107uY0EzWxq8DrJhHRmGzX2xjK%2FdOB5zMU%3D&reserved=0> Best Regards, Yu On Thu, 14 Nov 2019 at 06:27, Jeffrey Martin <[hidden email]<mailto:[hidden email]>> wrote: Hi all, I'm using protobufs as keys of a Flink stream using code copied from this pull request<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fpull%2F7598&data=02%7C01%7C%7C20ad43d960484165b4ab08d7692845e0%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637093493994981205&sdata=F%2BIELKMtJTqci1B9a0%2BlTvvQJuaGzdif7hAYtdyx53w%3D&reserved=0>, but deserialization is failing after checkpoint restore due to missing data. I'm using HDFS and the RocksDB backend. I tried providing the path to a previous retained checkpoint (i.e., .../flink-checkpoints/{jobIdHex}/chk-14/_metadata). The proto deserializer failed on a serialized record that had been truncated and was missing its last 20 bytes out of 77 total. The same serializers work fine if I don't try restoring from a checkpoint, worked fine for a different job, are fairly well unit-tested, and mostly just delegate to the protobuf serde code so I'm pretty certain my serializer is not the issue. Which means I'm doing something else wrong. Questions: 1. Have others encountered issues like this? 2. How do I know when a checkpoint has been completed and is safe to restore from? (Is checkpoint completion atomic?) Thanks, Jeff Martin |
Free forum by Nabble | Edit this page |