Question about Flink's savepoint

Question about Flink's savepoint

Mu Kong
Hi all,

I have some questions about an experience I had with savepoints.
Last night my Flink cluster's memory usage looked weird, so I
decided to:

1. create a savepoint for the running job (there was only one job running at
the time),
2. cancel the job from the web UI,
3. restart the cluster.

When I then tried to resume the job from that savepoint, it failed with a
"Truncate did not truncate to right length. Should be 11757 is 56383."
exception.
A savepoint is also created every day at 4 a.m., so after the job failed to
start from the savepoint I had created before cancelling it, I tried the
4 a.m. savepoint instead, and it seemed to work well.

Then this morning I noticed that data is missing for the period between
cancelling the job and resuming it.

I thought that if I ran the job from the savepoint created at 4 a.m., it
would start processing data from 4 a.m. onward. Or am I missing something here?
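
To make that expectation concrete, here is a minimal toy model in plain Java. It is not Flink's API: the class ReplaySketch, the method resumeFrom, and the record names are all hypothetical. It assumes the savepoint records the source's read position and that the source can still serve the records after that position (as a replayable source such as Kafka typically can). Whether that assumption holds depends on the source actually used in the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only, not Flink's API: a savepoint is reduced to one saved read offset.
class ReplaySketch {

    // Resume processing at the offset stored in the savepoint.
    // The gap is only recovered if the source still holds those records.
    static List<String> resumeFrom(List<String> sourceLog, int savedOffset) {
        return new ArrayList<>(sourceLog.subList(savedOffset, sourceLog.size()));
    }

    public static void main(String[] args) {
        // Hypothetical records; r2 is the first record not yet read at 4 a.m.
        List<String> sourceLog = Arrays.asList("r0", "r1", "r2", "r3", "r4");
        int offsetAt4am = 2;

        // Everything from the saved offset onward is processed again.
        System.out.println(resumeFrom(sourceLog, offsetAt4am));
    }
}
```

If the source cannot rewind, or the job does not restore the source's position from the savepoint, the records in the gap would be skipped instead, which would match the data loss described above.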

Also, I didn't set a uid on the addSource() call. Could the auto-generated
ID have changed when I restarted the cluster, and could that be the reason
the recovery didn't go well?
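
As background for the uid question: in Flink an explicit ID is set by chaining .uid("...") after addSource(...). As far as I understand the Flink documentation, auto-generated IDs depend on the structure of the program, so they are sensitive to program changes rather than to a cluster restart as such. The sketch below is a toy model in plain Java, not Flink's API; the class OperatorIdSketch, both methods, and the ID format are hypothetical and only illustrate why restored state must be found under exactly the same operator ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only, not Flink's API: savepoint state is looked up by operator ID.
class OperatorIdSketch {

    // Hypothetical stand-in for an auto-generated ID: it is derived from the
    // operator's position in the job graph, so it changes if the graph changes.
    static String generatedId(String operatorName, int positionInGraph) {
        return operatorName + "#" + positionInGraph;
    }

    // State is restored only if the savepoint holds an entry under exactly this ID.
    static Optional<Long> restore(Map<String, Long> savepoint, String operatorId) {
        return Optional.ofNullable(savepoint.get(operatorId));
    }

    public static void main(String[] args) {
        Map<String, Long> savepoint = new HashMap<>();
        savepoint.put(generatedId("source", 0), 42L); // saved under a generated ID
        savepoint.put("my-source-uid", 42L);          // saved under an explicit ID

        // Unchanged program: the generated ID still matches.
        System.out.println(restore(savepoint, generatedId("source", 0)).isPresent());
        // Changed program (operator moved in the graph): no match, state is not found.
        System.out.println(restore(savepoint, generatedId("source", 1)).isPresent());
        // An explicit ID matches regardless of such changes.
        System.out.println(restore(savepoint, "my-source-uid").isPresent());
    }
}
```

Under this model, a missing uid would only explain a failed restore if the job itself changed between taking the savepoint and resuming from it.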

Re: Question about Flink's savepoint

Aljoscha Krettek-2
Hi,

What is the source you're using in your Job and what filesystem (if any) is it writing to?

Best,
Aljoscha

> On 5. Sep 2017, at 03:06, Mu Kong <[hidden email]> wrote:
>
> [...]