[DISCUSSION] Complete restart after successive failures

[DISCUSSION] Complete restart after successive failures

Gyula Fóra
Hi all!

In the past years of running Flink in production we have seen a large
number of scenarios where Flink jobs go into unrecoverable failure
loops and only a complete manual restart helps.

In most cases this is due to memory leaks in the user program, leaking
threads, etc., and it leads to a failure loop because the job is
restarted within the same JVM (TaskManager). After each restart the leak
gets worse and worse, eventually crashing the TMs one after the other
and never recovering.

These issues are extremely hard to debug (they might only cause problems
after a few failures) and can cause long-lasting instabilities.

I suggest we add an option that would trigger a complete restart after a
configurable number of failures. This would release all containers (TM and
JM) and restart everything.
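
To make it a bit more concrete, what I have in mind is roughly the sketch
below (the option key and the setting style are completely made up for
illustration, nothing like this exists today):

import org.apache.flink.configuration.Configuration;

public class FullRestartOptionSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hypothetical key, purely for illustration: after this many job
        // failures, release all TM and JM containers and request fresh ones
        // instead of restarting within the same JVMs.
        conf.setInteger("cluster.full-restart.max-failures", 10);
        // A value of Integer.MAX_VALUE would effectively mean "never",
        // keeping today's behaviour as the default.
    }
}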

The only argument I see against this is that it might further hide the
root cause of the problem on the job/user side. While this is true, a
stuck production job with crashing TMs is probably the worse of the two.

What do you think?

Gyula

Re: [DISCUSSION] Complete restart after successive failures

Piotr Nowojski-2
Hi Gyula,

Personally I do not see a problem with providing such an option of a “clean restart” after N failures, especially if we set the default value for N to +infinity. However, people working more with Flink’s scheduling might have more to say about this.

Piotrek

Re: [DISCUSSION] Complete restart after successive failures

Till Rohrmann
Hi Gyula,

I see the benefit of having such an option. In fact, it goes in a similar
direction to the currently ongoing discussion about blacklisting TMs. In
the end it could work by reporting failures to the RM, which aggregates
some statistics for the individual TMs. Based on some thresholds it could
then decide to free/blacklist a specific TM. Whether to blacklist or
restart a container could then be a configurable option.
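
As a very rough sketch of what I mean (class, method and threshold names
below are made up, none of this exists in the code base), the RM-side
bookkeeping could be as simple as:

import java.util.HashMap;
import java.util.Map;

// Rough sketch only. A plain String stands in for the TM id to keep the
// example self-contained; in reality this would be the TM's resource id.
class TaskManagerFailureTracker {

    private final Map<String, Integer> failuresPerTaskManager = new HashMap<>();
    private final int blacklistThreshold;

    TaskManagerFailureTracker(int blacklistThreshold) {
        this.blacklistThreshold = blacklistThreshold;
    }

    /** Called by the RM whenever a task failure is reported for the given TM. */
    boolean shouldBlacklist(String taskManagerId) {
        int failures = failuresPerTaskManager.merge(taskManagerId, 1, Integer::sum);
        // Once the threshold is exceeded the RM could free/blacklist this TM,
        // or, configurably, restart its container instead.
        return failures >= blacklistThreshold;
    }
}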

Cheers,
Till

Re: [DISCUSSION] Complete restart after successive failures

Piotr Nowojski-2
That’s a good point, Till. Blacklisting TMs could handle this. One scenario that might be problematic is if a clean restart is needed after a more or less random number of job resubmissions, for example if resource leakage progresses at different rates on different nodes. In such a situation, if we blacklist and restart TMs one by one, the job can keep failing constantly, with each failure caused by a different TM. It could end up in an endless loop in some scenarios/setups, whereas Gyula’s proposal would restart all of the TMs at once, resetting the leakage on all of them at the same time and making a successful restart possible.

I still think that blacklisting TMs is the better way to do it, but maybe we still need some kind of limit, for example: after N blacklistings, restart all TMs. But this would also add additional complexity.
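
Just to illustrate the escalation I have in mind (again only a sketch, the
two restart hooks are hypothetical and not existing APIs):

import java.util.HashSet;
import java.util.Set;

// Sketch only: the two Runnable hooks stand in for "replace one container"
// and "release everything and start from scratch" (Gyula's proposal).
class BlacklistEscalationPolicy {

    private final Set<String> blacklistedTaskManagers = new HashSet<>();
    private final int maxBlacklistedTaskManagers;

    BlacklistEscalationPolicy(int maxBlacklistedTaskManagers) {
        this.maxBlacklistedTaskManagers = maxBlacklistedTaskManagers;
    }

    void onTaskManagerOverThreshold(
            String taskManagerId,
            Runnable restartSingleContainer,
            Runnable triggerFullClusterRestart) {

        blacklistedTaskManagers.add(taskManagerId);

        if (blacklistedTaskManagers.size() >= maxBlacklistedTaskManagers) {
            // After N blacklistings, fall back to the full restart: release
            // all containers (TM and JM), which also resets leaked resources.
            blacklistedTaskManagers.clear();
            triggerFullClusterRestart.run();
        } else {
            // Normal path: only replace the single offending TM.
            restartSingleContainer.run();
        }
    }
}

That keeps per-TM blacklisting as the primary mechanism and only escalates
to the full restart in the pathological case.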

Piotrek

Re: [DISCUSSION] Complete restart after successive failures

Gyula Fóra
Could it also work so that after a certain number of tries it blacklists
everything? That way it would effectively trigger a fresh restart.

Gyula
