Hi all!
In the past years of running Flink in production we have seen a large number of scenarios in which a Flink job goes into an unrecoverable failure loop and only a complete manual restart helps.

In most cases this is caused by memory leaks in the user program, leaked threads, etc., and it turns into a failure loop because the job is restarted within the same JVM (TaskManager). With every restart the leak gets worse, eventually crashing the TMs one after another, and the job never recovers.

These issues are extremely hard to debug (they might only cause problems after a few failures) and can lead to long-lasting instability.

I suggest we add an option that would trigger a complete restart after a configurable number of failures. This would release all containers (TM and JM) and restart everything from scratch.

The only argument I see against this is that it might further hide the root cause of the problem on the job/user side. While this is true, a stuck production job with crashing TMs is clearly the worse of the two.

What do you think?

Gyula
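For illustration, a minimal sketch of the kind of escalation policy proposed above; the option name and the restart hook are hypothetical and do not correspond to any existing Flink configuration or API:

/**
 * Illustrative sketch only: a hypothetical policy that escalates to a full
 * cluster restart (release all TM/JM containers) after N job failures.
 * The option name and the restart hook are assumptions, not existing Flink APIs.
 */
public class FullRestartAfterFailuresPolicy {

    /** Hypothetical config key, e.g. set in flink-conf.yaml. */
    public static final String FULL_RESTART_AFTER_FAILURES_KEY =
            "cluster.full-restart-after-failures";

    private final int maxFailuresBeforeFullRestart; // a non-positive value disables the feature
    private int failureCount = 0;

    public FullRestartAfterFailuresPolicy(int maxFailuresBeforeFullRestart) {
        this.maxFailuresBeforeFullRestart = maxFailuresBeforeFullRestart;
    }

    /** Called on every job failure; returns true if a full cluster restart should be triggered. */
    public boolean onJobFailure() {
        failureCount++;
        return maxFailuresBeforeFullRestart > 0
                && failureCount >= maxFailuresBeforeFullRestart;
    }

    /** Reset after a successful full restart, so counting starts fresh. */
    public void onFullRestartCompleted() {
        failureCount = 0;
    }
}

A JobManager-side component could consult such a policy on every failure and, when it fires, release all containers instead of restarting tasks in place.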
Hi Gyula,
Personally I do not see a problem with providing such an option of a "clean restart" after N failures, especially if we set the default value of N to +infinity (i.e. disabled by default). However, the people working more closely with Flink's scheduling might have more to say about this.

Piotrek

> On 29 Dec 2018, at 13:36, Gyula Fóra <[hidden email]> wrote:
> [...]
Hi Gyula,
I see the benefit of having such an option. In fact, it goes in a similar direction as the currently ongoing discussion about blacklisting TMs. In the end it could work by reporting failures to the RM, which aggregates statistics for the individual TMs. Based on some thresholds it could then decide to free or blacklist a specific TM. Whether to blacklist or to restart a container could then be a configurable option.

Cheers,
Till

On Wed, Jan 2, 2019 at 1:15 PM Piotr Nowojski <[hidden email]> wrote:
> [...]
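A rough sketch of what such per-TM failure aggregation on the RM side could look like; the class, the action names and the thresholding are purely illustrative assumptions, not existing Flink ResourceManager code:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: the ResourceManager side could aggregate failures
 * per TaskManager and, once a threshold within a time window is exceeded,
 * decide to blacklist or restart that container. Names are hypothetical and
 * do not correspond to existing Flink classes.
 */
public class TaskManagerFailureTracker {

    public enum Action { NONE, BLACKLIST, RESTART_CONTAINER }

    private final int failureThreshold;
    private final Duration window;
    private final Action actionOnThreshold; // configurable: blacklist vs. restart
    private final Map<String, Deque<Instant>> failuresPerTm = new HashMap<>();

    public TaskManagerFailureTracker(int failureThreshold, Duration window, Action actionOnThreshold) {
        this.failureThreshold = failureThreshold;
        this.window = window;
        this.actionOnThreshold = actionOnThreshold;
    }

    /** Record a failure reported for the given TM and return the action to take. */
    public Action onFailure(String taskManagerId, Instant now) {
        Deque<Instant> failures =
                failuresPerTm.computeIfAbsent(taskManagerId, id -> new ArrayDeque<>());
        failures.addLast(now);

        // Drop failures that fall outside the sliding window.
        while (!failures.isEmpty() && failures.peekFirst().isBefore(now.minus(window))) {
            failures.removeFirst();
        }

        return failures.size() >= failureThreshold ? actionOnThreshold : Action.NONE;
    }
}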
That’s a good point, Till. Blacklisting TMs could handle this. One scenario that might be problematic is when a clean restart is needed after a more or less random number of job resubmissions, for example if the resource leakage progresses at different rates on different nodes. In such a situation, if we blacklist and restart TMs one by one, the job can keep failing constantly, with each failure caused by a different TM. It could end up in an endless loop in some scenarios/setups. Gyula's proposal, on the other hand, would restart all of the TMs at once, resetting the leakage on all of them at the same time and making a successful restart possible.
I still think that blacklisting TMs is the better way to do it, but maybe we also need some kind of global limit, e.g. after N blacklisted TMs restart everything. That would, however, add additional complexity.

Piotrek

> On 3 Jan 2019, at 13:59, Till Rohrmann <[hidden email]> wrote:
> [...]
Could it also work so that after so many tries it blacklists everything?
That way it would pretty much trigger a fresh restart.

Gyula

On Fri, 4 Jan 2019 at 10:11, Piotr Nowojski <[hidden email]> wrote:
> [...]
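A sketch of how the two ideas could be combined: per-TM blacklisting as the first line of defense, escalating to blacklisting/releasing everything (effectively a fresh restart) once too many TMs have been blacklisted. All names here are hypothetical:

import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch only: combine per-TM blacklisting with a global
 * escalation limit. Once maxBlacklistedTms TaskManagers have been
 * blacklisted, blacklist/release everything, which amounts to a fresh
 * cluster restart. Purely hypothetical, not existing Flink code.
 */
public class BlacklistEscalationPolicy {

    public enum Decision { BLACKLIST_SINGLE_TM, RESTART_ALL }

    private final int maxBlacklistedTms;
    private final Set<String> blacklistedTms = new HashSet<>();

    public BlacklistEscalationPolicy(int maxBlacklistedTms) {
        this.maxBlacklistedTms = maxBlacklistedTms;
    }

    /** Called when the failure tracker decides a TM should be blacklisted. */
    public Decision onBlacklist(String taskManagerId) {
        blacklistedTms.add(taskManagerId);
        if (blacklistedTms.size() >= maxBlacklistedTms) {
            blacklistedTms.clear(); // everything is released, so start counting fresh
            return Decision.RESTART_ALL;
        }
        return Decision.BLACKLIST_SINGLE_TM;
    }
}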