Hi everyone,
I wanted to reach out and ask how many of you are using a customized RestartStrategy[1] in production jobs.

We are currently developing the new Flink scheduler[2], which interacts with restart strategies in a different way. We have to redesign the interfaces for the new restart strategies (the so-called RestartBackoffTimeStrategy). Existing customized RestartStrategy implementations will no longer work with the new scheduler.

We want to know whether we should keep a way to customize RestartBackoffTimeStrategy so that existing customized RestartStrategy implementations can be migrated.

I'd appreciate it if you could share your status if you are using a customized RestartStrategy. That will be valuable for us in making decisions.

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
[2] https://issues.apache.org/jira/browse/FLINK-10429

Thanks,
Zhu Zhu

Hi Zhu,
We are using a custom restart strategy like this:

environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1), Time.minutes(10)));

---
Oytun Tez

*M O T A W O R D*
The World's Fastest Human Translation Platform.
www.motaword.com

(In reply to Zhu Zhu's message of Thu, Sep 12, 2019, 7:11 AM.)

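The three arguments in the call above are the maximum number of failures per interval, the interval over which the failure rate is measured, and the delay between restart attempts. The following is a minimal sketch of the failure-rate rule in plain Java with no Flink dependency. `FailureRateSketch` and its method are hypothetical names used only for illustration; this is a simplified model of the documented behavior, not Flink's implementation, and the exact boundary handling in Flink may differ. The restart delay is left out of the model.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of a failure-rate restart rule (not Flink code). */
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps of the failures that still fall inside the measuring interval.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records a failure at the given time and reports whether a restart is still allowed. */
    boolean recordFailureAndCheck(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        // Drop failures that have fallen out of the measuring interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= intervalMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // At most 2 failures per 1-minute interval, as in the call above.
        FailureRateSketch rule = new FailureRateSketch(2, 60_000L);
        System.out.println(rule.recordFailureAndCheck(0L));      // first failure
        System.out.println(rule.recordFailureAndCheck(10_000L)); // second failure
        System.out.println(rule.recordFailureAndCheck(20_000L)); // third failure within the interval
    }
}
```

In this model the third failure inside the same minute is the one that exceeds the rate, so the rule stops allowing restarts at that point.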
Thanks Oytun for the reply!
Sorry for not having stated it clearly. By "customized RestartStrategy", we mean that users implement *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* themselves and enable it through configuration, like "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory".

The usage of restart strategies you mentioned will keep working with the new scheduler.

Thanks,
Zhu Zhu

(In reply to Oytun Tez's message of Thu, Sep 12, 2019, 10:05 PM.)

Thanks everyone for the input.
RestartStrategy customization is not regarded as a public interface, as it is not explicitly documented.
Since the feedback from this survey shows that nobody is using it, I'll conclude that we do not need to support customized RestartStrategy for the new scheduler in Flink 1.10.

Other usages are still supported, including all the strategies and configuration methods described in:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies

Feel free to share in this thread if you have any concerns about this.

Thanks,
Zhu Zhu

(In reply to Zhu Zhu's message of Thu, Sep 12, 2019, 10:33 PM.)

We do use a config like "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics beyond the ones Flink provides.

(In reply to Zhu Zhu's message of Thu, Sep 19, 2019, 4:50 AM.)

Thanks Steven for the feedback!
Could you share more information about the metrics you add in your customized restart strategy?

Thanks,
Zhu Zhu

(In reply to Steven Wu's message of Fri, Sep 20, 2019, 7:11 AM.)

Zhu Zhu,
Flink's fullRestarts metric is a Gauge, which is not good for alerting. We publish an equivalent Counter metric for alerting purposes.

Thanks,
Steven

(In reply to Zhu Zhu's message of Thu, Sep 19, 2019, 7:45 PM.)

Steven,
Thanks for the information. If we can determine that this is a common issue, we can solve it in Flink core.
To get to that state, I have two questions that need your help:
1. Why is a gauge not good for alerting? The metric "fullRestarts" is a Gauge<Long>. Does the metric reporter you use report Counter and Gauge<Long> to external services in different ways? Or can anything else differ because of the metric type?
2. Is the "number of restarts" what you actually need, rather than the "fullRestarts" count? If so, I believe we will have such a counter metric in 1.10, since the "fullRestarts" value is not the number of restarts when fine-grained recovery (a feature added in 1.9.0) is enabled.
   "fullRestarts" reveals how many times the entire job graph has been restarted. If fine-grained recovery is enabled, the graph is not restarted when task failures happen, and the "fullRestarts" value does not increment in such cases.

I'd appreciate it if you could help with these questions so that we can make better decisions for Flink.

Thanks,
Zhu Zhu

(In reply to Steven Wu's message of Sun, Sep 22, 2019, 3:31 AM.)

When we set up an alert like "fullRestarts > 1" over some rolling window, we want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1 after the first full restart, so the alert condition will always be true after the first job restart. If we can apply a derivative to the Gauge value, I guess the alert can probably work. I can explore whether that is an option.

Yeah, understood that "fullRestarts" won't increment when fine-grained recovery happens. I think a "task_failures" counter already exists in Flink.

(In reply to Zhu Zhu's message of Sun, Sep 22, 2019, 7:59 PM.)

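The alerting problem described above can be illustrated with a small sketch in plain Java, assuming a monitoring system that periodically samples a cumulative restart count. `RestartAlertSketch` and both of its methods are hypothetical names, not part of Flink or of any metrics backend. The sketch contrasts a threshold on the raw cumulative value, which stays true once crossed, with a threshold on the increase over a rolling window, which is roughly what applying a derivative to the value would give.

```java
/** Hypothetical alert rules over samples of a cumulative restart count (not Flink code). */
final class RestartAlertSketch {

    /** Fires when the latest cumulative value exceeds the threshold. */
    static boolean alertOnCumulative(long[] samples, long threshold) {
        return samples[samples.length - 1] > threshold;
    }

    /** Fires when the increase over the last windowSize sampling intervals exceeds the threshold. */
    static boolean alertOnWindowDelta(long[] samples, int windowSize, long threshold) {
        int last = samples.length - 1;
        int start = Math.max(0, last - windowSize);
        return samples[last] - samples[start] > threshold;
    }

    public static void main(String[] args) {
        // Two restarts early on, followed by a long quiet period.
        long[] samples = {0, 1, 2, 2, 2, 2, 2, 2};
        System.out.println(alertOnCumulative(samples, 1));     // prints true
        System.out.println(alertOnWindowDelta(samples, 3, 1)); // prints false
    }
}
```

With the cumulative rule the alert keeps firing long after the job has stabilized, while the windowed rule clears once the restarts fall out of the window.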
Steven,
As I understand it, a Flink counter only stores its accumulated count and reports that value. Are you using an external counter directly?
Maybe Flink's Meter/MeterView is what you need? It stores the count and calculates the rate, and it reports both its "count" and its "rate" to external metric services.

The counter "task_failures" only works if the individual failover strategy is enabled. However, it is not a public interface and its use is not recommended, as fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

(In reply to Steven Wu's message of Tue, Sep 24, 2019, 6:41 AM.)

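To make the count-versus-rate distinction above concrete, here is a minimal sketch in plain Java of a meter that keeps a cumulative count and derives a rate over a fixed time span. `SimpleMeterSketch`, its ring-buffer layout, and the tick parameters are assumptions made for illustration; the sketch is only loosely modeled on the idea described in the message and is not Flink's MeterView code.

```java
/** Hypothetical minimal meter: a cumulative count plus a rate over a fixed time span. */
final class SimpleMeterSketch {
    private final long[] history;      // counts captured at each update tick (ring buffer)
    private final int secondsPerTick;
    private int tick = 0;
    private long count = 0;

    SimpleMeterSketch(int spanInTicks, int secondsPerTick) {
        this.history = new long[spanInTicks + 1];
        this.secondsPerTick = secondsPerTick;
    }

    /** Registers one event, for example one restart. */
    void markEvent() {
        count++;
    }

    /** Expected to be called once per tick by some scheduler; stores the current count. */
    void update() {
        tick = (tick + 1) % history.length;
        history[tick] = count;
    }

    /** Cumulative number of events; never decreases. */
    long getCount() {
        return count;
    }

    /** Events per second over the span covered by the ring buffer. */
    double getRate() {
        long oldest = history[(tick + 1) % history.length];
        int spanSeconds = (history.length - 1) * secondsPerTick;
        return (double) (history[tick] - oldest) / spanSeconds;
    }

    public static void main(String[] args) {
        SimpleMeterSketch meter = new SimpleMeterSketch(2, 5); // 10-second span
        for (int i = 0; i < 10; i++) {
            meter.markEvent();
        }
        meter.update();
        System.out.println(meter.getCount() + " events, rate " + meter.getRate());
    }
}
```

The count behaves like the cumulative value discussed earlier in the thread, while the rate drops back to zero once no new events arrive within the span, which is the property a threshold alert needs.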
Zhu Zhu,
Sorry, I was using different terminology. Yes, the Flink Meter is what I was talking about regarding "fullRestarts" for threshold-based alerting. |
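To make the alerting point concrete, here is a minimal sketch of the "derivative over a rolling window" idea discussed in this thread. It does not use any Flink API: the class name RestartWindowAlert and its methods are hypothetical, and the cumulative value passed in is assumed to be whatever the "fullRestarts" gauge reports. It shows why a plain threshold on a cumulative value stays true forever, while the increase inside a window clears again.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Flink): decides whether to alert based on
 * how much a cumulative restart count grew within a rolling time window.
 */
final class RestartWindowAlert {
    private final long windowMillis;
    private final long threshold;
    // Samples of {timestampMillis, cumulativeRestarts}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();

    RestartWindowAlert(long windowMillis, long threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Records a cumulative value and returns true if the increase since the
     * oldest sample still inside the window exceeds the threshold.
     */
    boolean record(long timestampMillis, long cumulativeRestarts) {
        samples.addLast(new long[] {timestampMillis, cumulativeRestarts});
        // Evict samples that fell out of the window; always keep the newest one.
        while (samples.size() > 1
                && samples.peekFirst()[0] < timestampMillis - windowMillis) {
            samples.removeFirst();
        }
        long increase = cumulativeRestarts - samples.peekFirst()[1];
        return increase > threshold;
    }

    /** Naive rule on the cumulative value: never clears once exceeded. */
    static boolean naiveAlert(long cumulativeRestarts, long threshold) {
        return cumulativeRestarts > threshold;
    }
}
```

With a one-minute window and a threshold of 1, two restarts within ten seconds trigger the windowed alert, and it clears once those samples age out; the naive rule on the cumulative value keeps firing. A Meter that reports a rate, as suggested above, gives the metrics backend the same kind of signal without this bookkeeping.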
Hi Steven,
To conclude: since we will have a Meter metric[1] for restarts, a customized restart strategy is not needed in your case. Is that right? [1] https://issues.apache.org/jira/browse/FLINK-14164 Thanks, Zhu Zhu |
Zhu Zhu, that is correct.
|
We will then keep the decision that we do not support customized restart
strategy in Flink 1.10. Thanks, Steven, for the input! Thanks, Zhu Zhu |