[SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Hi everyone,

I wanted to reach out to you and ask how many of you are using a customized
RestartStrategy[1] in production jobs.

We are currently developing the new Flink scheduler[2], which interacts
with restart strategies in a different way. We have to redesign the
interfaces for the new restart strategies (the so-called
RestartBackoffTimeStrategy). Existing customized RestartStrategy
implementations will no longer work with the new scheduler.

We want to know whether we should keep a way to customize
RestartBackoffTimeStrategy so that existing customized RestartStrategy
implementations can be migrated.

I'd appreciate it if you could share your status if you are using a
customized RestartStrategy. That will be valuable for us in making
decisions.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
[2] https://issues.apache.org/jira/browse/FLINK-10429

Thanks,
Zhu Zhu

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Oytun Tez
Hi Zhu,

We are using a custom restart strategy like this:

environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
Time.minutes(10)));


---
Oytun Tez

*M O T A W O R D*
The World's Fastest Human Translation Platform.
[hidden email] — www.motaword.com


On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <[hidden email]> wrote:

> Hi everyone,
>
> I wanted to reach out to you and ask how many of you are using a
> customized RestartStrategy[1] in production jobs.
>
> We are currently developing the new Flink scheduler[2], which interacts
> with restart strategies in a different way. We have to redesign the
> interfaces for the new restart strategies (the so-called
> RestartBackoffTimeStrategy). Existing customized RestartStrategy
> implementations will no longer work with the new scheduler.
>
> We want to know whether we should keep a way to customize
> RestartBackoffTimeStrategy so that existing customized RestartStrategy
> implementations can be migrated.
>
> I'd appreciate it if you could share your status if you are using a
> customized RestartStrategy. That will be valuable for us in making
> decisions.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
> [2] https://issues.apache.org/jira/browse/FLINK-10429
>
> Thanks,
> Zhu Zhu
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Thanks Oytun for the reply!

Sorry for not having stated it clearly. By "customized
RestartStrategy", we mean that users implement an
*org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
themselves and use it via a configuration like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory".

The usage of restart strategies you mentioned will keep working with the
new scheduler.

Thanks,
Zhu Zhu

On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <[hidden email]> wrote:

> Hi Zhu,
>
> We are using a custom restart strategy like this:
>
> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
> Time.minutes(10)));
>
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> [hidden email] — www.motaword.com
>
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Thanks everyone for the input.

RestartStrategy customization is not regarded as a public interface, as it
is not explicitly documented.
Since the feedback from this survey indicates that it is not in use, I'll
conclude that we do not need to support customized RestartStrategy for the
new scheduler in Flink 1.10.

Other usages are still supported, including all the strategies and
configuration methods described in
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
.

Feel free to share in this thread if you have any concerns about it.

Thanks,
Zhu Zhu

On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <[hidden email]> wrote:

> Thanks Oytun for the reply!
>
> Sorry for not having stated it clearly. By "customized
> RestartStrategy", we mean that users implement an
> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
> themselves and use it via a configuration like "restart-strategy:
> org.foobar.MyRestartStrategyFactoryFactory".
>
> The usage of restart strategies you mentioned will keep working with the
> new scheduler.
>
> Thanks,
> Zhu Zhu
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu
We do use a config like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
beyond the ones Flink provides.

On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <[hidden email]> wrote:

> Thanks everyone for the input.
>
> RestartStrategy customization is not regarded as a public interface, as
> it is not explicitly documented.
> Since the feedback from this survey indicates that it is not in use, I'll
> conclude that we do not need to support customized RestartStrategy for
> the new scheduler in Flink 1.10.
>
> Other usages are still supported, including all the strategies and
> configuration methods described in
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
> .
>
> Feel free to share in this thread if you have any concerns about it.
>
> Thanks,
> Zhu Zhu
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Thanks Steven for the feedback!
Could you share more information about the metrics you add in your
customized restart strategy?

Thanks,
Zhu Zhu

On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <[hidden email]> wrote:

> We do use a config like "restart-strategy:
> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
> beyond the ones Flink provides.
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu
Zhu Zhu,

Flink's fullRestarts metric is a Gauge, which is not good for alerting on.
We publish an equivalent Counter metric for alerting purposes.

Thanks,
Steven

On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <[hidden email]> wrote:

> Thanks Steven for the feedback!
> Could you share more information about the metrics you add in your
> customized restart strategy?
>
> Thanks,
> Zhu Zhu
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Steven,

Thanks for the information. If we can determine that this is a common
issue, we can solve it in Flink core.
To get to that state, I have two questions that need your help:
1. Why is a gauge not good for alerting? The metric "fullRestarts" is a
Gauge<Long>. Does the metric reporter you use report Counter and
Gauge<Long> to external services in different ways? Or can anything else
differ due to the metric type?
2. Is the "number of restarts" what you actually need, rather than
the "fullRestarts" count? If so, I believe we will have such a counter
metric in 1.10, since the existing "fullRestarts" metric value is not the
number of restarts when fine-grained recovery (a feature added in 1.9.0)
is enabled.
    "fullRestarts" reveals how many times the entire job graph has been
restarted. If fine-grained recovery is enabled, the graph is not restarted
when task failures happen, and the "fullRestarts" value does not increment
in such cases.

I'd appreciate it if you could help with these questions so that we can
make better decisions for Flink.

Thanks,
Zhu Zhu

On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <[hidden email]> wrote:

> Zhu Zhu,
>
> Flink's fullRestarts metric is a Gauge, which is not good for alerting
> on. We publish an equivalent Counter metric for alerting purposes.
>
> Thanks,
> Steven
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu
When we set up an alert like "fullRestarts > 1" over some rolling window,
we want to use a counter. If it is a Gauge, "fullRestarts" will never go
below 1 after the first full restart, so the alert condition will always be
true after the first job restart. If we can apply a derivative to the Gauge
value, I guess the alert can probably work. I can explore whether that is
an option.
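
The derivative idea can be sketched roughly as follows. This is a minimal illustration in plain Java, not Flink code: the class name RollingIncrease, its record method, and the sampling scheme are all hypothetical. It assumes the alerting side periodically samples a cumulative value such as "fullRestarts" and wants the increase inside a rolling window rather than the raw value.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Flink): turns samples of a cumulative
 * gauge such as "fullRestarts" into "increase within a rolling window".
 */
class RollingIncrease {
    /** One observed gauge value together with the time it was sampled. */
    private static final class Sample {
        final long timestampMillis;
        final long value;

        Sample(long timestampMillis, long value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long windowMillis;
    private final Deque<Sample> samples = new ArrayDeque<>();

    RollingIncrease(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records a sample and returns how much the gauge grew inside the window. */
    long record(long timestampMillis, long gaugeValue) {
        samples.addLast(new Sample(timestampMillis, gaugeValue));
        // Drop samples older than the window, but always keep one as the baseline.
        while (samples.size() > 1
                && samples.peekFirst().timestampMillis < timestampMillis - windowMillis) {
            samples.removeFirst();
        }
        // Increase relative to the oldest sample still inside the window.
        return gaugeValue - samples.peekFirst().value;
    }
}
```

An alert on the returned increase clears once the window slides past the restarts, whereas an alert on the raw gauge value would stay true indefinitely. The sketch does not handle a gauge that resets to zero, which would show up as a negative increase.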

Yeah, understood that "fullRestarts" won't increment when fine-grained
recovery happens. I think a "task_failures" counter already exists in Flink.



On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <[hidden email]> wrote:

> Steven,
>
> Thanks for the information. If we can determine that this is a common
> issue, we can solve it in Flink core.
> To get to that state, I have two questions that need your help:
> 1. Why is a gauge not good for alerting? The metric "fullRestarts" is a
> Gauge<Long>. Does the metric reporter you use report Counter and
> Gauge<Long> to external services in different ways? Or can anything else
> differ due to the metric type?
> 2. Is the "number of restarts" what you actually need, rather than
> the "fullRestarts" count? If so, I believe we will have such a counter
> metric in 1.10, since the existing "fullRestarts" metric value is not the
> number of restarts when fine-grained recovery (a feature added in 1.9.0)
> is enabled.
>     "fullRestarts" reveals how many times the entire job graph has been
> restarted. If fine-grained recovery is enabled, the graph is not restarted
> when task failures happen, and the "fullRestarts" value does not increment
> in such cases.
>
> I'd appreciate it if you could help with these questions so that we can
> make better decisions for Flink.
>
> Thanks,
> Zhu Zhu
>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Steven,

In my mind, a Flink counter only stores its accumulated count and reports
that value. Are you using an external counter directly?
Maybe Flink's Meter/MeterView is what you need? It stores the count and
calculates the rate, and it reports its "count" as well as its "rate" to
external metric services.
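
To illustrate what "stores the count and calculates the rate" can look like, here is a simplified sketch in plain Java. It is not Flink's MeterView: the class SimpleMeter, its constructor parameters, and the assumption that update() is called by some timer once per update interval are all hypothetical and chosen only for illustration.

```java
/**
 * Hypothetical, simplified meter (not Flink's MeterView): keeps a running
 * count and derives a per-second rate from snapshots taken at a fixed interval.
 */
class SimpleMeter {
    private final int spanSeconds;  // time span the rate is averaged over
    private final long[] snapshots; // ring buffer of past counts
    private int newest = 0;         // index of the most recent snapshot
    private long count = 0;
    private double rate = 0.0;

    /** Assumes timeSpanSeconds is a positive multiple of updateIntervalSeconds. */
    SimpleMeter(int timeSpanSeconds, int updateIntervalSeconds) {
        int intervals = timeSpanSeconds / updateIntervalSeconds;
        // One extra slot so the oldest and newest snapshots are a full span apart.
        this.snapshots = new long[intervals + 1];
        this.spanSeconds = intervals * updateIntervalSeconds;
    }

    void markEvent() {
        count++;
    }

    long getCount() {
        return count;
    }

    double getRate() {
        return rate;
    }

    /** Must be called once per update interval, for example by a timer. */
    void update() {
        newest = (newest + 1) % snapshots.length;
        snapshots[newest] = count;
        // The slot right after the newest one holds the oldest snapshot.
        long oldest = snapshots[(newest + 1) % snapshots.length];
        rate = (double) (count - oldest) / spanSeconds;
    }
}
```

Both values are available to a reporter: the count keeps growing, while the rate falls back to zero once no new events arrive within the time span, which makes it the more natural value to alert on.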

The counter "task_failures" only works if the individual failover strategy
is enabled. However, it is not a public interface and its use is not
recommended, as fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects
fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <[hidden email]> wrote:

>
> When we set up an alert like "fullRestarts > 1" over some rolling window,
> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
> below 1 after the first full restart, so the alert condition will always
> be true after the first job restart. If we can apply a derivative to the
> Gauge value, I guess the alert can probably work. I can explore whether
> that is an option.
>
> Yeah, understood that "fullRestarts" won't increment when fine-grained
> recovery happens. I think a "task_failures" counter already exists in
> Flink.
>
>
>
> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <[hidden email]> wrote:
>
>> Steven,
>>
>> Thanks for the information. If we can determine this a common issue, we
>> can solve it in Flink core.
>> To get to that state, I have two questions which need your help:
>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>> Gauge<Long>. Does the metric reporter you use report Counter and
>> Gauge<Long> to external services in different ways? Or anything else can be
>> different due to the metric type?
>> 2. Is the "number of restarts" what you actually need, rather than
>> the "fullRestart" count? If so, I believe we will have such a counter
>> metric in 1.10, since the previous "fullRestart" metric value is not the
>> number of restarts when grained recovery (feature added 1.9.0) is enabled.
>>     "fullRestart" reveals how many times entire job graph has been
>> restarted. If grained recovery (feature added 1.9.0) is enabled, the graph
>> would not be restarted when task failures happen and the "fullRestart"
>> value will not increment in such cases.
>>
>> I'd appreciate if you can help with these questions and we can make
>> better decisions for Flink.
>>
>> Thanks,
>> Zhu Zhu
>>

Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu
Zhu Zhu,

Sorry, I was using different terminology. Yes, a Flink Meter is what I had in
mind for threshold-based alerting on "fullRestarts".
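
To illustrate the alerting point, here is a minimal sketch of why a threshold on a cumulative value stays true forever, while a threshold on the increase over a rolling window clears again. It assumes the metric is sampled at a fixed interval. The class and method names are hypothetical and are not part of Flink's metrics API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not a Flink class. */
final class RestartAlertSketch {

    /** Gauge-style check: compares the latest cumulative value to the threshold. */
    static boolean gaugeAlert(List<Long> cumulativeSamples, long threshold) {
        long latest = cumulativeSamples.get(cumulativeSamples.size() - 1);
        return latest > threshold;
    }

    /** Window-style check: compares the increase over the last `window` samples to the threshold. */
    static boolean windowAlert(List<Long> cumulativeSamples, int window, long threshold) {
        int last = cumulativeSamples.size() - 1;
        int first = Math.max(0, last - window);
        long delta = cumulativeSamples.get(last) - cumulativeSamples.get(first);
        return delta > threshold;
    }

    public static void main(String[] args) {
        // Cumulative restart count, one sample per interval: two restarts early on, then quiet.
        List<Long> samples = Arrays.asList(0L, 1L, 2L, 2L, 2L, 2L, 2L, 2L);
        System.out.println(gaugeAlert(samples, 1));     // true: the value never drops back
        System.out.println(windowAlert(samples, 3, 1)); // false: no restarts in the last 3 samples
    }
}
```

A metric that reports a rate, or a backend that can take a derivative of the reported value, gives the second behaviour without any custom code.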


Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
Hi Steven,

To conclude: since we will have a meter metric[1] for restarts, a
customized restart strategy is not needed in your case.
Is that right?

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu


Re: [SURVEY] How many people are using customized RestartStrategy(s)

Steven Wu
Zhu Zhu, that is correct.


Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu
We will then keep the decision not to support customized restart
strategies in Flink 1.10.

Thanks, Steven, for the input!

Thanks,
Zhu Zhu
