Hi everyone,
I wanted to reach out to you and ask whether decreasing the default delay to `0 s` for the fixed delay restart strategy [1] is causing trouble. A user reported that he would like to increase the default value because it can cause restart storms in case of systematic faults [2].

The downside of increasing the default delay would be a slightly increased restart time if this config option is not explicitly set.

[1] https://issues.apache.org/jira/browse/FLINK-9158
[2] https://issues.apache.org/jira/browse/FLINK-11218

Cheers,
Till |
Hi,
I think it's better to increase the default value. +1

Best. |
In our production, we usually override the restart delay to 10 s.
We once encountered cases where external services were overwhelmed by reconnections from frequently restarted tasks.
As a safer, though not optimal, option, a default delay larger than 0 s is better in my opinion. |
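To make the restart-storm concern above concrete, here is a minimal back-of-the-envelope sketch. It is not Flink code: the function names, the 0.5 s time-to-failure, and the parallelism of 100 are hypothetical values chosen purely for illustration. It assumes every restart attempt fails again after a fixed time and that each parallel task opens one connection to the external service per attempt.

```python
import math


def restart_attempts(window_s: float, delay_s: float, time_to_fail_s: float) -> int:
    """Count the full fail/restart cycles that fit into window_s.

    One cycle is time_to_fail_s (the job runs until it fails again) plus
    delay_s (the fixed restart delay). Both inputs are hypothetical.
    """
    cycle_s = time_to_fail_s + delay_s
    if cycle_s <= 0:
        raise ValueError("time_to_fail_s + delay_s must be positive")
    return math.floor(window_s / cycle_s)


def reconnects(window_s: float, delay_s: float, time_to_fail_s: float, parallelism: int) -> int:
    """Connection attempts seen by the external service within window_s,
    assuming one connection per parallel task per restart attempt."""
    return restart_attempts(window_s, delay_s, time_to_fail_s) * parallelism


if __name__ == "__main__":
    # Compare the three delays discussed in this thread over one minute.
    for delay_s in (0.0, 1.0, 10.0):
        n = reconnects(window_s=60.0, delay_s=delay_s, time_to_fail_s=0.5, parallelism=100)
        print(f"delay={delay_s:>4} s -> {n} connection attempts per minute")
```

Under these assumed numbers, a 0 s delay produces 12000 connection attempts per minute, 1 s produces 4000, and 10 s produces 500. The absolute figures depend entirely on the assumptions; the point is only that the delay caps how often a persistently failing job can hit an external system. In Flink itself the delay is set with the `restart-strategy.fixed-delay.delay` configuration option.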
+1 on what Zhu Zhu said.
We also override the default to 10 s. |
-1 on increasing the default delay to a non-zero value, for the reasons below:

a) I could see some concerns about setting the delay to zero in the very original JIRA (FLINK-2993 <https://issues.apache.org/jira/browse/FLINK-2993>), but later on in FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still decided to make the change, so I'm wondering whether that decision also came from a customer requirement. If so, how could we judge whether one requirement overrides the other?

b) There could be valid reasons for both default values depending on the use case, and there are workarounds (for example, under the current policy, manually setting the config to 10s resolves the problem mentioned). From earlier replies in this thread we can see that users have already taken action. Changing it back to non-zero won't affect such users but might surprise those who depend on 0 as the default.

Last but not least, no matter what decision we make this time, I'd suggest making it final and documenting it explicitly in our release notes. Checking the 1.5.0 release notes [1] [2], it seems we didn't mention the change to the default restart delay, and we'd better learn from that this time. Thanks.

[1] https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html

Best Regards,
Yu |
Thanks a lot for all your feedback. I see there is a slight tendency towards having a non-zero default delay so far.

However, Yu has brought up some valid points. Maybe I can shed some light on a).

Before FLINK-9158 we set the default delay to 10s because Flink did not support queued scheduling, which meant that if one slot was missing or still occupied, Flink would fail right away with a NoResourceAvailableException. In order to prevent this we added the delay. This also covered the case where the job was failing because of an overloaded external system.

When we finished FLIP-6, we thought that we could improve the user experience by decreasing the default delay to 0s because all Flink-related problems (slot still occupied, slot missing because of a reconnecting TM) could be handled by the default slot request timeout, which allows the slots to become ready after scheduling has been kicked off. However, we did not properly take the case of overloaded external systems into account.

For b) I agree that any default value should be properly documented. This was clearly an oversight when FLINK-9158 was merged. Moreover, I believe that there won't be a one-size-fits-all default value. There are always cases where one needs to adapt it to one's needs. But this is ok. The goal should be to find the default value which works for most cases.

So maybe the middle ground between 10s and 0s could be a solution. Setting the default restart delay to 1s should prevent restart storms caused by overloaded external systems and still be fast enough to not slow down recoveries noticeably in most cases. If one needs a super fast recovery, then one should set the delay value to 0s. If one requires a longer delay because of a particular infrastructure, then one needs to change the value too. What do you think?

Cheers,
Till |
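The trade-off described above (a small delay costs little on a one-off failure) can be sketched with some simple arithmetic. This is an illustrative model only, not anything taken from Flink: `restart_work_s`, the time the job needs to redeploy and resume once the restart begins, is a hypothetical input, and the 5 s used below is an assumed value.

```python
def recovery_time_s(delay_s: float, restart_work_s: float) -> float:
    """Total time to recover from a single transient failure:
    the fixed restart delay plus the (assumed) time to redeploy the job."""
    return delay_s + restart_work_s


def slowdown_percent(delay_s: float, restart_work_s: float) -> float:
    """Extra recovery time caused by the delay, relative to a 0 s delay."""
    if restart_work_s <= 0:
        raise ValueError("restart_work_s must be positive")
    baseline = recovery_time_s(0.0, restart_work_s)
    return 100.0 * (recovery_time_s(delay_s, restart_work_s) - baseline) / baseline


if __name__ == "__main__":
    restart_work_s = 5.0  # hypothetical redeploy time, for illustration only
    for delay_s in (0.0, 1.0, 10.0):
        total = recovery_time_s(delay_s, restart_work_s)
        extra = slowdown_percent(delay_s, restart_work_s)
        print(f"delay={delay_s:>4} s -> recovery {total:>4} s (+{extra:.0f}%)")
```

With the assumed 5 s of restart work, a 1 s delay lengthens a single recovery by 20%, while a 10 s delay lengthens it by 200%. Jobs that redeploy much faster than the assumed 5 s would feel the 1 s delay more strongly, which is the case where setting the delay back to 0s explicitly makes sense.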
1s sounds like a good tradeoff to me. |
1s looks good to me.
And I think the conclusion about when a user should override the delay is worth documenting.

Thanks,
Zhu Zhu |
Hi all,
just wanted to share my experience with configurations with you. For non-expert users, configuring Flink can be very daunting. The list of common properties already helps a lot [1], but it's not clear how they depend on each other, and settings common to specific use cases are not listed.

If we can give somewhat clear starting recommendations for the most common use cases (batch small/large cluster, streaming high throughput/low latency), I think users would be able to start much more quickly with a somewhat well-configured system and fine-tune the settings later. For example, Kafka Streams has a section on how to set the parameters for maximum resilience [2].

I'd propose to leave the current configuration page as a reference page, but also have a recommended configuration settings page that's directly linked in the first section, so that new users are not overwhelmed.

Sorry if this response is hijacking the discussion.
Btw, is the restart-strategy configuration missing from the main configuration page? Is this a conscious decision?

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#common-options
[2] https://docs.confluent.io/current/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency

--
Arvid Heise | Senior Software Engineer
Ververica GmbH |
Thanks again for the input, everyone. I'll conclude this survey thread and start a discuss thread to set the default restart delay to 1s.

@Arvid, I agree that better documentation on how to tune Flink with sane settings for certain scenarios would be super helpful. However, as you've said, it is somewhat hijacking the discussion, so I would exclude it from my proposed changes. The best thing to do would be to start a separate discussion/effort for it.

Concerning the restart strategy configuration options, they are currently only documented here [1]. I'm about to change that with this PR [2].

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html
[2] https://github.com/apache/flink/pull/9562

Cheers,
Till |
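For reference, overriding the delay of the fixed delay restart strategy explicitly, rather than relying on whatever default is in effect, might look like the following in `flink-conf.yaml`. This is a minimal sketch: the keys are the restart strategy options described on the task failure recovery page linked above, while the values are illustrative assumptions (3 attempts is arbitrary; 10 s is the override two users reported in this thread), not a recommendation.

```yaml
# flink-conf.yaml (sketch; values are illustrative assumptions)
restart-strategy: fixed-delay
# Number of restart attempts before the job is declared failed.
restart-strategy.fixed-delay.attempts: 3
# Delay between consecutive restart attempts.
# A non-zero delay gives an overloaded external system time to recover;
# set this to 0 s if the fastest possible recovery matters more.
restart-strategy.fixed-delay.delay: 10 s
```

A job that sets these options explicitly is unaffected by any change to the default delay.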
The FLIP-62 discuss thread can be found here [1].

[1] https://lists.apache.org/thread.html/9602b342602a0181fcb618581f3b12e692ed2fad98c59fd6c1caeabd@%3Cdev.flink.apache.org%3E

Cheers,
Till |