[DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
Hi everyone,

I wanted to discuss how to simplify Flink's cluster level RestartStrategy
configuration [1]. Currently, Flink's behaviour with respect to configuring
the {{RestartStrategies}} is quite complicated and convoluted. The reason
for this is that we evolved the way it has been configured and wanted to
keep it backwards compatible. Due to this, we have currently the following
behaviour:

* If the config option `restart-strategy` is configured, then Flink uses
this `RestartStrategy` (so far so simple)
* If the config option `restart-strategy` is not configured, then
** If `restart-strategy.fixed-delay.attempts` or
`restart-strategy.fixed-delay.delay` are defined, then instantiate
`FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
restart-strategy.fixed-delay.delay)`
** If `restart-strategy.fixed-delay.attempts` and
`restart-strategy.fixed-delay.delay` are not defined, then
*** If checkpointing is disabled, then choose `NoRestartStrategy`
*** If checkpointing is enabled, then choose
`FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`

I would like to simplify the configuration by removing the "If
`restart-strategy.fixed-delay.attempts` or
`restart-strategy.fixed-delay.delay`, then" condition. That way, the logic
would be the following:

* If the config option `restart-strategy` is configured, then Flink uses
this `RestartStrategy`
* If the config option `restart-strategy` is not configured, then
** If checkpointing is disabled, then choose `NoRestartStrategy`
** If checkpointing is enabled, then choose
`FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`

That way we retain the user friendliness that jobs restart if the user
enabled checkpointing and we make it clear that any `
restart-strategy.fixed-delay.xyz` setting will only be respected if
`restart-strategy` has been set to `fixed-delay`.

This simplification would, however, change Flink's behaviour and might
break existing setups. Since we introduced `RestartStrategies` with Flink
1.0.0 and deprecated the prior configuration mechanism which enables
restarting if either the `attempts` or the `delay` has been set, I think
that the number of broken jobs should be minimal if not non-existent.

I'm sure that one can simplify the way RestartStrategies are
programmatically configured as well but for the sake of simplicity/scoping
I'd like to not touch it right away.

What do you think about this behaviour change?

[1] https://issues.apache.org/jira/browse/FLINK-13921

Cheers,
Till
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Stephan Ewen
+1 in general

What is the default in batch, though? No restarts? I always found that
somewhat uncommon.
Should we also change that part, if we are changing the default anyways?


On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> wrote:

> Hi everyone,
>
> I wanted to discuss how to simplify Flink's cluster level RestartStrategy
> configuration [1]. Currently, Flink's behaviour with respect to configuring
> the {{RestartStrategies}} is quite complicated and convoluted. The reason
> for this is that we evolved the way it has been configured and wanted to
> keep it backwards compatible. Due to this, we have currently the following
> behaviour:
>
> * If the config option `restart-strategy` is configured, then Flink uses
> this `RestartStrategy` (so far so simple)
> * If the config option `restart-strategy` is not configured, then
> ** If `restart-strategy.fixed-delay.attempts` or
> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> restart-strategy.fixed-delay.delay)`
> ** If `restart-strategy.fixed-delay.attempts` and
> `restart-strategy.fixed-delay.delay` are not defined, then
> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> *** If checkpointing is enabled, then choose
> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>
> I would like to simplify the configuration by removing the "If
> `restart-strategy.fixed-delay.attempts` or
> `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic
> would be the following:
>
> * If the config option `restart-strategy` is configured, then Flink uses
> this `RestartStrategy`
> * If the config option `restart-strategy` is not configured, then
> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> ** If checkpointing is enabled, then choose
> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>
> That way we retain the user friendliness that jobs restart if the user
> enabled checkpointing and we make it clear that any `
> restart-strategy.fixed-delay.xyz` setting will only be respected if
> `restart-strategy` has been set to `fixed-delay`.
>
> This simplification would, however, change Flink's behaviour and might
> break existing setups. Since we introduced `RestartStrategies` with Flink
> 1.0.0 and deprecated the prior configuration mechanism which enables
> restarting if either the `attempts` or the `delay` has been set, I think
> that the number of broken jobs should be minimal if not non-existent.
>
> I'm sure that one can simplify the way RestartStrategies are
> programmatically configured as well but for the sake of simplicity/scoping
> I'd like to not touch it right away.
>
> What do you think about this behaviour change?
>
> [1] https://issues.apache.org/jira/browse/FLINK-13921
>
> Cheers,
> Till
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

dwysakowicz
Also +1 in general.

I have a few questions though:

- does it only apply to the logic in
org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
which is only the cluster side configuration? Or do you want to change
the logic also on the job side in ExecutionConfig?

- if the latter, does that mean deprecated methods in ExecutionConfig
like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
effect? I think this would be a good idea, but would suggest to remove
the corresponding fields and methods. This is not that simple though. I
tried to do that for other parameters that have no effect already like
codeAnalysisMode & failTaskOnCheckpointError. The are two problems:

    1) setNumberOfExecutionRetires are effectively marked with @Public
annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have
this problem). Therefore this would be a binary incompatible change.

    2) ExecutionConfig is stored in state as part of PojoSerializer in
pre flink 1.7. It should not be a problem for numberOfExecutionRetries &
executionRetryDelays as they are of primitive types. It is a problem for
codeAnalysisMode (we cannot remove the class, as this breaks
serialization). I wanted to mention that anyway, just to be aware of that.

Best,

Dawid

On 30/08/2019 14:48, Stephan Ewen wrote:

> +1 in general
>
> What is the default in batch, though? No restarts? I always found that
> somewhat uncommon.
> Should we also change that part, if we are changing the default anyways?
>
>
> On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> wrote:
>
>> Hi everyone,
>>
>> I wanted to discuss how to simplify Flink's cluster level RestartStrategy
>> configuration [1]. Currently, Flink's behaviour with respect to configuring
>> the {{RestartStrategies}} is quite complicated and convoluted. The reason
>> for this is that we evolved the way it has been configured and wanted to
>> keep it backwards compatible. Due to this, we have currently the following
>> behaviour:
>>
>> * If the config option `restart-strategy` is configured, then Flink uses
>> this `RestartStrategy` (so far so simple)
>> * If the config option `restart-strategy` is not configured, then
>> ** If `restart-strategy.fixed-delay.attempts` or
>> `restart-strategy.fixed-delay.delay` are defined, then instantiate
>> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
>> restart-strategy.fixed-delay.delay)`
>> ** If `restart-strategy.fixed-delay.attempts` and
>> `restart-strategy.fixed-delay.delay` are not defined, then
>> *** If checkpointing is disabled, then choose `NoRestartStrategy`
>> *** If checkpointing is enabled, then choose
>> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>>
>> I would like to simplify the configuration by removing the "If
>> `restart-strategy.fixed-delay.attempts` or
>> `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic
>> would be the following:
>>
>> * If the config option `restart-strategy` is configured, then Flink uses
>> this `RestartStrategy`
>> * If the config option `restart-strategy` is not configured, then
>> ** If checkpointing is disabled, then choose `NoRestartStrategy`
>> ** If checkpointing is enabled, then choose
>> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>>
>> That way we retain the user friendliness that jobs restart if the user
>> enabled checkpointing and we make it clear that any `
>> restart-strategy.fixed-delay.xyz` setting will only be respected if
>> `restart-strategy` has been set to `fixed-delay`.
>>
>> This simplification would, however, change Flink's behaviour and might
>> break existing setups. Since we introduced `RestartStrategies` with Flink
>> 1.0.0 and deprecated the prior configuration mechanism which enables
>> restarting if either the `attempts` or the `delay` has been set, I think
>> that the number of broken jobs should be minimal if not non-existent.
>>
>> I'm sure that one can simplify the way RestartStrategies are
>> programmatically configured as well but for the sake of simplicity/scoping
>> I'd like to not touch it right away.
>>
>> What do you think about this behaviour change?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-13921
>>
>> Cheers,
>> Till
>>


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
The current default behaviour for batch is `NoRestartStrategy` if nothing
is configured. We could say that we set the default value of
`restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
independent of the checkpointing. The only downside I could see is that
some faulty batch jobs might get stuck in a restart loop without reaching a
terminal state.

@Dawid, I don't intend to touch the ExecutionConfig. This change only
targets the cluster level configuration of the RestartStrategy.

Cheers,
Till

On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email]>
wrote:

> Also +1 in general.
>
> I have a few questions though:
>
> - does it only apply to the logic in
>
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> which is only the cluster side configuration? Or do you want to change
> the logic also on the job side in ExecutionConfig?
>
> - if the latter, does that mean deprecated methods in ExecutionConfig
> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> effect? I think this would be a good idea, but would suggest to remove
> the corresponding fields and methods. This is not that simple though. I
> tried to do that for other parameters that have no effect already like
> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
>
>     1) setNumberOfExecutionRetires are effectively marked with @Public
> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have
> this problem). Therefore this would be a binary incompatible change.
>
>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> pre flink 1.7. It should not be a problem for numberOfExecutionRetries &
> executionRetryDelays as they are of primitive types. It is a problem for
> codeAnalysisMode (we cannot remove the class, as this breaks
> serialization). I wanted to mention that anyway, just to be aware of that.
>
> Best,
>
> Dawid
>
> On 30/08/2019 14:48, Stephan Ewen wrote:
> > +1 in general
> >
> > What is the default in batch, though? No restarts? I always found that
> > somewhat uncommon.
> > Should we also change that part, if we are changing the default anyways?
> >
> >
> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]>
> wrote:
> >
> >> Hi everyone,
> >>
> >> I wanted to discuss how to simplify Flink's cluster level
> RestartStrategy
> >> configuration [1]. Currently, Flink's behaviour with respect to
> configuring
> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> reason
> >> for this is that we evolved the way it has been configured and wanted to
> >> keep it backwards compatible. Due to this, we have currently the
> following
> >> behaviour:
> >>
> >> * If the config option `restart-strategy` is configured, then Flink uses
> >> this `RestartStrategy` (so far so simple)
> >> * If the config option `restart-strategy` is not configured, then
> >> ** If `restart-strategy.fixed-delay.attempts` or
> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> >> restart-strategy.fixed-delay.delay)`
> >> ** If `restart-strategy.fixed-delay.attempts` and
> >> `restart-strategy.fixed-delay.delay` are not defined, then
> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> *** If checkpointing is enabled, then choose
> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >>
> >> I would like to simplify the configuration by removing the "If
> >> `restart-strategy.fixed-delay.attempts` or
> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the
> logic
> >> would be the following:
> >>
> >> * If the config option `restart-strategy` is configured, then Flink uses
> >> this `RestartStrategy`
> >> * If the config option `restart-strategy` is not configured, then
> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> ** If checkpointing is enabled, then choose
> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >>
> >> That way we retain the user friendliness that jobs restart if the user
> >> enabled checkpointing and we make it clear that any `
> >> restart-strategy.fixed-delay.xyz` setting will only be respected if
> >> `restart-strategy` has been set to `fixed-delay`.
> >>
> >> This simplification would, however, change Flink's behaviour and might
> >> break existing setups. Since we introduced `RestartStrategies` with
> Flink
> >> 1.0.0 and deprecated the prior configuration mechanism which enables
> >> restarting if either the `attempts` or the `delay` has been set, I think
> >> that the number of broken jobs should be minimal if not non-existent.
> >>
> >> I'm sure that one can simplify the way RestartStrategies are
> >> programmatically configured as well but for the sake of
> simplicity/scoping
> >> I'd like to not touch it right away.
> >>
> >> What do you think about this behaviour change?
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> >>
> >> Cheers,
> >> Till
> >>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
After an offline discussion with Stephan, we concluded that changing the
default restart strategy for batch jobs is not that easy because the
cluster level restart configuration does not necessarily know about the
type of job which is submitted. We concluded that we would like to keep the
batch behaviour as is (NoRestartStrategy) and revisit this issue at a later
point in time.

On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> wrote:

> The current default behaviour for batch is `NoRestartStrategy` if nothing
> is configured. We could say that we set the default value of
> `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> independent of the checkpointing. The only downside I could see is that
> some faulty batch jobs might get stuck in a restart loop without reaching a
> terminal state.
>
> @Dawid, I don't intend to touch the ExecutionConfig. This change only
> targets the cluster level configuration of the RestartStrategy.
>
> Cheers,
> Till
>
> On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email]>
> wrote:
>
>> Also +1 in general.
>>
>> I have a few questions though:
>>
>> - does it only apply to the logic in
>>
>> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
>> which is only the cluster side configuration? Or do you want to change
>> the logic also on the job side in ExecutionConfig?
>>
>> - if the latter, does that mean deprecated methods in ExecutionConfig
>> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
>> effect? I think this would be a good idea, but would suggest to remove
>> the corresponding fields and methods. This is not that simple though. I
>> tried to do that for other parameters that have no effect already like
>> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
>>
>>     1) setNumberOfExecutionRetires are effectively marked with @Public
>> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have
>> this problem). Therefore this would be a binary incompatible change.
>>
>>     2) ExecutionConfig is stored in state as part of PojoSerializer in
>> pre flink 1.7. It should not be a problem for numberOfExecutionRetries &
>> executionRetryDelays as they are of primitive types. It is a problem for
>> codeAnalysisMode (we cannot remove the class, as this breaks
>> serialization). I wanted to mention that anyway, just to be aware of that.
>>
>> Best,
>>
>> Dawid
>>
>> On 30/08/2019 14:48, Stephan Ewen wrote:
>> > +1 in general
>> >
>> > What is the default in batch, though? No restarts? I always found that
>> > somewhat uncommon.
>> > Should we also change that part, if we are changing the default anyways?
>> >
>> >
>> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]>
>> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> I wanted to discuss how to simplify Flink's cluster level
>> RestartStrategy
>> >> configuration [1]. Currently, Flink's behaviour with respect to
>> configuring
>> >> the {{RestartStrategies}} is quite complicated and convoluted. The
>> reason
>> >> for this is that we evolved the way it has been configured and wanted
>> to
>> >> keep it backwards compatible. Due to this, we have currently the
>> following
>> >> behaviour:
>> >>
>> >> * If the config option `restart-strategy` is configured, then Flink
>> uses
>> >> this `RestartStrategy` (so far so simple)
>> >> * If the config option `restart-strategy` is not configured, then
>> >> ** If `restart-strategy.fixed-delay.attempts` or
>> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
>> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
>> >> restart-strategy.fixed-delay.delay)`
>> >> ** If `restart-strategy.fixed-delay.attempts` and
>> >> `restart-strategy.fixed-delay.delay` are not defined, then
>> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
>> >> *** If checkpointing is enabled, then choose
>> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> >>
>> >> I would like to simplify the configuration by removing the "If
>> >> `restart-strategy.fixed-delay.attempts` or
>> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the
>> logic
>> >> would be the following:
>> >>
>> >> * If the config option `restart-strategy` is configured, then Flink
>> uses
>> >> this `RestartStrategy`
>> >> * If the config option `restart-strategy` is not configured, then
>> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
>> >> ** If checkpointing is enabled, then choose
>> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> >>
>> >> That way we retain the user friendliness that jobs restart if the user
>> >> enabled checkpointing and we make it clear that any `
>> >> restart-strategy.fixed-delay.xyz` setting will only be respected if
>> >> `restart-strategy` has been set to `fixed-delay`.
>> >>
>> >> This simplification would, however, change Flink's behaviour and might
>> >> break existing setups. Since we introduced `RestartStrategies` with
>> Flink
>> >> 1.0.0 and deprecated the prior configuration mechanism which enables
>> >> restarting if either the `attempts` or the `delay` has been set, I
>> think
>> >> that the number of broken jobs should be minimal if not non-existent.
>> >>
>> >> I'm sure that one can simplify the way RestartStrategies are
>> >> programmatically configured as well but for the sake of
>> simplicity/scoping
>> >> I'd like to not touch it right away.
>> >>
>> >> What do you think about this behaviour change?
>> >>
>> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
>> >>
>> >> Cheers,
>> >> Till
>> >>
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Becket Qin
+1. The new behavior makes sense to me.

BTW, we need a FLIP for this :)

On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> wrote:

> After an offline discussion with Stephan, we concluded that changing the
> default restart strategy for batch jobs is not that easy because the
> cluster level restart configuration does not necessarily know about the
> type of job which is submitted. We concluded that we would like to keep the
> batch behaviour as is (NoRestartStrategy) and revisit this issue at a later
> point in time.
>
> On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > The current default behaviour for batch is `NoRestartStrategy` if nothing
> > is configured. We could say that we set the default value of
> > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0
> s")`
> > independent of the checkpointing. The only downside I could see is that
> > some faulty batch jobs might get stuck in a restart loop without
> reaching a
> > terminal state.
> >
> > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > targets the cluster level configuration of the RestartStrategy.
> >
> > Cheers,
> > Till
> >
> > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email]
> >
> > wrote:
> >
> >> Also +1 in general.
> >>
> >> I have a few questions though:
> >>
> >> - does it only apply to the logic in
> >>
> >>
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> >> which is only the cluster side configuration? Or do you want to change
> >> the logic also on the job side in ExecutionConfig?
> >>
> >> - if the latter, does that mean deprecated methods in ExecutionConfig
> >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> >> effect? I think this would be a good idea, but would suggest to remove
> >> the corresponding fields and methods. This is not that simple though. I
> >> tried to do that for other parameters that have no effect already like
> >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> >>
> >>     1) setNumberOfExecutionRetires are effectively marked with @Public
> >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have
> >> this problem). Therefore this would be a binary incompatible change.
> >>
> >>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> >> pre flink 1.7. It should not be a problem for numberOfExecutionRetries &
> >> executionRetryDelays as they are of primitive types. It is a problem for
> >> codeAnalysisMode (we cannot remove the class, as this breaks
> >> serialization). I wanted to mention that anyway, just to be aware of
> that.
> >>
> >> Best,
> >>
> >> Dawid
> >>
> >> On 30/08/2019 14:48, Stephan Ewen wrote:
> >> > +1 in general
> >> >
> >> > What is the default in batch, though? No restarts? I always found that
> >> > somewhat uncommon.
> >> > Should we also change that part, if we are changing the default
> anyways?
> >> >
> >> >
> >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]>
> >> wrote:
> >> >
> >> >> Hi everyone,
> >> >>
> >> >> I wanted to discuss how to simplify Flink's cluster level
> >> RestartStrategy
> >> >> configuration [1]. Currently, Flink's behaviour with respect to
> >> configuring
> >> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> >> reason
> >> >> for this is that we evolved the way it has been configured and wanted
> >> to
> >> >> keep it backwards compatible. Due to this, we have currently the
> >> following
> >> >> behaviour:
> >> >>
> >> >> * If the config option `restart-strategy` is configured, then Flink
> >> uses
> >> >> this `RestartStrategy` (so far so simple)
> >> >> * If the config option `restart-strategy` is not configured, then
> >> >> ** If `restart-strategy.fixed-delay.attempts` or
> >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> >> >> restart-strategy.fixed-delay.delay)`
> >> >> ** If `restart-strategy.fixed-delay.attempts` and
> >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> >> *** If checkpointing is enabled, then choose
> >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >> >>
> >> >> I would like to simplify the configuration by removing the "If
> >> >> `restart-strategy.fixed-delay.attempts` or
> >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the
> >> logic
> >> >> would be the following:
> >> >>
> >> >> * If the config option `restart-strategy` is configured, then Flink
> >> uses
> >> >> this `RestartStrategy`
> >> >> * If the config option `restart-strategy` is not configured, then
> >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> >> ** If checkpointing is enabled, then choose
> >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >> >>
> >> >> That way we retain the user friendliness that jobs restart if the
> user
> >> >> enabled checkpointing and we make it clear that any `
> >> >> restart-strategy.fixed-delay.xyz` setting will only be respected if
> >> >> `restart-strategy` has been set to `fixed-delay`.
> >> >>
> >> >> This simplification would, however, change Flink's behaviour and
> might
> >> >> break existing setups. Since we introduced `RestartStrategies` with
> >> Flink
> >> >> 1.0.0 and deprecated the prior configuration mechanism which enables
> >> >> restarting if either the `attempts` or the `delay` has been set, I
> >> think
> >> >> that the number of broken jobs should be minimal if not non-existent.
> >> >>
> >> >> I'm sure that one can simplify the way RestartStrategies are
> >> >> programmatically configured as well but for the sake of
> >> simplicity/scoping
> >> >> I'd like to not touch it right away.
> >> >>
> >> >> What do you think about this behaviour change?
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> >> >>
> >> >> Cheers,
> >> >> Till
> >> >>
> >>
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Zhu Zhu
+1 to simplify the RestartStrategy configuration

One thing to confirm is whether the default delay should be "0 s" in the
case of
"If the config option `restart-strategy` is not configured" and "If
checkpointing is enabled".
I see a related discussion([SURVEY] Is the default restart delay of 0s
causing problems) is ongoing and we may need to take the result from that.

Thanks,
Zhu Zhu

Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道:

> +1. The new behavior makes sense to me.
>
> BTW, we need a FLIP for this :)
>
> On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > After an offline discussion with Stephan, we concluded that changing the
> > default restart strategy for batch jobs is not that easy because the
> > cluster level restart configuration does not necessarily know about the
> > type of job which is submitted. We concluded that we would like to keep
> the
> > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
> later
> > point in time.
> >
> > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > The current default behaviour for batch is `NoRestartStrategy` if
> nothing
> > > is configured. We could say that we set the default value of
> > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0
> > s")`
> > > independent of the checkpointing. The only downside I could see is that
> > > some faulty batch jobs might get stuck in a restart loop without
> > reaching a
> > > terminal state.
> > >
> > > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > > targets the cluster level configuration of the RestartStrategy.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
> [hidden email]
> > >
> > > wrote:
> > >
> > >> Also +1 in general.
> > >>
> > >> I have a few questions though:
> > >>
> > >> - does it only apply to the logic in
> > >>
> > >>
> >
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> > >> which is only the cluster side configuration? Or do you want to change
> > >> the logic also on the job side in ExecutionConfig?
> > >>
> > >> - if the latter, does that mean deprecated methods in ExecutionConfig
> > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> > >> effect? I think this would be a good idea, but would suggest to remove
> > >> the corresponding fields and methods. This is not that simple though.
> I
> > >> tried to do that for other parameters that have no effect already like
> > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> > >>
> > >>     1) setNumberOfExecutionRetires are effectively marked with @Public
> > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
> have
> > >> this problem). Therefore this would be a binary incompatible change.
> > >>
> > >>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> > >> pre flink 1.7. It should not be a problem for
> numberOfExecutionRetries &
> > >> executionRetryDelays as they are of primitive types. It is a problem
> for
> > >> codeAnalysisMode (we cannot remove the class, as this breaks
> > >> serialization). I wanted to mention that anyway, just to be aware of
> > that.
> > >>
> > >> Best,
> > >>
> > >> Dawid
> > >>
> > >> On 30/08/2019 14:48, Stephan Ewen wrote:
> > >> > +1 in general
> > >> >
> > >> > What is the default in batch, though? No restarts? I always found
> that
> > >> > somewhat uncommon.
> > >> > Should we also change that part, if we are changing the default
> > anyways?
> > >> >
> > >> >
> > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]
> >
> > >> wrote:
> > >> >
> > >> >> Hi everyone,
> > >> >>
> > >> >> I wanted to discuss how to simplify Flink's cluster level
> > >> RestartStrategy
> > >> >> configuration [1]. Currently, Flink's behaviour with respect to
> > >> configuring
> > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> > >> reason
> > >> >> for this is that we evolved the way it has been configured and
> wanted
> > >> to
> > >> >> keep it backwards compatible. Due to this, we have currently the
> > >> following
> > >> >> behaviour:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy` (so far so simple)
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> > >> >> restart-strategy.fixed-delay.delay)`
> > >> >> ** If `restart-strategy.fixed-delay.attempts` and
> > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> *** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> I would like to simplify the configuration by removing the "If
> > >> >> `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
> the
> > >> logic
> > >> >> would be the following:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy`
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> ** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> That way we retain the user friendliness that jobs restart if the
> > user
> > >> >> enabled checkpointing and we make it clear that any `
> > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected
> if
> > >> >> `restart-strategy` has been set to `fixed-delay`.
> > >> >>
> > >> >> This simplification would, however, change Flink's behaviour and
> > might
> > >> >> break existing setups. Since we introduced `RestartStrategies` with
> > >> Flink
> > >> >> 1.0.0 and deprecated the prior configuration mechanism which
> enables
> > >> >> restarting if either the `attempts` or the `delay` has been set, I
> > >> think
> > >> >> that the number of broken jobs should be minimal if not
> non-existent.
> > >> >>
> > >> >> I'm sure that one can simplify the way RestartStrategies are
> > >> >> programmatically configured as well but for the sake of
> > >> simplicity/scoping
> > >> >> I'd like to not touch it right away.
> > >> >>
> > >> >> What do you think about this behaviour change?
> > >> >>
> > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> > >> >>
> > >> >> Cheers,
> > >> >> Till
> > >> >>
> > >>
> > >>
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Zhijiang(wangzhijiang999)
+1 for this proposal.

IMO, it not only simplifies the cluster configuration, but also seems more fit logic to not rely on some low-level speicific parameters to judge the upper-level strategy.
It is also resonable to push forward the restart strategy configuration step by step for batch later.

Best,
Zhijiang
------------------------------------------------------------------
From:Zhu Zhu <[hidden email]>
Send Time:2019年9月2日(星期一) 05:18
To:dev <[hidden email]>
Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

+1 to simplify the RestartStrategy configuration

One thing to confirm is whether the default delay should be "0 s" in the
case of
"If the config option `restart-strategy` is not configured" and "If
checkpointing is enabled".
I see a related discussion([SURVEY] Is the default restart delay of 0s
causing problems) is ongoing and we may need to take the result from that.

Thanks,
Zhu Zhu

Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道:

> +1. The new behavior makes sense to me.
>
> BTW, we need a FLIP for this :)
>
> On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > After an offline discussion with Stephan, we concluded that changing the
> > default restart strategy for batch jobs is not that easy because the
> > cluster level restart configuration does not necessarily know about the
> > type of job which is submitted. We concluded that we would like to keep
> the
> > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
> later
> > point in time.
> >
> > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > The current default behaviour for batch is `NoRestartStrategy` if
> nothing
> > > is configured. We could say that we set the default value of
> > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0
> > s")`
> > > independent of the checkpointing. The only downside I could see is that
> > > some faulty batch jobs might get stuck in a restart loop without
> > reaching a
> > > terminal state.
> > >
> > > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > > targets the cluster level configuration of the RestartStrategy.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
> [hidden email]
> > >
> > > wrote:
> > >
> > >> Also +1 in general.
> > >>
> > >> I have a few questions though:
> > >>
> > >> - does it only apply to the logic in
> > >>
> > >>
> >
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> > >> which is only the cluster side configuration? Or do you want to change
> > >> the logic also on the job side in ExecutionConfig?
> > >>
> > >> - if the latter, does that mean deprecated methods in ExecutionConfig
> > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> > >> effect? I think this would be a good idea, but would suggest to remove
> > >> the corresponding fields and methods. This is not that simple though.
> I
> > >> tried to do that for other parameters that have no effect already like
> > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> > >>
> > >>     1) setNumberOfExecutionRetires are effectively marked with @Public
> > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
> have
> > >> this problem). Therefore this would be a binary incompatible change.
> > >>
> > >>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> > >> pre flink 1.7. It should not be a problem for
> numberOfExecutionRetries &
> > >> executionRetryDelays as they are of primitive types. It is a problem
> for
> > >> codeAnalysisMode (we cannot remove the class, as this breaks
> > >> serialization). I wanted to mention that anyway, just to be aware of
> > that.
> > >>
> > >> Best,
> > >>
> > >> Dawid
> > >>
> > >> On 30/08/2019 14:48, Stephan Ewen wrote:
> > >> > +1 in general
> > >> >
> > >> > What is the default in batch, though? No restarts? I always found
> that
> > >> > somewhat uncommon.
> > >> > Should we also change that part, if we are changing the default
> > anyways?
> > >> >
> > >> >
> > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]
> >
> > >> wrote:
> > >> >
> > >> >> Hi everyone,
> > >> >>
> > >> >> I wanted to discuss how to simplify Flink's cluster level
> > >> RestartStrategy
> > >> >> configuration [1]. Currently, Flink's behaviour with respect to
> > >> configuring
> > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> > >> reason
> > >> >> for this is that we evolved the way it has been configured and
> wanted
> > >> to
> > >> >> keep it backwards compatible. Due to this, we have currently the
> > >> following
> > >> >> behaviour:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy` (so far so simple)
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> > >> >> restart-strategy.fixed-delay.delay)`
> > >> >> ** If `restart-strategy.fixed-delay.attempts` and
> > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> *** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> I would like to simplify the configuration by removing the "If
> > >> >> `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
> the
> > >> logic
> > >> >> would be the following:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy`
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> ** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> That way we retain the user friendliness that jobs restart if the
> > user
> > >> >> enabled checkpointing and we make it clear that any `
> > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected
> if
> > >> >> `restart-strategy` has been set to `fixed-delay`.
> > >> >>
> > >> >> This simplification would, however, change Flink's behaviour and
> > might
> > >> >> break existing setups. Since we introduced `RestartStrategies` with
> > >> Flink
> > >> >> 1.0.0 and deprecated the prior configuration mechanism which
> enables
> > >> >> restarting if either the `attempts` or the `delay` has been set, I
> > >> think
> > >> >> that the number of broken jobs should be minimal if not
> non-existent.
> > >> >>
> > >> >> I'm sure that one can simplify the way RestartStrategies are
> > >> >> programmatically configured as well but for the sake of
> > >> simplicity/scoping
> > >> >> I'd like to not touch it right away.
> > >> >>
> > >> >> What do you think about this behaviour change?
> > >> >>
> > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> > >> >>
> > >> >> Cheers,
> > >> >> Till
> > >> >>
> > >>
> > >>
> >
>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
Thanks a lot for the positive feedback. I think you are right Becket that
this needs a FLIP since it changes Flink's behaviour. I'll create one and
post it to the dev ML.

@Zhu Zhu <[hidden email]>  I agree that the restart delay is related to
the RestartStrategy configuration. However, I would like to exclude it from
this improvement proposal since it would broaden the scope unnecessarily.
I'd suggest to continue the discussion about this on the SURVEY thread and
then based on the outcome start a DISCUSS thread about the concrete changes
to the default restart delay value.

Since I will create a FLIP for the simplification logic, I'll conclude this
thread and would encourage everyone to move all further discussion to the
Flip DISCUSS thread. I'll post the link to it shortly.

Cheers,
Till

On Mon, Sep 2, 2019 at 6:22 AM zhijiang <[hidden email]>
wrote:

> +1 for this proposal.
>
> IMO, it not only simplifies the cluster configuration, but also seems more
> fit logic to not rely on some low-level speicific parameters to judge the
> upper-level strategy.
> It is also resonable to push forward the restart strategy configuration
> step by step for batch later.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From:Zhu Zhu <[hidden email]>
> Send Time:2019年9月2日(星期一) 05:18
> To:dev <[hidden email]>
> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy
> configuration
>
> +1 to simplify the RestartStrategy configuration
>
> One thing to confirm is whether the default delay should be "0 s" in the
> case of
> "If the config option `restart-strategy` is not configured" and "If
> checkpointing is enabled".
> I see a related discussion([SURVEY] Is the default restart delay of 0s
> causing problems) is ongoing and we may need to take the result from that.
>
> Thanks,
> Zhu Zhu
>
> Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道:
>
> > +1. The new behavior makes sense to me.
> >
> > BTW, we need a FLIP for this :)
> >
> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > After an offline discussion with Stephan, we concluded that changing
> the
> > > default restart strategy for batch jobs is not that easy because the
> > > cluster level restart configuration does not necessarily know about the
> > > type of job which is submitted. We concluded that we would like to keep
> > the
> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
> > later
> > > point in time.
> > >
> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > The current default behaviour for batch is `NoRestartStrategy` if
> > nothing
> > > > is configured. We could say that we set the default value of
> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE,
> "0
> > > s")`
> > > > independent of the checkpointing. The only downside I could see is
> that
> > > > some faulty batch jobs might get stuck in a restart loop without
> > > reaching a
> > > > terminal state.
> > > >
> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > > > targets the cluster level configuration of the RestartStrategy.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
> > [hidden email]
> > > >
> > > > wrote:
> > > >
> > > >> Also +1 in general.
> > > >>
> > > >> I have a few questions though:
> > > >>
> > > >> - does it only apply to the logic in
> > > >>
> > > >>
> > >
> >
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> > > >> which is only the cluster side configuration? Or do you want to
> change
> > > >> the logic also on the job side in ExecutionConfig?
> > > >>
> > > >> - if the latter, does that mean deprecated methods in
> ExecutionConfig
> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have
> no
> > > >> effect? I think this would be a good idea, but would suggest to
> remove
> > > >> the corresponding fields and methods. This is not that simple
> though.
> > I
> > > >> tried to do that for other parameters that have no effect already
> like
> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> > > >>
> > > >>     1) setNumberOfExecutionRetires are effectively marked with
> @Public
> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
> > have
> > > >> this problem). Therefore this would be a binary incompatible change.
> > > >>
> > > >>     2) ExecutionConfig is stored in state as part of PojoSerializer
> in
> > > >> pre flink 1.7. It should not be a problem for
> > numberOfExecutionRetries &
> > > >> executionRetryDelays as they are of primitive types. It is a problem
> > for
> > > >> codeAnalysisMode (we cannot remove the class, as this breaks
> > > >> serialization). I wanted to mention that anyway, just to be aware of
> > > that.
> > > >>
> > > >> Best,
> > > >>
> > > >> Dawid
> > > >>
> > > >> On 30/08/2019 14:48, Stephan Ewen wrote:
> > > >> > +1 in general
> > > >> >
> > > >> > What is the default in batch, though? No restarts? I always found
> > that
> > > >> > somewhat uncommon.
> > > >> > Should we also change that part, if we are changing the default
> > > anyways?
> > > >> >
> > > >> >
> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <
> [hidden email]
> > >
> > > >> wrote:
> > > >> >
> > > >> >> Hi everyone,
> > > >> >>
> > > >> >> I wanted to discuss how to simplify Flink's cluster level
> > > >> RestartStrategy
> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to
> > > >> configuring
> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted.
> The
> > > >> reason
> > > >> >> for this is that we evolved the way it has been configured and
> > wanted
> > > >> to
> > > >> >> keep it backwards compatible. Due to this, we have currently the
> > > >> following
> > > >> >> behaviour:
> > > >> >>
> > > >> >> * If the config option `restart-strategy` is configured, then
> Flink
> > > >> uses
> > > >> >> this `RestartStrategy` (so far so simple)
> > > >> >> * If the config option `restart-strategy` is not configured, then
> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or
> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then
> instantiate
> > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> > > >> >> restart-strategy.fixed-delay.delay)`
> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and
> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> > > >> >> *** If checkpointing is enabled, then choose
> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > > >> >>
> > > >> >> I would like to simplify the configuration by removing the "If
> > > >> >> `restart-strategy.fixed-delay.attempts` or
> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
> > the
> > > >> logic
> > > >> >> would be the following:
> > > >> >>
> > > >> >> * If the config option `restart-strategy` is configured, then
> Flink
> > > >> uses
> > > >> >> this `RestartStrategy`
> > > >> >> * If the config option `restart-strategy` is not configured, then
> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> > > >> >> ** If checkpointing is enabled, then choose
> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > > >> >>
> > > >> >> That way we retain the user friendliness that jobs restart if the
> > > user
> > > >> >> enabled checkpointing and we make it clear that any `
> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected
> > if
> > > >> >> `restart-strategy` has been set to `fixed-delay`.
> > > >> >>
> > > >> >> This simplification would, however, change Flink's behaviour and
> > > might
> > > >> >> break existing setups. Since we introduced `RestartStrategies`
> with
> > > >> Flink
> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which
> > enables
> > > >> >> restarting if either the `attempts` or the `delay` has been set,
> I
> > > >> think
> > > >> >> that the number of broken jobs should be minimal if not
> > non-existent.
> > > >> >>
> > > >> >> I'm sure that one can simplify the way RestartStrategies are
> > > >> >> programmatically configured as well but for the sake of
> > > >> simplicity/scoping
> > > >> >> I'd like to not touch it right away.
> > > >> >>
> > > >> >> What do you think about this behaviour change?
> > > >> >>
> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> > > >> >>
> > > >> >> Cheers,
> > > >> >> Till
> > > >> >>
> > > >>
> > > >>
> > >
> >
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
The link to the dev ML discussion about FLIP-61 is
https://lists.apache.org/thread.html/e206390127bcbd9b24d9c41a838faa75157e468e01552ad241e3e24b@%3Cdev.flink.apache.org%3E

Cheers,
Till

On Mon, Sep 2, 2019 at 10:37 PM Till Rohrmann <[hidden email]> wrote:

> Thanks a lot for the positive feedback. I think you are right Becket that
> this needs a FLIP since it changes Flink's behaviour. I'll create one and
> post it to the dev ML.
>
> @Zhu Zhu <[hidden email]>  I agree that the restart delay is related
> to the RestartStrategy configuration. However, I would like to exclude it
> from this improvement proposal since it would broaden the scope
> unnecessarily. I'd suggest to continue the discussion about this on the
> SURVEY thread and then based on the outcome start a DISCUSS thread about
> the concrete changes to the default restart delay value.
>
> Since I will create a FLIP for the simplification logic, I'll conclude
> this thread and would encourage everyone to move all further discussion to
> the Flip DISCUSS thread. I'll post the link to it shortly.
>
> Cheers,
> Till
>
> On Mon, Sep 2, 2019 at 6:22 AM zhijiang <[hidden email]>
> wrote:
>
>> +1 for this proposal.
>>
>> IMO, it not only simplifies the cluster configuration, but also seems
>> more fit logic to not rely on some low-level speicific parameters to judge
>> the upper-level strategy.
>> It is also resonable to push forward the restart strategy configuration
>> step by step for batch later.
>>
>> Best,
>> Zhijiang
>> ------------------------------------------------------------------
>> From:Zhu Zhu <[hidden email]>
>> Send Time:2019年9月2日(星期一) 05:18
>> To:dev <[hidden email]>
>> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy
>> configuration
>>
>> +1 to simplify the RestartStrategy configuration
>>
>> One thing to confirm is whether the default delay should be "0 s" in the
>> case of
>> "If the config option `restart-strategy` is not configured" and "If
>> checkpointing is enabled".
>> I see a related discussion([SURVEY] Is the default restart delay of 0s
>> causing problems) is ongoing and we may need to take the result from that.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道:
>>
>> > +1. The new behavior makes sense to me.
>> >
>> > BTW, we need a FLIP for this :)
>> >
>> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]>
>> > wrote:
>> >
>> > > After an offline discussion with Stephan, we concluded that changing
>> the
>> > > default restart strategy for batch jobs is not that easy because the
>> > > cluster level restart configuration does not necessarily know about
>> the
>> > > type of job which is submitted. We concluded that we would like to
>> keep
>> > the
>> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
>> > later
>> > > point in time.
>> > >
>> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]>
>> > > wrote:
>> > >
>> > > > The current default behaviour for batch is `NoRestartStrategy` if
>> > nothing
>> > > > is configured. We could say that we set the default value of
>> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE,
>> "0
>> > > s")`
>> > > > independent of the checkpointing. The only downside I could see is
>> that
>> > > > some faulty batch jobs might get stuck in a restart loop without
>> > > reaching a
>> > > > terminal state.
>> > > >
>> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change
>> only
>> > > > targets the cluster level configuration of the RestartStrategy.
>> > > >
>> > > > Cheers,
>> > > > Till
>> > > >
>> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
>> > [hidden email]
>> > > >
>> > > > wrote:
>> > > >
>> > > >> Also +1 in general.
>> > > >>
>> > > >> I have a few questions though:
>> > > >>
>> > > >> - does it only apply to the logic in
>> > > >>
>> > > >>
>> > >
>> >
>> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
>> > > >> which is only the cluster side configuration? Or do you want to
>> change
>> > > >> the logic also on the job side in ExecutionConfig?
>> > > >>
>> > > >> - if the latter, does that mean deprecated methods in
>> ExecutionConfig
>> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will
>> have no
>> > > >> effect? I think this would be a good idea, but would suggest to
>> remove
>> > > >> the corresponding fields and methods. This is not that simple
>> though.
>> > I
>> > > >> tried to do that for other parameters that have no effect already
>> like
>> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
>> > > >>
>> > > >>     1) setNumberOfExecutionRetires are effectively marked with
>> @Public
>> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
>> > have
>> > > >> this problem). Therefore this would be a binary incompatible
>> change.
>> > > >>
>> > > >>     2) ExecutionConfig is stored in state as part of
>> PojoSerializer in
>> > > >> pre flink 1.7. It should not be a problem for
>> > numberOfExecutionRetries &
>> > > >> executionRetryDelays as they are of primitive types. It is a
>> problem
>> > for
>> > > >> codeAnalysisMode (we cannot remove the class, as this breaks
>> > > >> serialization). I wanted to mention that anyway, just to be aware
>> of
>> > > that.
>> > > >>
>> > > >> Best,
>> > > >>
>> > > >> Dawid
>> > > >>
>> > > >> On 30/08/2019 14:48, Stephan Ewen wrote:
>> > > >> > +1 in general
>> > > >> >
>> > > >> > What is the default in batch, though? No restarts? I always found
>> > that
>> > > >> > somewhat uncommon.
>> > > >> > Should we also change that part, if we are changing the default
>> > > anyways?
>> > > >> >
>> > > >> >
>> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <
>> [hidden email]
>> > >
>> > > >> wrote:
>> > > >> >
>> > > >> >> Hi everyone,
>> > > >> >>
>> > > >> >> I wanted to discuss how to simplify Flink's cluster level
>> > > >> RestartStrategy
>> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to
>> > > >> configuring
>> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted.
>> The
>> > > >> reason
>> > > >> >> for this is that we evolved the way it has been configured and
>> > wanted
>> > > >> to
>> > > >> >> keep it backwards compatible. Due to this, we have currently the
>> > > >> following
>> > > >> >> behaviour:
>> > > >> >>
>> > > >> >> * If the config option `restart-strategy` is configured, then
>> Flink
>> > > >> uses
>> > > >> >> this `RestartStrategy` (so far so simple)
>> > > >> >> * If the config option `restart-strategy` is not configured,
>> then
>> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or
>> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then
>> instantiate
>> > > >> >>
>> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
>> > > >> >> restart-strategy.fixed-delay.delay)`
>> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and
>> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
>> > > >> >> *** If checkpointing is disabled, then choose
>> `NoRestartStrategy`
>> > > >> >> *** If checkpointing is enabled, then choose
>> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> > > >> >>
>> > > >> >> I would like to simplify the configuration by removing the "If
>> > > >> >> `restart-strategy.fixed-delay.attempts` or
>> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
>> > the
>> > > >> logic
>> > > >> >> would be the following:
>> > > >> >>
>> > > >> >> * If the config option `restart-strategy` is configured, then
>> Flink
>> > > >> uses
>> > > >> >> this `RestartStrategy`
>> > > >> >> * If the config option `restart-strategy` is not configured,
>> then
>> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
>> > > >> >> ** If checkpointing is enabled, then choose
>> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> > > >> >>
>> > > >> >> That way we retain the user friendliness that jobs restart if
>> the
>> > > user
>> > > >> >> enabled checkpointing and we make it clear that any `
>> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be
>> respected
>> > if
>> > > >> >> `restart-strategy` has been set to `fixed-delay`.
>> > > >> >>
>> > > >> >> This simplification would, however, change Flink's behaviour and
>> > > might
>> > > >> >> break existing setups. Since we introduced `RestartStrategies`
>> with
>> > > >> Flink
>> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which
>> > enables
>> > > >> >> restarting if either the `attempts` or the `delay` has been
>> set, I
>> > > >> think
>> > > >> >> that the number of broken jobs should be minimal if not
>> > non-existent.
>> > > >> >>
>> > > >> >> I'm sure that one can simplify the way RestartStrategies are
>> > > >> >> programmatically configured as well but for the sake of
>> > > >> simplicity/scoping
>> > > >> >> I'd like to not touch it right away.
>> > > >> >>
>> > > >> >> What do you think about this behaviour change?
>> > > >> >>
>> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
>> > > >> >>
>> > > >> >> Cheers,
>> > > >> >> Till
>> > > >> >>
>> > > >>
>> > > >>
>> > >
>> >
>>
>>