[DISCUSS] FLIP-61 Simplify Flink's cluster level RestartStrategy configuration

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] FLIP-61 Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
Hi everyone,

I'd like to discuss FLIP-61 [1] which tries to simplify Flink's cluster
lever RestartStrategy configuration.

Currently, Flink's behaviour with respect to configuring the
`RestartStrategies` is quite complicated and convoluted. The reason for
this is that we evolved the way it has been configured and wanted to keep
it backwards compatible. Due to this, we have currently the following
behaviour:

* If the config option `restart-strategy` is configured, then Flink uses
this `RestartStrategy` (so far so simple)
* If the config option `restart-strategy` is not configured, then
** If `restart-strategy.fixed-delay.attempts` or
`restart-strategy.fixed-delay.delay` are defined, then instantiate
`FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
restart-strategy.fixed-delay.delay)`
** If `restart-strategy.fixed-delay.attempts` and
`restart-strategy.fixed-delay.delay` are not defined, then
*** If checkpointing is disabled, then choose `NoRestartStrategy`
*** If checkpointing is enabled, then choose
`FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`

I would like to simplify the configuration by removing the "If
`restart-strategy.fixed-delay.attempts` or
`restart-strategy.fixed-delay.delay`, then" condition. That way, the logic
would be the following:

* If the config option `restart-strategy` is configured, then Flink uses
this `RestartStrategy`
* If the config option `restart-strategy` is not configured, then
** If checkpointing is disabled, then choose `NoRestartStrategy`
** If checkpointing is enabled, then choose
`FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`

That way we retain the user friendliness that jobs restart if the user
enabled checkpointing and we make it clear that any `
restart-strategy.fixed-delay.xyz` setting will only be respected if
`restart-strategy` has been set to `fixed-delay`.

This simplification would, however, change Flink's behaviour and might
break existing setups. Since we introduced `RestartStrategies` with Flink
1.0.0 and deprecated the prior configuration mechanism which enables
restarting if either the `attempts` or the `delay` has been set, I think
that the number of broken jobs should be minimal if not non-existent.

There has been a previous discuss thread which is now being abandoned [2].

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-61+Simplify+Flink%27s+cluster+level+RestartStrategy+configuration
[2]
https://lists.apache.org/thread.html/80bef7146f9696f35b1e50ff4acdd1cc3e87ae6f212d205aa7a72182@%3Cdev.flink.apache.org%3E

Cheers,
Till
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-61 Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann
I guess that most things have already been said on the related discussion
thread [1]. Hence I will start a vote about this FLIP.

[1]
https://lists.apache.org/thread.html/80bef7146f9696f35b1e50ff4acdd1cc3e87ae6f212d205aa7a72182@%3Cdev.flink.apache.org%3E

Cheers,
Till

On Mon, Sep 2, 2019 at 10:56 PM Till Rohrmann <[hidden email]> wrote:

> Hi everyone,
>
> I'd like to discuss FLIP-61 [1] which tries to simplify Flink's cluster
> lever RestartStrategy configuration.
>
> Currently, Flink's behaviour with respect to configuring the
> `RestartStrategies` is quite complicated and convoluted. The reason for
> this is that we evolved the way it has been configured and wanted to keep
> it backwards compatible. Due to this, we have currently the following
> behaviour:
>
> * If the config option `restart-strategy` is configured, then Flink uses
> this `RestartStrategy` (so far so simple)
> * If the config option `restart-strategy` is not configured, then
> ** If `restart-strategy.fixed-delay.attempts` or
> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> restart-strategy.fixed-delay.delay)`
> ** If `restart-strategy.fixed-delay.attempts` and
> `restart-strategy.fixed-delay.delay` are not defined, then
> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> *** If checkpointing is enabled, then choose
> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>
> I would like to simplify the configuration by removing the "If
> `restart-strategy.fixed-delay.attempts` or
> `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic
> would be the following:
>
> * If the config option `restart-strategy` is configured, then Flink uses
> this `RestartStrategy`
> * If the config option `restart-strategy` is not configured, then
> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> ** If checkpointing is enabled, then choose
> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>
> That way we retain the user friendliness that jobs restart if the user
> enabled checkpointing and we make it clear that any `
> restart-strategy.fixed-delay.xyz` setting will only be respected if
> `restart-strategy` has been set to `fixed-delay`.
>
> This simplification would, however, change Flink's behaviour and might
> break existing setups. Since we introduced `RestartStrategies` with Flink
> 1.0.0 and deprecated the prior configuration mechanism which enables
> restarting if either the `attempts` or the `delay` has been set, I think
> that the number of broken jobs should be minimal if not non-existent.
>
> There has been a previous discuss thread which is now being abandoned [2].
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-61+Simplify+Flink%27s+cluster+level+RestartStrategy+configuration
> [2]
> https://lists.apache.org/thread.html/80bef7146f9696f35b1e50ff4acdd1cc3e87ae6f212d205aa7a72182@%3Cdev.flink.apache.org%3E
>
> Cheers,
> Till
>