Hi everyone,
I wanted to discuss how to simplify Flink's cluster level RestartStrategy configuration [1]. Currently, Flink's behaviour with respect to configuring the {{RestartStrategies}} is quite complicated and convoluted. The reason for this is that we evolved the way it has been configured and wanted to keep it backwards compatible. Due to this, we have currently the following behaviour: * If the config option `restart-strategy` is configured, then Flink uses this `RestartStrategy` (so far so simple) * If the config option `restart-strategy` is not configured, then ** If `restart-strategy.fixed-delay.attempts` or `restart-strategy.fixed-delay.delay` are defined, then instantiate `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, restart-strategy.fixed-delay.delay)` ** If `restart-strategy.fixed-delay.attempts` and `restart-strategy.fixed-delay.delay` are not defined, then *** If checkpointing is disabled, then choose `NoRestartStrategy` *** If checkpointing is enabled, then choose `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` I would like to simplify the configuration by removing the "If `restart-strategy.fixed-delay.attempts` or `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic would be the following: * If the config option `restart-strategy` is configured, then Flink uses this `RestartStrategy` * If the config option `restart-strategy` is not configured, then ** If checkpointing is disabled, then choose `NoRestartStrategy` ** If checkpointing is enabled, then choose `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` That way we retain the user friendliness that jobs restart if the user enabled checkpointing and we make it clear that any ` restart-strategy.fixed-delay.xyz` setting will only be respected if `restart-strategy` has been set to `fixed-delay`. This simplification would, however, change Flink's behaviour and might break existing setups. Since we introduced `RestartStrategies` with Flink 1.0.0 and deprecated the prior configuration mechanism which enables restarting if either the `attempts` or the `delay` has been set, I think that the number of broken jobs should be minimal if not non-existent. I'm sure that one can simplify the way RestartStrategies are programmatically configured as well but for the sake of simplicity/scoping I'd like to not touch it right away. What do you think about this behaviour change? [1] https://issues.apache.org/jira/browse/FLINK-13921 Cheers, Till |
+1 in general
What is the default in batch, though? No restarts? I always found that somewhat uncommon. Should we also change that part, if we are changing the default anyways? On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> wrote: > Hi everyone, > > I wanted to discuss how to simplify Flink's cluster level RestartStrategy > configuration [1]. Currently, Flink's behaviour with respect to configuring > the {{RestartStrategies}} is quite complicated and convoluted. The reason > for this is that we evolved the way it has been configured and wanted to > keep it backwards compatible. Due to this, we have currently the following > behaviour: > > * If the config option `restart-strategy` is configured, then Flink uses > this `RestartStrategy` (so far so simple) > * If the config option `restart-strategy` is not configured, then > ** If `restart-strategy.fixed-delay.attempts` or > `restart-strategy.fixed-delay.delay` are defined, then instantiate > `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > restart-strategy.fixed-delay.delay)` > ** If `restart-strategy.fixed-delay.attempts` and > `restart-strategy.fixed-delay.delay` are not defined, then > *** If checkpointing is disabled, then choose `NoRestartStrategy` > *** If checkpointing is enabled, then choose > `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > I would like to simplify the configuration by removing the "If > `restart-strategy.fixed-delay.attempts` or > `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic > would be the following: > > * If the config option `restart-strategy` is configured, then Flink uses > this `RestartStrategy` > * If the config option `restart-strategy` is not configured, then > ** If checkpointing is disabled, then choose `NoRestartStrategy` > ** If checkpointing is enabled, then choose > `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > That way we retain the user friendliness that jobs restart if the user > enabled checkpointing and we make it clear that any ` > restart-strategy.fixed-delay.xyz` setting will only be respected if > `restart-strategy` has been set to `fixed-delay`. > > This simplification would, however, change Flink's behaviour and might > break existing setups. Since we introduced `RestartStrategies` with Flink > 1.0.0 and deprecated the prior configuration mechanism which enables > restarting if either the `attempts` or the `delay` has been set, I think > that the number of broken jobs should be minimal if not non-existent. > > I'm sure that one can simplify the way RestartStrategies are > programmatically configured as well but for the sake of simplicity/scoping > I'd like to not touch it right away. > > What do you think about this behaviour change? > > [1] https://issues.apache.org/jira/browse/FLINK-13921 > > Cheers, > Till > |
Also +1 in general.
I have a few questions though: - does it only apply to the logic in org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, which is only the cluster side configuration? Or do you want to change the logic also on the job side in ExecutionConfig? - if the latter, does that mean deprecated methods in ExecutionConfig like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no effect? I think this would be a good idea, but would suggest to remove the corresponding fields and methods. This is not that simple though. I tried to do that for other parameters that have no effect already like codeAnalysisMode & failTaskOnCheckpointError. The are two problems: 1) setNumberOfExecutionRetires are effectively marked with @Public annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have this problem). Therefore this would be a binary incompatible change. 2) ExecutionConfig is stored in state as part of PojoSerializer in pre flink 1.7. It should not be a problem for numberOfExecutionRetries & executionRetryDelays as they are of primitive types. It is a problem for codeAnalysisMode (we cannot remove the class, as this breaks serialization). I wanted to mention that anyway, just to be aware of that. Best, Dawid On 30/08/2019 14:48, Stephan Ewen wrote: > +1 in general > > What is the default in batch, though? No restarts? I always found that > somewhat uncommon. > Should we also change that part, if we are changing the default anyways? > > > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> wrote: > >> Hi everyone, >> >> I wanted to discuss how to simplify Flink's cluster level RestartStrategy >> configuration [1]. Currently, Flink's behaviour with respect to configuring >> the {{RestartStrategies}} is quite complicated and convoluted. The reason >> for this is that we evolved the way it has been configured and wanted to >> keep it backwards compatible. Due to this, we have currently the following >> behaviour: >> >> * If the config option `restart-strategy` is configured, then Flink uses >> this `RestartStrategy` (so far so simple) >> * If the config option `restart-strategy` is not configured, then >> ** If `restart-strategy.fixed-delay.attempts` or >> `restart-strategy.fixed-delay.delay` are defined, then instantiate >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, >> restart-strategy.fixed-delay.delay)` >> ** If `restart-strategy.fixed-delay.attempts` and >> `restart-strategy.fixed-delay.delay` are not defined, then >> *** If checkpointing is disabled, then choose `NoRestartStrategy` >> *** If checkpointing is enabled, then choose >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> >> I would like to simplify the configuration by removing the "If >> `restart-strategy.fixed-delay.attempts` or >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the logic >> would be the following: >> >> * If the config option `restart-strategy` is configured, then Flink uses >> this `RestartStrategy` >> * If the config option `restart-strategy` is not configured, then >> ** If checkpointing is disabled, then choose `NoRestartStrategy` >> ** If checkpointing is enabled, then choose >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> >> That way we retain the user friendliness that jobs restart if the user >> enabled checkpointing and we make it clear that any ` >> restart-strategy.fixed-delay.xyz` setting will only be respected if >> `restart-strategy` has been set to `fixed-delay`. >> >> This simplification would, however, change Flink's behaviour and might >> break existing setups. Since we introduced `RestartStrategies` with Flink >> 1.0.0 and deprecated the prior configuration mechanism which enables >> restarting if either the `attempts` or the `delay` has been set, I think >> that the number of broken jobs should be minimal if not non-existent. >> >> I'm sure that one can simplify the way RestartStrategies are >> programmatically configured as well but for the sake of simplicity/scoping >> I'd like to not touch it right away. >> >> What do you think about this behaviour change? >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 >> >> Cheers, >> Till >> signature.asc (849 bytes) Download Attachment |
The current default behaviour for batch is `NoRestartStrategy` if nothing
is configured. We could say that we set the default value of `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` independent of the checkpointing. The only downside I could see is that some faulty batch jobs might get stuck in a restart loop without reaching a terminal state. @Dawid, I don't intend to touch the ExecutionConfig. This change only targets the cluster level configuration of the RestartStrategy. Cheers, Till On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email]> wrote: > Also +1 in general. > > I have a few questions though: > > - does it only apply to the logic in > > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > which is only the cluster side configuration? Or do you want to change > the logic also on the job side in ExecutionConfig? > > - if the latter, does that mean deprecated methods in ExecutionConfig > like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no > effect? I think this would be a good idea, but would suggest to remove > the corresponding fields and methods. This is not that simple though. I > tried to do that for other parameters that have no effect already like > codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > > 1) setNumberOfExecutionRetires are effectively marked with @Public > annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have > this problem). Therefore this would be a binary incompatible change. > > 2) ExecutionConfig is stored in state as part of PojoSerializer in > pre flink 1.7. It should not be a problem for numberOfExecutionRetries & > executionRetryDelays as they are of primitive types. It is a problem for > codeAnalysisMode (we cannot remove the class, as this breaks > serialization). I wanted to mention that anyway, just to be aware of that. > > Best, > > Dawid > > On 30/08/2019 14:48, Stephan Ewen wrote: > > +1 in general > > > > What is the default in batch, though? No restarts? I always found that > > somewhat uncommon. > > Should we also change that part, if we are changing the default anyways? > > > > > > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> > wrote: > > > >> Hi everyone, > >> > >> I wanted to discuss how to simplify Flink's cluster level > RestartStrategy > >> configuration [1]. Currently, Flink's behaviour with respect to > configuring > >> the {{RestartStrategies}} is quite complicated and convoluted. The > reason > >> for this is that we evolved the way it has been configured and wanted to > >> keep it backwards compatible. Due to this, we have currently the > following > >> behaviour: > >> > >> * If the config option `restart-strategy` is configured, then Flink uses > >> this `RestartStrategy` (so far so simple) > >> * If the config option `restart-strategy` is not configured, then > >> ** If `restart-strategy.fixed-delay.attempts` or > >> `restart-strategy.fixed-delay.delay` are defined, then instantiate > >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > >> restart-strategy.fixed-delay.delay)` > >> ** If `restart-strategy.fixed-delay.attempts` and > >> `restart-strategy.fixed-delay.delay` are not defined, then > >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > >> *** If checkpointing is enabled, then choose > >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > >> > >> I would like to simplify the configuration by removing the "If > >> `restart-strategy.fixed-delay.attempts` or > >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the > logic > >> would be the following: > >> > >> * If the config option `restart-strategy` is configured, then Flink uses > >> this `RestartStrategy` > >> * If the config option `restart-strategy` is not configured, then > >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > >> ** If checkpointing is enabled, then choose > >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > >> > >> That way we retain the user friendliness that jobs restart if the user > >> enabled checkpointing and we make it clear that any ` > >> restart-strategy.fixed-delay.xyz` setting will only be respected if > >> `restart-strategy` has been set to `fixed-delay`. > >> > >> This simplification would, however, change Flink's behaviour and might > >> break existing setups. Since we introduced `RestartStrategies` with > Flink > >> 1.0.0 and deprecated the prior configuration mechanism which enables > >> restarting if either the `attempts` or the `delay` has been set, I think > >> that the number of broken jobs should be minimal if not non-existent. > >> > >> I'm sure that one can simplify the way RestartStrategies are > >> programmatically configured as well but for the sake of > simplicity/scoping > >> I'd like to not touch it right away. > >> > >> What do you think about this behaviour change? > >> > >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > >> > >> Cheers, > >> Till > >> > > |
After an offline discussion with Stephan, we concluded that changing the
default restart strategy for batch jobs is not that easy because the cluster level restart configuration does not necessarily know about the type of job which is submitted. We concluded that we would like to keep the batch behaviour as is (NoRestartStrategy) and revisit this issue at a later point in time. On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> wrote: > The current default behaviour for batch is `NoRestartStrategy` if nothing > is configured. We could say that we set the default value of > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > independent of the checkpointing. The only downside I could see is that > some faulty batch jobs might get stuck in a restart loop without reaching a > terminal state. > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > targets the cluster level configuration of the RestartStrategy. > > Cheers, > Till > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email]> > wrote: > >> Also +1 in general. >> >> I have a few questions though: >> >> - does it only apply to the logic in >> >> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, >> which is only the cluster side configuration? Or do you want to change >> the logic also on the job side in ExecutionConfig? >> >> - if the latter, does that mean deprecated methods in ExecutionConfig >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no >> effect? I think this would be a good idea, but would suggest to remove >> the corresponding fields and methods. This is not that simple though. I >> tried to do that for other parameters that have no effect already like >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: >> >> 1) setNumberOfExecutionRetires are effectively marked with @Public >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have >> this problem). Therefore this would be a binary incompatible change. >> >> 2) ExecutionConfig is stored in state as part of PojoSerializer in >> pre flink 1.7. It should not be a problem for numberOfExecutionRetries & >> executionRetryDelays as they are of primitive types. It is a problem for >> codeAnalysisMode (we cannot remove the class, as this breaks >> serialization). I wanted to mention that anyway, just to be aware of that. >> >> Best, >> >> Dawid >> >> On 30/08/2019 14:48, Stephan Ewen wrote: >> > +1 in general >> > >> > What is the default in batch, though? No restarts? I always found that >> > somewhat uncommon. >> > Should we also change that part, if we are changing the default anyways? >> > >> > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> >> wrote: >> > >> >> Hi everyone, >> >> >> >> I wanted to discuss how to simplify Flink's cluster level >> RestartStrategy >> >> configuration [1]. Currently, Flink's behaviour with respect to >> configuring >> >> the {{RestartStrategies}} is quite complicated and convoluted. The >> reason >> >> for this is that we evolved the way it has been configured and wanted >> to >> >> keep it backwards compatible. Due to this, we have currently the >> following >> >> behaviour: >> >> >> >> * If the config option `restart-strategy` is configured, then Flink >> uses >> >> this `RestartStrategy` (so far so simple) >> >> * If the config option `restart-strategy` is not configured, then >> >> ** If `restart-strategy.fixed-delay.attempts` or >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, >> >> restart-strategy.fixed-delay.delay)` >> >> ** If `restart-strategy.fixed-delay.attempts` and >> >> `restart-strategy.fixed-delay.delay` are not defined, then >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` >> >> *** If checkpointing is enabled, then choose >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> >> >> >> I would like to simplify the configuration by removing the "If >> >> `restart-strategy.fixed-delay.attempts` or >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the >> logic >> >> would be the following: >> >> >> >> * If the config option `restart-strategy` is configured, then Flink >> uses >> >> this `RestartStrategy` >> >> * If the config option `restart-strategy` is not configured, then >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` >> >> ** If checkpointing is enabled, then choose >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> >> >> >> That way we retain the user friendliness that jobs restart if the user >> >> enabled checkpointing and we make it clear that any ` >> >> restart-strategy.fixed-delay.xyz` setting will only be respected if >> >> `restart-strategy` has been set to `fixed-delay`. >> >> >> >> This simplification would, however, change Flink's behaviour and might >> >> break existing setups. Since we introduced `RestartStrategies` with >> Flink >> >> 1.0.0 and deprecated the prior configuration mechanism which enables >> >> restarting if either the `attempts` or the `delay` has been set, I >> think >> >> that the number of broken jobs should be minimal if not non-existent. >> >> >> >> I'm sure that one can simplify the way RestartStrategies are >> >> programmatically configured as well but for the sake of >> simplicity/scoping >> >> I'd like to not touch it right away. >> >> >> >> What do you think about this behaviour change? >> >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 >> >> >> >> Cheers, >> >> Till >> >> >> >> |
+1. The new behavior makes sense to me.
BTW, we need a FLIP for this :) On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> wrote: > After an offline discussion with Stephan, we concluded that changing the > default restart strategy for batch jobs is not that easy because the > cluster level restart configuration does not necessarily know about the > type of job which is submitted. We concluded that we would like to keep the > batch behaviour as is (NoRestartStrategy) and revisit this issue at a later > point in time. > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> > wrote: > > > The current default behaviour for batch is `NoRestartStrategy` if nothing > > is configured. We could say that we set the default value of > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 > s")` > > independent of the checkpointing. The only downside I could see is that > > some faulty batch jobs might get stuck in a restart loop without > reaching a > > terminal state. > > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > > targets the cluster level configuration of the RestartStrategy. > > > > Cheers, > > Till > > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <[hidden email] > > > > wrote: > > > >> Also +1 in general. > >> > >> I have a few questions though: > >> > >> - does it only apply to the logic in > >> > >> > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > >> which is only the cluster side configuration? Or do you want to change > >> the logic also on the job side in ExecutionConfig? > >> > >> - if the latter, does that mean deprecated methods in ExecutionConfig > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no > >> effect? I think this would be a good idea, but would suggest to remove > >> the corresponding fields and methods. This is not that simple though. I > >> tried to do that for other parameters that have no effect already like > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > >> > >> 1) setNumberOfExecutionRetires are effectively marked with @Public > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have > >> this problem). Therefore this would be a binary incompatible change. > >> > >> 2) ExecutionConfig is stored in state as part of PojoSerializer in > >> pre flink 1.7. It should not be a problem for numberOfExecutionRetries & > >> executionRetryDelays as they are of primitive types. It is a problem for > >> codeAnalysisMode (we cannot remove the class, as this breaks > >> serialization). I wanted to mention that anyway, just to be aware of > that. > >> > >> Best, > >> > >> Dawid > >> > >> On 30/08/2019 14:48, Stephan Ewen wrote: > >> > +1 in general > >> > > >> > What is the default in batch, though? No restarts? I always found that > >> > somewhat uncommon. > >> > Should we also change that part, if we are changing the default > anyways? > >> > > >> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email]> > >> wrote: > >> > > >> >> Hi everyone, > >> >> > >> >> I wanted to discuss how to simplify Flink's cluster level > >> RestartStrategy > >> >> configuration [1]. Currently, Flink's behaviour with respect to > >> configuring > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The > >> reason > >> >> for this is that we evolved the way it has been configured and wanted > >> to > >> >> keep it backwards compatible. Due to this, we have currently the > >> following > >> >> behaviour: > >> >> > >> >> * If the config option `restart-strategy` is configured, then Flink > >> uses > >> >> this `RestartStrategy` (so far so simple) > >> >> * If the config option `restart-strategy` is not configured, then > >> >> ** If `restart-strategy.fixed-delay.attempts` or > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > >> >> restart-strategy.fixed-delay.delay)` > >> >> ** If `restart-strategy.fixed-delay.attempts` and > >> >> `restart-strategy.fixed-delay.delay` are not defined, then > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > >> >> *** If checkpointing is enabled, then choose > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > >> >> > >> >> I would like to simplify the configuration by removing the "If > >> >> `restart-strategy.fixed-delay.attempts` or > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the > >> logic > >> >> would be the following: > >> >> > >> >> * If the config option `restart-strategy` is configured, then Flink > >> uses > >> >> this `RestartStrategy` > >> >> * If the config option `restart-strategy` is not configured, then > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > >> >> ** If checkpointing is enabled, then choose > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > >> >> > >> >> That way we retain the user friendliness that jobs restart if the > user > >> >> enabled checkpointing and we make it clear that any ` > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected if > >> >> `restart-strategy` has been set to `fixed-delay`. > >> >> > >> >> This simplification would, however, change Flink's behaviour and > might > >> >> break existing setups. Since we introduced `RestartStrategies` with > >> Flink > >> >> 1.0.0 and deprecated the prior configuration mechanism which enables > >> >> restarting if either the `attempts` or the `delay` has been set, I > >> think > >> >> that the number of broken jobs should be minimal if not non-existent. > >> >> > >> >> I'm sure that one can simplify the way RestartStrategies are > >> >> programmatically configured as well but for the sake of > >> simplicity/scoping > >> >> I'd like to not touch it right away. > >> >> > >> >> What do you think about this behaviour change? > >> >> > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > >> >> > >> >> Cheers, > >> >> Till > >> >> > >> > >> > |
+1 to simplify the RestartStrategy configuration
One thing to confirm is whether the default delay should be "0 s" in the case of "If the config option `restart-strategy` is not configured" and "If checkpointing is enabled". I see a related discussion([SURVEY] Is the default restart delay of 0s causing problems) is ongoing and we may need to take the result from that. Thanks, Zhu Zhu Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道: > +1. The new behavior makes sense to me. > > BTW, we need a FLIP for this :) > > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> > wrote: > > > After an offline discussion with Stephan, we concluded that changing the > > default restart strategy for batch jobs is not that easy because the > > cluster level restart configuration does not necessarily know about the > > type of job which is submitted. We concluded that we would like to keep > the > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a > later > > point in time. > > > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> > > wrote: > > > > > The current default behaviour for batch is `NoRestartStrategy` if > nothing > > > is configured. We could say that we set the default value of > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 > > s")` > > > independent of the checkpointing. The only downside I could see is that > > > some faulty batch jobs might get stuck in a restart loop without > > reaching a > > > terminal state. > > > > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > > > targets the cluster level configuration of the RestartStrategy. > > > > > > Cheers, > > > Till > > > > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < > [hidden email] > > > > > > wrote: > > > > > >> Also +1 in general. > > >> > > >> I have a few questions though: > > >> > > >> - does it only apply to the logic in > > >> > > >> > > > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > > >> which is only the cluster side configuration? Or do you want to change > > >> the logic also on the job side in ExecutionConfig? > > >> > > >> - if the latter, does that mean deprecated methods in ExecutionConfig > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no > > >> effect? I think this would be a good idea, but would suggest to remove > > >> the corresponding fields and methods. This is not that simple though. > I > > >> tried to do that for other parameters that have no effect already like > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > > >> > > >> 1) setNumberOfExecutionRetires are effectively marked with @Public > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't > have > > >> this problem). Therefore this would be a binary incompatible change. > > >> > > >> 2) ExecutionConfig is stored in state as part of PojoSerializer in > > >> pre flink 1.7. It should not be a problem for > numberOfExecutionRetries & > > >> executionRetryDelays as they are of primitive types. It is a problem > for > > >> codeAnalysisMode (we cannot remove the class, as this breaks > > >> serialization). I wanted to mention that anyway, just to be aware of > > that. > > >> > > >> Best, > > >> > > >> Dawid > > >> > > >> On 30/08/2019 14:48, Stephan Ewen wrote: > > >> > +1 in general > > >> > > > >> > What is the default in batch, though? No restarts? I always found > that > > >> > somewhat uncommon. > > >> > Should we also change that part, if we are changing the default > > anyways? > > >> > > > >> > > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email] > > > > >> wrote: > > >> > > > >> >> Hi everyone, > > >> >> > > >> >> I wanted to discuss how to simplify Flink's cluster level > > >> RestartStrategy > > >> >> configuration [1]. Currently, Flink's behaviour with respect to > > >> configuring > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The > > >> reason > > >> >> for this is that we evolved the way it has been configured and > wanted > > >> to > > >> >> keep it backwards compatible. Due to this, we have currently the > > >> following > > >> >> behaviour: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` (so far so simple) > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > > >> >> restart-strategy.fixed-delay.delay)` > > >> >> ** If `restart-strategy.fixed-delay.attempts` and > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> *** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> I would like to simplify the configuration by removing the "If > > >> >> `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, > the > > >> logic > > >> >> would be the following: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> ** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> That way we retain the user friendliness that jobs restart if the > > user > > >> >> enabled checkpointing and we make it clear that any ` > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected > if > > >> >> `restart-strategy` has been set to `fixed-delay`. > > >> >> > > >> >> This simplification would, however, change Flink's behaviour and > > might > > >> >> break existing setups. Since we introduced `RestartStrategies` with > > >> Flink > > >> >> 1.0.0 and deprecated the prior configuration mechanism which > enables > > >> >> restarting if either the `attempts` or the `delay` has been set, I > > >> think > > >> >> that the number of broken jobs should be minimal if not > non-existent. > > >> >> > > >> >> I'm sure that one can simplify the way RestartStrategies are > > >> >> programmatically configured as well but for the sake of > > >> simplicity/scoping > > >> >> I'd like to not touch it right away. > > >> >> > > >> >> What do you think about this behaviour change? > > >> >> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > > >> >> > > >> >> Cheers, > > >> >> Till > > >> >> > > >> > > >> > > > |
+1 for this proposal.
IMO, it not only simplifies the cluster configuration, but also seems more fit logic to not rely on some low-level speicific parameters to judge the upper-level strategy. It is also resonable to push forward the restart strategy configuration step by step for batch later. Best, Zhijiang ------------------------------------------------------------------ From:Zhu Zhu <[hidden email]> Send Time:2019年9月2日(星期一) 05:18 To:dev <[hidden email]> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration +1 to simplify the RestartStrategy configuration One thing to confirm is whether the default delay should be "0 s" in the case of "If the config option `restart-strategy` is not configured" and "If checkpointing is enabled". I see a related discussion([SURVEY] Is the default restart delay of 0s causing problems) is ongoing and we may need to take the result from that. Thanks, Zhu Zhu Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道: > +1. The new behavior makes sense to me. > > BTW, we need a FLIP for this :) > > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> > wrote: > > > After an offline discussion with Stephan, we concluded that changing the > > default restart strategy for batch jobs is not that easy because the > > cluster level restart configuration does not necessarily know about the > > type of job which is submitted. We concluded that we would like to keep > the > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a > later > > point in time. > > > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> > > wrote: > > > > > The current default behaviour for batch is `NoRestartStrategy` if > nothing > > > is configured. We could say that we set the default value of > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 > > s")` > > > independent of the checkpointing. The only downside I could see is that > > > some faulty batch jobs might get stuck in a restart loop without > > reaching a > > > terminal state. > > > > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > > > targets the cluster level configuration of the RestartStrategy. > > > > > > Cheers, > > > Till > > > > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < > [hidden email] > > > > > > wrote: > > > > > >> Also +1 in general. > > >> > > >> I have a few questions though: > > >> > > >> - does it only apply to the logic in > > >> > > >> > > > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > > >> which is only the cluster side configuration? Or do you want to change > > >> the logic also on the job side in ExecutionConfig? > > >> > > >> - if the latter, does that mean deprecated methods in ExecutionConfig > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no > > >> effect? I think this would be a good idea, but would suggest to remove > > >> the corresponding fields and methods. This is not that simple though. > I > > >> tried to do that for other parameters that have no effect already like > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > > >> > > >> 1) setNumberOfExecutionRetires are effectively marked with @Public > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't > have > > >> this problem). Therefore this would be a binary incompatible change. > > >> > > >> 2) ExecutionConfig is stored in state as part of PojoSerializer in > > >> pre flink 1.7. It should not be a problem for > numberOfExecutionRetries & > > >> executionRetryDelays as they are of primitive types. It is a problem > for > > >> codeAnalysisMode (we cannot remove the class, as this breaks > > >> serialization). I wanted to mention that anyway, just to be aware of > > that. > > >> > > >> Best, > > >> > > >> Dawid > > >> > > >> On 30/08/2019 14:48, Stephan Ewen wrote: > > >> > +1 in general > > >> > > > >> > What is the default in batch, though? No restarts? I always found > that > > >> > somewhat uncommon. > > >> > Should we also change that part, if we are changing the default > > anyways? > > >> > > > >> > > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[hidden email] > > > > >> wrote: > > >> > > > >> >> Hi everyone, > > >> >> > > >> >> I wanted to discuss how to simplify Flink's cluster level > > >> RestartStrategy > > >> >> configuration [1]. Currently, Flink's behaviour with respect to > > >> configuring > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The > > >> reason > > >> >> for this is that we evolved the way it has been configured and > wanted > > >> to > > >> >> keep it backwards compatible. Due to this, we have currently the > > >> following > > >> >> behaviour: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` (so far so simple) > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > > >> >> restart-strategy.fixed-delay.delay)` > > >> >> ** If `restart-strategy.fixed-delay.attempts` and > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> *** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> I would like to simplify the configuration by removing the "If > > >> >> `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, > the > > >> logic > > >> >> would be the following: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> ** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> That way we retain the user friendliness that jobs restart if the > > user > > >> >> enabled checkpointing and we make it clear that any ` > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected > if > > >> >> `restart-strategy` has been set to `fixed-delay`. > > >> >> > > >> >> This simplification would, however, change Flink's behaviour and > > might > > >> >> break existing setups. Since we introduced `RestartStrategies` with > > >> Flink > > >> >> 1.0.0 and deprecated the prior configuration mechanism which > enables > > >> >> restarting if either the `attempts` or the `delay` has been set, I > > >> think > > >> >> that the number of broken jobs should be minimal if not > non-existent. > > >> >> > > >> >> I'm sure that one can simplify the way RestartStrategies are > > >> >> programmatically configured as well but for the sake of > > >> simplicity/scoping > > >> >> I'd like to not touch it right away. > > >> >> > > >> >> What do you think about this behaviour change? > > >> >> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > > >> >> > > >> >> Cheers, > > >> >> Till > > >> >> > > >> > > >> > > > |
Thanks a lot for the positive feedback. I think you are right Becket that
this needs a FLIP since it changes Flink's behaviour. I'll create one and post it to the dev ML. @Zhu Zhu <[hidden email]> I agree that the restart delay is related to the RestartStrategy configuration. However, I would like to exclude it from this improvement proposal since it would broaden the scope unnecessarily. I'd suggest to continue the discussion about this on the SURVEY thread and then based on the outcome start a DISCUSS thread about the concrete changes to the default restart delay value. Since I will create a FLIP for the simplification logic, I'll conclude this thread and would encourage everyone to move all further discussion to the Flip DISCUSS thread. I'll post the link to it shortly. Cheers, Till On Mon, Sep 2, 2019 at 6:22 AM zhijiang <[hidden email]> wrote: > +1 for this proposal. > > IMO, it not only simplifies the cluster configuration, but also seems more > fit logic to not rely on some low-level speicific parameters to judge the > upper-level strategy. > It is also resonable to push forward the restart strategy configuration > step by step for batch later. > > Best, > Zhijiang > ------------------------------------------------------------------ > From:Zhu Zhu <[hidden email]> > Send Time:2019年9月2日(星期一) 05:18 > To:dev <[hidden email]> > Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy > configuration > > +1 to simplify the RestartStrategy configuration > > One thing to confirm is whether the default delay should be "0 s" in the > case of > "If the config option `restart-strategy` is not configured" and "If > checkpointing is enabled". > I see a related discussion([SURVEY] Is the default restart delay of 0s > causing problems) is ongoing and we may need to take the result from that. > > Thanks, > Zhu Zhu > > Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道: > > > +1. The new behavior makes sense to me. > > > > BTW, we need a FLIP for this :) > > > > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> > > wrote: > > > > > After an offline discussion with Stephan, we concluded that changing > the > > > default restart strategy for batch jobs is not that easy because the > > > cluster level restart configuration does not necessarily know about the > > > type of job which is submitted. We concluded that we would like to keep > > the > > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a > > later > > > point in time. > > > > > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> > > > wrote: > > > > > > > The current default behaviour for batch is `NoRestartStrategy` if > > nothing > > > > is configured. We could say that we set the default value of > > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, > "0 > > > s")` > > > > independent of the checkpointing. The only downside I could see is > that > > > > some faulty batch jobs might get stuck in a restart loop without > > > reaching a > > > > terminal state. > > > > > > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > > > > targets the cluster level configuration of the RestartStrategy. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < > > [hidden email] > > > > > > > > wrote: > > > > > > > >> Also +1 in general. > > > >> > > > >> I have a few questions though: > > > >> > > > >> - does it only apply to the logic in > > > >> > > > >> > > > > > > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > > > >> which is only the cluster side configuration? Or do you want to > change > > > >> the logic also on the job side in ExecutionConfig? > > > >> > > > >> - if the latter, does that mean deprecated methods in > ExecutionConfig > > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have > no > > > >> effect? I think this would be a good idea, but would suggest to > remove > > > >> the corresponding fields and methods. This is not that simple > though. > > I > > > >> tried to do that for other parameters that have no effect already > like > > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > > > >> > > > >> 1) setNumberOfExecutionRetires are effectively marked with > @Public > > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't > > have > > > >> this problem). Therefore this would be a binary incompatible change. > > > >> > > > >> 2) ExecutionConfig is stored in state as part of PojoSerializer > in > > > >> pre flink 1.7. It should not be a problem for > > numberOfExecutionRetries & > > > >> executionRetryDelays as they are of primitive types. It is a problem > > for > > > >> codeAnalysisMode (we cannot remove the class, as this breaks > > > >> serialization). I wanted to mention that anyway, just to be aware of > > > that. > > > >> > > > >> Best, > > > >> > > > >> Dawid > > > >> > > > >> On 30/08/2019 14:48, Stephan Ewen wrote: > > > >> > +1 in general > > > >> > > > > >> > What is the default in batch, though? No restarts? I always found > > that > > > >> > somewhat uncommon. > > > >> > Should we also change that part, if we are changing the default > > > anyways? > > > >> > > > > >> > > > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann < > [hidden email] > > > > > > >> wrote: > > > >> > > > > >> >> Hi everyone, > > > >> >> > > > >> >> I wanted to discuss how to simplify Flink's cluster level > > > >> RestartStrategy > > > >> >> configuration [1]. Currently, Flink's behaviour with respect to > > > >> configuring > > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. > The > > > >> reason > > > >> >> for this is that we evolved the way it has been configured and > > wanted > > > >> to > > > >> >> keep it backwards compatible. Due to this, we have currently the > > > >> following > > > >> >> behaviour: > > > >> >> > > > >> >> * If the config option `restart-strategy` is configured, then > Flink > > > >> uses > > > >> >> this `RestartStrategy` (so far so simple) > > > >> >> * If the config option `restart-strategy` is not configured, then > > > >> >> ** If `restart-strategy.fixed-delay.attempts` or > > > >> >> `restart-strategy.fixed-delay.delay` are defined, then > instantiate > > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > > > >> >> restart-strategy.fixed-delay.delay)` > > > >> >> ** If `restart-strategy.fixed-delay.attempts` and > > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then > > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > > > >> >> *** If checkpointing is enabled, then choose > > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > > >> >> > > > >> >> I would like to simplify the configuration by removing the "If > > > >> >> `restart-strategy.fixed-delay.attempts` or > > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, > > the > > > >> logic > > > >> >> would be the following: > > > >> >> > > > >> >> * If the config option `restart-strategy` is configured, then > Flink > > > >> uses > > > >> >> this `RestartStrategy` > > > >> >> * If the config option `restart-strategy` is not configured, then > > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > > > >> >> ** If checkpointing is enabled, then choose > > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > > >> >> > > > >> >> That way we retain the user friendliness that jobs restart if the > > > user > > > >> >> enabled checkpointing and we make it clear that any ` > > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected > > if > > > >> >> `restart-strategy` has been set to `fixed-delay`. > > > >> >> > > > >> >> This simplification would, however, change Flink's behaviour and > > > might > > > >> >> break existing setups. Since we introduced `RestartStrategies` > with > > > >> Flink > > > >> >> 1.0.0 and deprecated the prior configuration mechanism which > > enables > > > >> >> restarting if either the `attempts` or the `delay` has been set, > I > > > >> think > > > >> >> that the number of broken jobs should be minimal if not > > non-existent. > > > >> >> > > > >> >> I'm sure that one can simplify the way RestartStrategies are > > > >> >> programmatically configured as well but for the sake of > > > >> simplicity/scoping > > > >> >> I'd like to not touch it right away. > > > >> >> > > > >> >> What do you think about this behaviour change? > > > >> >> > > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > > > >> >> > > > >> >> Cheers, > > > >> >> Till > > > >> >> > > > >> > > > >> > > > > > > > |
The link to the dev ML discussion about FLIP-61 is
https://lists.apache.org/thread.html/e206390127bcbd9b24d9c41a838faa75157e468e01552ad241e3e24b@%3Cdev.flink.apache.org%3E Cheers, Till On Mon, Sep 2, 2019 at 10:37 PM Till Rohrmann <[hidden email]> wrote: > Thanks a lot for the positive feedback. I think you are right Becket that > this needs a FLIP since it changes Flink's behaviour. I'll create one and > post it to the dev ML. > > @Zhu Zhu <[hidden email]> I agree that the restart delay is related > to the RestartStrategy configuration. However, I would like to exclude it > from this improvement proposal since it would broaden the scope > unnecessarily. I'd suggest to continue the discussion about this on the > SURVEY thread and then based on the outcome start a DISCUSS thread about > the concrete changes to the default restart delay value. > > Since I will create a FLIP for the simplification logic, I'll conclude > this thread and would encourage everyone to move all further discussion to > the Flip DISCUSS thread. I'll post the link to it shortly. > > Cheers, > Till > > On Mon, Sep 2, 2019 at 6:22 AM zhijiang <[hidden email]> > wrote: > >> +1 for this proposal. >> >> IMO, it not only simplifies the cluster configuration, but also seems >> more fit logic to not rely on some low-level speicific parameters to judge >> the upper-level strategy. >> It is also resonable to push forward the restart strategy configuration >> step by step for batch later. >> >> Best, >> Zhijiang >> ------------------------------------------------------------------ >> From:Zhu Zhu <[hidden email]> >> Send Time:2019年9月2日(星期一) 05:18 >> To:dev <[hidden email]> >> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy >> configuration >> >> +1 to simplify the RestartStrategy configuration >> >> One thing to confirm is whether the default delay should be "0 s" in the >> case of >> "If the config option `restart-strategy` is not configured" and "If >> checkpointing is enabled". >> I see a related discussion([SURVEY] Is the default restart delay of 0s >> causing problems) is ongoing and we may need to take the result from that. >> >> Thanks, >> Zhu Zhu >> >> Becket Qin <[hidden email]> 于2019年9月2日周一 上午9:06写道: >> >> > +1. The new behavior makes sense to me. >> > >> > BTW, we need a FLIP for this :) >> > >> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[hidden email]> >> > wrote: >> > >> > > After an offline discussion with Stephan, we concluded that changing >> the >> > > default restart strategy for batch jobs is not that easy because the >> > > cluster level restart configuration does not necessarily know about >> the >> > > type of job which is submitted. We concluded that we would like to >> keep >> > the >> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a >> > later >> > > point in time. >> > > >> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[hidden email]> >> > > wrote: >> > > >> > > > The current default behaviour for batch is `NoRestartStrategy` if >> > nothing >> > > > is configured. We could say that we set the default value of >> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, >> "0 >> > > s")` >> > > > independent of the checkpointing. The only downside I could see is >> that >> > > > some faulty batch jobs might get stuck in a restart loop without >> > > reaching a >> > > > terminal state. >> > > > >> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change >> only >> > > > targets the cluster level configuration of the RestartStrategy. >> > > > >> > > > Cheers, >> > > > Till >> > > > >> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < >> > [hidden email] >> > > > >> > > > wrote: >> > > > >> > > >> Also +1 in general. >> > > >> >> > > >> I have a few questions though: >> > > >> >> > > >> - does it only apply to the logic in >> > > >> >> > > >> >> > > >> > >> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, >> > > >> which is only the cluster side configuration? Or do you want to >> change >> > > >> the logic also on the job side in ExecutionConfig? >> > > >> >> > > >> - if the latter, does that mean deprecated methods in >> ExecutionConfig >> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will >> have no >> > > >> effect? I think this would be a good idea, but would suggest to >> remove >> > > >> the corresponding fields and methods. This is not that simple >> though. >> > I >> > > >> tried to do that for other parameters that have no effect already >> like >> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: >> > > >> >> > > >> 1) setNumberOfExecutionRetires are effectively marked with >> @Public >> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't >> > have >> > > >> this problem). Therefore this would be a binary incompatible >> change. >> > > >> >> > > >> 2) ExecutionConfig is stored in state as part of >> PojoSerializer in >> > > >> pre flink 1.7. It should not be a problem for >> > numberOfExecutionRetries & >> > > >> executionRetryDelays as they are of primitive types. It is a >> problem >> > for >> > > >> codeAnalysisMode (we cannot remove the class, as this breaks >> > > >> serialization). I wanted to mention that anyway, just to be aware >> of >> > > that. >> > > >> >> > > >> Best, >> > > >> >> > > >> Dawid >> > > >> >> > > >> On 30/08/2019 14:48, Stephan Ewen wrote: >> > > >> > +1 in general >> > > >> > >> > > >> > What is the default in batch, though? No restarts? I always found >> > that >> > > >> > somewhat uncommon. >> > > >> > Should we also change that part, if we are changing the default >> > > anyways? >> > > >> > >> > > >> > >> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann < >> [hidden email] >> > > >> > > >> wrote: >> > > >> > >> > > >> >> Hi everyone, >> > > >> >> >> > > >> >> I wanted to discuss how to simplify Flink's cluster level >> > > >> RestartStrategy >> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to >> > > >> configuring >> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. >> The >> > > >> reason >> > > >> >> for this is that we evolved the way it has been configured and >> > wanted >> > > >> to >> > > >> >> keep it backwards compatible. Due to this, we have currently the >> > > >> following >> > > >> >> behaviour: >> > > >> >> >> > > >> >> * If the config option `restart-strategy` is configured, then >> Flink >> > > >> uses >> > > >> >> this `RestartStrategy` (so far so simple) >> > > >> >> * If the config option `restart-strategy` is not configured, >> then >> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or >> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then >> instantiate >> > > >> >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, >> > > >> >> restart-strategy.fixed-delay.delay)` >> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and >> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then >> > > >> >> *** If checkpointing is disabled, then choose >> `NoRestartStrategy` >> > > >> >> *** If checkpointing is enabled, then choose >> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> > > >> >> >> > > >> >> I would like to simplify the configuration by removing the "If >> > > >> >> `restart-strategy.fixed-delay.attempts` or >> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, >> > the >> > > >> logic >> > > >> >> would be the following: >> > > >> >> >> > > >> >> * If the config option `restart-strategy` is configured, then >> Flink >> > > >> uses >> > > >> >> this `RestartStrategy` >> > > >> >> * If the config option `restart-strategy` is not configured, >> then >> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` >> > > >> >> ** If checkpointing is enabled, then choose >> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> > > >> >> >> > > >> >> That way we retain the user friendliness that jobs restart if >> the >> > > user >> > > >> >> enabled checkpointing and we make it clear that any ` >> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be >> respected >> > if >> > > >> >> `restart-strategy` has been set to `fixed-delay`. >> > > >> >> >> > > >> >> This simplification would, however, change Flink's behaviour and >> > > might >> > > >> >> break existing setups. Since we introduced `RestartStrategies` >> with >> > > >> Flink >> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which >> > enables >> > > >> >> restarting if either the `attempts` or the `delay` has been >> set, I >> > > >> think >> > > >> >> that the number of broken jobs should be minimal if not >> > non-existent. >> > > >> >> >> > > >> >> I'm sure that one can simplify the way RestartStrategies are >> > > >> >> programmatically configured as well but for the sake of >> > > >> simplicity/scoping >> > > >> >> I'd like to not touch it right away. >> > > >> >> >> > > >> >> What do you think about this behaviour change? >> > > >> >> >> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 >> > > >> >> >> > > >> >> Cheers, >> > > >> >> Till >> > > >> >> >> > > >> >> > > >> >> > > >> > >> >> |
Free forum by Nabble | Edit this page |