[VOTE] Release 1.10.0, release candidate #1

Re: [VOTE] Release 1.10.0, release candidate #1

Thomas Weise
Hi Gary,

Thanks for the reply.

-->

On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:

> Hi Thomas,
>
> > 2) Was there a change in how job recovery reflects in the uptime metric?
> > Didn't uptime previously reset to 0 on recovery (now it just keeps
> > increasing)?
>
> The uptime is the difference between the current time and the time when the
> job transitioned to RUNNING state. By default we no longer transition the job
> out of the RUNNING state when restarting. This has something to do with the
> new scheduler which enables pipelined region failover by default [1]. Actually
> we enabled pipelined region failover already in the binary distribution of
> Flink 1.9 by setting:
>
>     jobmanager.execution.failover-strategy: region
>
> in the default flink-conf.yaml. Unless you have removed this config option or
> you are using a custom yaml, you should be seeing this behavior in Flink 1.9.
> If you do not want region failover, set
>
>     jobmanager.execution.failover-strategy: full
>
>
We are using the default (the jobmanager.execution.failover-strategy
setting is not present in our flink config).

The change in behavior I see is between the 1.9 based deployment and the
1.10 RC.

Our 1.9 branch is here: https://github.com/lyft/flink/tree/release-1.9-lyft

I also notice that the exception causing a restart is no longer displayed
in the UI, which is probably related?


>
> > 1) Is the low watermark display in the UI still broken?
>
> I was not aware that this is broken. Is there an issue tracking this bug?
>

The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470

(I don't have a good way to verify it is fixed at the moment.)

Another problem with this 1.10 RC is that the checkpointAlignmentTime
metric is missing. (I have not been able to investigate this further yet.)


>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-14651
>
> On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
>
>> I opened a PR for FLINK-15868
>> <https://issues.apache.org/jira/browse/FLINK-15868>:
>> https://github.com/apache/flink/pull/11006
>>
>> With that change, I was able to run an application that consumes from
>> Kinesis.
>>
>> I should have data tomorrow regarding the performance.
>>
>> Two questions/observations:
>>
>> 1) Is the low watermark display in the UI still broken?
>> 2) Was there a change in how job recovery reflects in the uptime metric?
>> Didn't uptime previously reset to 0 on recovery (now it just keeps
>> increasing)?
>>
>> Thanks,
>> Thomas
>>
>>
>>
>>
>> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
>>
>> > I found another issue with the Kinesis connector:
>> >
>> > https://issues.apache.org/jira/browse/FLINK-15868
>> >
>> >
>> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> I am hereby canceling the vote due to:
>> >>
>> >>     FLINK-15837
>> >>     FLINK-15840
>> >>
>> >> Another RC will be created later today.
>> >>
>> >> Best,
>> >> Gary
>> >>
>> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]> wrote:
>> >>
>> >> > Hi everyone,
>> >> > Please review and vote on the release candidate #1 for the version 1.10.0,
>> >> > as follows:
>> >> > [ ] +1, Approve the release
>> >> > [ ] -1, Do not approve the release (please provide specific comments)
>> >> >
>> >> >
>> >> > The complete staging area is available for your review, which includes:
>> >> > * JIRA release notes [1],
>> >> > * the official Apache source release and binary convenience releases to be
>> >> > deployed to dist.apache.org [2], which are signed with the key with
>> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>> >> > * all artifacts to be deployed to the Maven Central Repository [4],
>> >> > * source code tag "release-1.10.0-rc1" [5],
>> >> >
>> >> > The announcement blog post is in the works. I will update this voting
>> >> > thread with a link to the pull request soon.
>> >> >
>> >> > The vote will be open for at least 72 hours. It is adopted by majority
>> >> > approval, with at least 3 PMC affirmative votes.
>> >> >
>> >> > Thanks,
>> >> > Yu & Gary
>> >> >
>> >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
>> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>> >> >
>> >>
>> >
>>
>

Re: [VOTE] Release 1.10.0, release candidate #1

Zhijiang(wangzhijiang999)
Hi Thomas,

The reason for the missing barrier alignment metric has been found, and I created a ticket [1] to track the progress. I expect it will be fixed soon. Thanks for reporting this.

[1] https://issues.apache.org/jira/browse/FLINK-15914

Best,
Zhijiang
------------------------------------------------------------------
From:Thomas Weise <[hidden email]>
Sent At:2020 Feb. 5 (Wed.) 14:04
To:Gary Yao <[hidden email]>; dev <[hidden email]>
Subject:Re: [VOTE] Release 1.10.0, release candidate #1

Hi Gary,

Thanks for the reply.

-->

On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:

> Hi Thomas,
>
> > 2) Was there a change in how job recovery reflects in the uptime metric?
> > Didn't uptime previously reset to 0 on recovery (now it just keeps
> > increasing)?
>
> The uptime is the difference between the current time and the time when the
> job transitioned to RUNNING state. By default we no longer transition the
> job
> out of the RUNNING state when restarting. This has something to do with the
> new scheduler which enables pipelined region failover by default [1].
> Actually
> we enabled pipelined region failover already in the binary distribution of
> Flink 1.9 by setting:
>
>     jobmanager.execution.failover-strategy: region
>
> in the default flink-conf.yaml. Unless you have removed this config option
> or
> you are using a custom yaml, you should be seeing this behavior in Flink
> 1.9.
> If you do not want region failover, set
>
>     jobmanager.execution.failover-strategy: full
>
>
We are using the default (the jobmanager.execution.failover-strategy
setting is not present in our flink config).

The change in behavior I see is between the 1.9 based deployment and the
1.10 RC.

Our 1.9 branch is here: https://github.com/lyft/flink/tree/release-1.9-lyft

I also notice that the exception causing a restart is no longer displayed
in the UI, which is probably related?


>
> > 1) Is the low watermark display in the UI still broken?
>
> I was not aware that this is broken. Is there an issue tracking this bug?
>

The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470

(I don't have a good way to verify it is fixed at the moment.)

Another problem with this 1.10 RC is that the checkpointAlignmentTime
metric is missing. (I have not been able to investigate this further yet.)


>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-14651
>
> On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
>
>> I opened a PR for FLINK-15868
>> <https://issues.apache.org/jira/browse/FLINK-15868>:
>> https://github.com/apache/flink/pull/11006
>>
>> With that change, I was able to run an application that consumes from
>> Kinesis.
>>
>> I should have data tomorrow regarding the performance.
>>
>> Two questions/observations:
>>
>> 1) Is the low watermark display in the UI still broken?
>> 2) Was there a change in how job recovery reflects in the uptime metric?
>> Didn't uptime previously reset to 0 on recovery (now it just keeps
>> increasing)?
>>
>> Thanks,
>> Thomas
>>
>>
>>
>>
>> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
>>
>> > I found another issue with the Kinesis connector:
>> >
>> > https://issues.apache.org/jira/browse/FLINK-15868
>> >
>> >
>> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
>> >
>> >> Hi everyone,
>> >>
>> >> I am hereby canceling the vote due to:
>> >>
>> >>     FLINK-15837
>> >>     FLINK-15840
>> >>
>> >> Another RC will be created later today.
>> >>
>> >> Best,
>> >> Gary
>> >>
>> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]> wrote:
>> >>
>> >> > Hi everyone,
>> >> > Please review and vote on the release candidate #1 for the version
>> >> 1.10.0,
>> >> > as follows:
>> >> > [ ] +1, Approve the release
>> >> > [ ] -1, Do not approve the release (please provide specific comments)
>> >> >
>> >> >
>> >> > The complete staging area is available for your review, which
>> includes:
>> >> > * JIRA release notes [1],
>> >> > * the official Apache source release and binary convenience releases
>> to
>> >> be
>> >> > deployed to dist.apache.org [2], which are signed with the key with
>> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>> >> > * all artifacts to be deployed to the Maven Central Repository [4],
>> >> > * source code tag "release-1.10.0-rc1" [5],
>> >> >
>> >> > The announcement blog post is in the works. I will update this voting
>> >> > thread with a link to the pull request soon.
>> >> >
>> >> > The vote will be open for at least 72 hours. It is adopted by
>> majority
>> >> > approval, with at least 3 PMC affirmative votes.
>> >> >
>> >> > Thanks,
>> >> > Yu & Gary
>> >> >
>> >> > [1]
>> >> >
>> >>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >> > [4]
>> >> https://repository.apache.org/content/repositories/orgapacheflink-1325
>> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>> >> >
>> >>
>> >
>>
>


Re: [VOTE] Release 1.10.0, release candidate #1

Gary Yao-4
In reply to this post by Thomas Weise
> I also notice that the exception causing a restart is no longer displayed
> in the UI, which is probably related?

Yes, this is also related to the new scheduler. I created FLINK-15917 [1] to
track this. Moreover, I created a ticket about the uptime metric not resetting
[2]. Both issues already exist in 1.9 if
"jobmanager.execution.failover-strategy" is set to "region", which is the case
in the default flink-conf.yaml.

In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to fall
back to the previous behavior.

In 1.10, you can still fall back to the previous behavior by setting
"jobmanager.scheduler: legacy" and unsetting
"jobmanager.execution.failover-strategy" in your flink-conf.yaml.

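As a rough sketch of the 1.10 fallback described above, the relevant part of
flink-conf.yaml might look like the fragment below. It assumes the rest of the
file keeps its existing settings; the commented-out line only illustrates what
"unsetting" the option means here.

```yaml
# Sketch only: fall back to the pre-1.10 scheduling behavior.
jobmanager.scheduler: legacy

# Leave the failover strategy unset, i.e. remove the line or comment it out:
# jobmanager.execution.failover-strategy: region
```
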
I would not consider these issues blockers since there is a workaround for
them, but of course we would like to see the new scheduler getting some
production exposure. More detailed release notes about the caveats of the new
scheduler will be added to the user documentation.


> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470

This should be fixed now [3].


[1] https://issues.apache.org/jira/browse/FLINK-15917
[2] https://issues.apache.org/jira/browse/FLINK-15918
[3] https://issues.apache.org/jira/browse/FLINK-8949

On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <[hidden email]> wrote:

> Hi Gary,
>
> Thanks for the reply.
>
> -->
>
> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:
>
> > Hi Thomas,
> >
> > > 2) Was there a change in how job recovery reflects in the uptime
> metric?
> > > Didn't uptime previously reset to 0 on recovery (now it just keeps
> > > increasing)?
> >
> > The uptime is the difference between the current time and the time when
> the
> > job transitioned to RUNNING state. By default we no longer transition the
> > job
> > out of the RUNNING state when restarting. This has something to do with
> the
> > new scheduler which enables pipelined region failover by default [1].
> > Actually
> > we enabled pipelined region failover already in the binary distribution
> of
> > Flink 1.9 by setting:
> >
> >     jobmanager.execution.failover-strategy: region
> >
> > in the default flink-conf.yaml. Unless you have removed this config
> option
> > or
> > you are using a custom yaml, you should be seeing this behavior in Flink
> > 1.9.
> > If you do not want region failover, set
> >
> >     jobmanager.execution.failover-strategy: full
> >
> >
> We are using the default (the jobmanager.execution.failover-strategy
> setting is not present in our flink config).
>
> The change in behavior I see is between the 1.9 based deployment and the
> 1.10 RC.
>
> Our 1.9 branch is here:
> https://github.com/lyft/flink/tree/release-1.9-lyft
>
> I also notice that the exception causing a restart is no longer displayed
> in the UI, which is probably related?
>
>
> >
> > > 1) Is the low watermark display in the UI still broken?
> >
> > I was not aware that this is broken. Is there an issue tracking this bug?
> >
>
> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>
> (I don't have a good way to verify it is fixed at the moment.)
>
> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> metric is missing. (I have not been able to investigate this further yet.)
>
>
> >
> > Best,
> > Gary
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> >
> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
> >
> >> I opened a PR for FLINK-15868
> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> >> https://github.com/apache/flink/pull/11006
> >>
> >> With that change, I was able to run an application that consumes from
> >> Kinesis.
> >>
> >> I should have data tomorrow regarding the performance.
> >>
> >> Two questions/observations:
> >>
> >> 1) Is the low watermark display in the UI still broken?
> >> 2) Was there a change in how job recovery reflects in the uptime metric?
> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
> >> increasing)?
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >>
> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
> >>
> >> > I found another issue with the Kinesis connector:
> >> >
> >> > https://issues.apache.org/jira/browse/FLINK-15868
> >> >
> >> >
> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
> >> >
> >> >> Hi everyone,
> >> >>
> >> >> I am hereby canceling the vote due to:
> >> >>
> >> >>     FLINK-15837
> >> >>     FLINK-15840
> >> >>
> >> >> Another RC will be created later today.
> >> >>
> >> >> Best,
> >> >> Gary
> >> >>
> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]> wrote:
> >> >>
> >> >> > Hi everyone,
> >> >> > Please review and vote on the release candidate #1 for the version
> >> >> 1.10.0,
> >> >> > as follows:
> >> >> > [ ] +1, Approve the release
> >> >> > [ ] -1, Do not approve the release (please provide specific
> comments)
> >> >> >
> >> >> >
> >> >> > The complete staging area is available for your review, which
> >> includes:
> >> >> > * JIRA release notes [1],
> >> >> > * the official Apache source release and binary convenience
> releases
> >> to
> >> >> be
> >> >> > deployed to dist.apache.org [2], which are signed with the key
> with
> >> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> >> >> > * all artifacts to be deployed to the Maven Central Repository [4],
> >> >> > * source code tag "release-1.10.0-rc1" [5],
> >> >> >
> >> >> > The announcement blog post is in the works. I will update this
> voting
> >> >> > thread with a link to the pull request soon.
> >> >> >
> >> >> > The vote will be open for at least 72 hours. It is adopted by
> >> majority
> >> >> > approval, with at least 3 PMC affirmative votes.
> >> >> >
> >> >> > Thanks,
> >> >> > Yu & Gary
> >> >> >
> >> >> > [1]
> >> >> >
> >> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> >> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >> > [4]
> >> >>
> https://repository.apache.org/content/repositories/orgapacheflink-1325
> >> >> > [5]
> https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: [VOTE] Release 1.10.0, release candidate #1

Thomas Weise
Hi Gary,

Thanks for the clarification!

When we upgrade to a new Flink release, we don't start with a default
flink-conf.yaml but upgrade our existing tooling and configuration.
Therefore we noticed this issue as part of the upgrade to 1.10, and not when
we upgraded to 1.9.

I would expect many other users to be in the same camp. Should we therefore
consider making these regressions a blocker for 1.10?

Thanks,
Thomas


On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <[hidden email]> wrote:

> > I also notice that the exception causing a restart is no longer displayed
> > in the UI, which is probably related?
>
> Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
> to
> track this. Moreover, I created a ticket about the uptime metric not
> resetting
> [2]. Both issues already exist in 1.9 if
> "jobmanager.execution.failover-strategy" is set to "region", which is the
> case
> in the default flink-conf.yaml.
>
> In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
> fall
> back to the previous behavior.
>
> In 1.10, you can still fall back to the previous behavior by setting
> "jobmanager.scheduler: legacy" and unsetting
> "jobmanager.execution.failover-strategy" in your flink-conf.yaml
>
> I would not consider these issues blockers since there is a workaround for
> them, but of course we would like to see the new scheduler getting some
> production exposure. More detailed release notes about the caveats of the
> new
> scheduler will be added to the user documentation.
>
>
> > The watermark issue was
> https://issues.apache.org/jira/browse/FLINK-14470
>
> This should be fixed now [3].
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-15917
> [2] https://issues.apache.org/jira/browse/FLINK-15918
> [3] https://issues.apache.org/jira/browse/FLINK-8949
>
> On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <[hidden email]> wrote:
>
>> Hi Gary,
>>
>> Thanks for the reply.
>>
>> -->
>>
>> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:
>>
>> > Hi Thomas,
>> >
>> > > 2) Was there a change in how job recovery reflects in the uptime
>> metric?
>> > > Didn't uptime previously reset to 0 on recovery (now it just keeps
>> > > increasing)?
>> >
>> > The uptime is the difference between the current time and the time when
>> the
>> > job transitioned to RUNNING state. By default we no longer transition
>> the
>> > job
>> > out of the RUNNING state when restarting. This has something to do with
>> the
>> > new scheduler which enables pipelined region failover by default [1].
>> > Actually
>> > we enabled pipelined region failover already in the binary distribution
>> of
>> > Flink 1.9 by setting:
>> >
>> >     jobmanager.execution.failover-strategy: region
>> >
>> > in the default flink-conf.yaml. Unless you have removed this config
>> option
>> > or
>> > you are using a custom yaml, you should be seeing this behavior in Flink
>> > 1.9.
>> > If you do not want region failover, set
>> >
>> >     jobmanager.execution.failover-strategy: full
>> >
>> >
>> We are using the default (the jobmanager.execution.failover-strategy
>> setting is not present in our flink config).
>>
>> The change in behavior I see is between the 1.9 based deployment and the
>> 1.10 RC.
>>
>> Our 1.9 branch is here:
>> https://github.com/lyft/flink/tree/release-1.9-lyft
>>
>> I also notice that the exception causing a restart is no longer displayed
>> in the UI, which is probably related?
>>
>>
>> >
>> > > 1) Is the low watermark display in the UI still broken?
>> >
>> > I was not aware that this is broken. Is there an issue tracking this
>> bug?
>> >
>>
>> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>>
>> (I don't have a good way to verify it is fixed at the moment.)
>>
>> Another problem with this 1.10 RC is that the checkpointAlignmentTime
>> metric is missing. (I have not been able to investigate this further yet.)
>>
>>
>> >
>> > Best,
>> > Gary
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-14651
>> >
>> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
>> >
>> >> I opened a PR for FLINK-15868
>> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
>> >> https://github.com/apache/flink/pull/11006
>> >>
>> >> With that change, I was able to run an application that consumes from
>> >> Kinesis.
>> >>
>> >> I should have data tomorrow regarding the performance.
>> >>
>> >> Two questions/observations:
>> >>
>> >> 1) Is the low watermark display in the UI still broken?
>> >> 2) Was there a change in how job recovery reflects in the uptime
>> metric?
>> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
>> >> increasing)?
>> >>
>> >> Thanks,
>> >> Thomas
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
>> >>
>> >> > I found another issue with the Kinesis connector:
>> >> >
>> >> > https://issues.apache.org/jira/browse/FLINK-15868
>> >> >
>> >> >
>> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
>> >> >
>> >> >> Hi everyone,
>> >> >>
>> >> >> I am hereby canceling the vote due to:
>> >> >>
>> >> >>     FLINK-15837
>> >> >>     FLINK-15840
>> >> >>
>> >> >> Another RC will be created later today.
>> >> >>
>> >> >> Best,
>> >> >> Gary
>> >> >>
>> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]> wrote:
>> >> >>
>> >> >> > Hi everyone,
>> >> >> > Please review and vote on the release candidate #1 for the version
>> >> >> 1.10.0,
>> >> >> > as follows:
>> >> >> > [ ] +1, Approve the release
>> >> >> > [ ] -1, Do not approve the release (please provide specific
>> comments)
>> >> >> >
>> >> >> >
>> >> >> > The complete staging area is available for your review, which
>> >> includes:
>> >> >> > * JIRA release notes [1],
>> >> >> > * the official Apache source release and binary convenience
>> releases
>> >> to
>> >> >> be
>> >> >> > deployed to dist.apache.org [2], which are signed with the key
>> with
>> >> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>> >> >> > * all artifacts to be deployed to the Maven Central Repository
>> [4],
>> >> >> > * source code tag "release-1.10.0-rc1" [5],
>> >> >> >
>> >> >> > The announcement blog post is in the works. I will update this
>> voting
>> >> >> > thread with a link to the pull request soon.
>> >> >> >
>> >> >> > The vote will be open for at least 72 hours. It is adopted by
>> >> majority
>> >> >> > approval, with at least 3 PMC affirmative votes.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Yu & Gary
>> >> >> >
>> >> >> > [1]
>> >> >> >
>> >> >>
>> >>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>> >> >> > [2]
>> https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >> >> > [4]
>> >> >>
>> https://repository.apache.org/content/repositories/orgapacheflink-1325
>> >> >> > [5]
>> https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>

Re: [VOTE] Release 1.10.0, release candidate #1

Stephan Ewen
Should we make these a blocker? I am not sure - we could also clearly state
in the release notes how to restore the old behavior, if your setup assumes
that behavior.

Release candidates for this release have been out since mid-December, so it is
a bit unfortunate that these issues are being raised so late.
Having these rather open-ended tickets (how to redefine the existing
metrics in the new scheduler/failover handling) as release blockers now
would mean that 1.10 is delayed indefinitely.

Are we sure we want to do that?

On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <[hidden email]> wrote:

> Hi Gary,
>
> Thanks for the clarification!
>
> When we upgrade to a new Flink release, we don't start with a default
> flink-conf.yaml but upgrade our existing tooling and configuration.
> Therefore we notice this issue as part of the upgrade to 1.10, and not when
> we upgraded to 1.9.
>
> I would expect many other users to be in the same camp, and therefore
> consider making these regressions a blocker for 1.10?
>
> Thanks,
> Thomas
>
>
> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <[hidden email]> wrote:
>
> > > > I also notice that the exception causing a restart is no longer displayed
> > > in the UI, which is probably related?
> >
> > Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
> > to
> > track this. Moreover, I created a ticket about the uptime metric not
> > resetting
> > [2]. Both issues already exist in 1.9 if
> > "jobmanager.execution.failover-strategy" is set to "region", which is the
> > case
> > in the default flink-conf.yaml.
> >
> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
> > fall
> > back to the previous behavior.
> >
> > In 1.10, you can still fall back to the previous behavior by setting
> > "jobmanager.scheduler: legacy" and unsetting
> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml
> >
> > I would not consider these issues blockers since there is a workaround
> for
> > them, but of course we would like to see the new scheduler getting some
> > production exposure. More detailed release notes about the caveats of the
> > new
> > scheduler will be added to the user documentation.
> >
> >
> > > The watermark issue was
> > https://issues.apache.org/jira/browse/FLINK-14470
> >
> > This should be fixed now [3].
> >
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-15917
> > [2] https://issues.apache.org/jira/browse/FLINK-15918
> > [3] https://issues.apache.org/jira/browse/FLINK-8949
> >
> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <[hidden email]> wrote:
> >
> >> Hi Gary,
> >>
> >> Thanks for the reply.
> >>
> >> -->
> >>
> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:
> >>
> >> > Hi Thomas,
> >> >
> >> > > 2) Was there a change in how job recovery reflects in the uptime
> >> metric?
> >> > > Didn't uptime previously reset to 0 on recovery (now it just keeps
> >> > > increasing)?
> >> >
> >> > The uptime is the difference between the current time and the time
> when
> >> the
> >> > job transitioned to RUNNING state. By default we no longer transition
> >> the
> >> > job
> >> > out of the RUNNING state when restarting. This has something to do
> with
> >> the
> >> > new scheduler which enables pipelined region failover by default [1].
> >> > Actually
> >> > we enabled pipelined region failover already in the binary
> distribution
> >> of
> >> > Flink 1.9 by setting:
> >> >
> >> >     jobmanager.execution.failover-strategy: region
> >> >
> >> > in the default flink-conf.yaml. Unless you have removed this config
> >> option
> >> > or
> >> > you are using a custom yaml, you should be seeing this behavior in
> Flink
> >> > 1.9.
> >> > If you do not want region failover, set
> >> >
> >> >     jobmanager.execution.failover-strategy: full
> >> >
> >> >
> >> We are using the default (the jobmanager.execution.failover-strategy
> >> setting is not present in our flink config).
> >>
> >> The change in behavior I see is between the 1.9 based deployment and the
> >> 1.10 RC.
> >>
> >> Our 1.9 branch is here:
> >> https://github.com/lyft/flink/tree/release-1.9-lyft
> >>
> >> I also notice that the exception causing a restart is no longer
> displayed
> >> in the UI, which is probably related?
> >>
> >>
> >> >
> >> > > 1) Is the low watermark display in the UI still broken?
> >> >
> >> > I was not aware that this is broken. Is there an issue tracking this
> >> bug?
> >> >
> >>
> >> The watermark issue was
> https://issues.apache.org/jira/browse/FLINK-14470
> >>
> >> (I don't have a good way to verify it is fixed at the moment.)
> >>
> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> >> metric is missing. (I have not been able to investigate this further
> yet.)
> >>
> >>
> >> >
> >> > Best,
> >> > Gary
> >> >
> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> >> >
> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
> >> >
> >> >> I opened a PR for FLINK-15868
> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> >> >> https://github.com/apache/flink/pull/11006
> >> >>
> >> >> With that change, I was able to run an application that consumes from
> >> >> Kinesis.
> >> >>
> >> >> I should have data tomorrow regarding the performance.
> >> >>
> >> >> Two questions/observations:
> >> >>
> >> >> 1) Is the low watermark display in the UI still broken?
> >> >> 2) Was there a change in how job recovery reflects in the uptime
> >> metric?
> >> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
> >> >> increasing)?
> >> >>
> >> >> Thanks,
> >> >> Thomas
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
> >> >>
> >> >> > I found another issue with the Kinesis connector:
> >> >> >
> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
> >> >> >
> >> >> >
> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
> >> >> >
> >> >> >> Hi everyone,
> >> >> >>
> >> >> >> I am hereby canceling the vote due to:
> >> >> >>
> >> >> >>     FLINK-15837
> >> >> >>     FLINK-15840
> >> >> >>
> >> >> >> Another RC will be created later today.
> >> >> >>
> >> >> >> Best,
> >> >> >> Gary
> >> >> >>
> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]>
> wrote:
> >> >> >>
> >> >> >> > Hi everyone,
> >> >> >> > Please review and vote on the release candidate #1 for the
> version
> >> >> >> 1.10.0,
> >> >> >> > as follows:
> >> >> >> > [ ] +1, Approve the release
> >> >> >> > [ ] -1, Do not approve the release (please provide specific
> >> comments)
> >> >> >> >
> >> >> >> >
> >> >> >> > The complete staging area is available for your review, which
> >> >> includes:
> >> >> >> > * JIRA release notes [1],
> >> >> >> > * the official Apache source release and binary convenience
> >> releases
> >> >> to
> >> >> >> be
> >> >> >> > deployed to dist.apache.org [2], which are signed with the key
> >> with
> >> >> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> >> >> >> > * all artifacts to be deployed to the Maven Central Repository
> >> [4],
> >> >> >> > * source code tag "release-1.10.0-rc1" [5],
> >> >> >> >
> >> >> >> > The announcement blog post is in the works. I will update this
> >> voting
> >> >> >> > thread with a link to the pull request soon.
> >> >> >> >
> >> >> >> > The vote will be open for at least 72 hours. It is adopted by
> >> >> majority
> >> >> >> > approval, with at least 3 PMC affirmative votes.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Yu & Gary
> >> >> >> >
> >> >> >> > [1]
> >> >> >> >
> >> >> >>
> >> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> >> >> >> > [2]
> >> https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >> >> > [4]
> >> >> >>
> >> https://repository.apache.org/content/repositories/orgapacheflink-1325
> >> >> >> > [5]
> >> https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: [VOTE] Release 1.10.0, release candidate #1

Gary Yao-4
It is indeed unfortunate that these issues are discovered only now. I think
Thomas has a valid point, and we might be risking the trust of our users
here.

What are our options?

    1. Document this behavior and how to work around it thoroughly in the
       release notes [1]
    2. Try to restore the previous behavior
    3. Change the default value of jobmanager.scheduler to "legacy" and roll
       out the feature in 1.11
    4. Change the default value of jobmanager.scheduler to "legacy" and roll
       out the feature in 1.10.1 at the earliest

[1]
https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86

On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen <[hidden email]> wrote:

> Should we make these a blocker? I am not sure - we could also clearly
> state in the release notes how to restore the old behavior, if your setup
> assumes that behavior.
>
> Release candidates for this release have been out since mid-December, so it
> is a bit unfortunate that these issues have been raised so late.
> Treating these rather open-ended tickets (how to redefine the existing
> metrics in the new scheduler/failover handling) as release blockers now
> would mean that 1.10 is delayed indefinitely.
>
> Are we sure we want to do that?
>
> On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <[hidden email]> wrote:
>
>> Hi Gary,
>>
>> Thanks for the clarification!
>>
>> When we upgrade to a new Flink release, we don't start with a default
>> flink-conf.yaml but upgrade our existing tooling and configuration.
>> Therefore we noticed this issue as part of the upgrade to 1.10, and not
>> when we upgraded to 1.9.
>>
>> I would expect many other users to be in the same camp, so should we
>> consider making these regressions a blocker for 1.10?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <[hidden email]> wrote:
>>
>> > > also notice that the exception causing a restart is no longer displayed
>> > > in the UI, which is probably related?
>> >
>> > Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
>> > to track this. Moreover, I created a ticket about the uptime metric not
>> > resetting [2]. Both issues already exist in 1.9 if
>> > "jobmanager.execution.failover-strategy" is set to "region", which is the
>> > case in the default flink-conf.yaml.
>> >
>> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
>> > fall back to the previous behavior.
>> >
>> > In 1.10, you can still fall back to the previous behavior by setting
>> > "jobmanager.scheduler: legacy" and unsetting
>> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml.
>> >
>> > I would not consider these issues blockers since there is a workaround for
>> > them, but of course we would like to see the new scheduler getting some
>> > production exposure. More detailed release notes about the caveats of the
>> > new scheduler will be added to the user documentation.
>> >
>> >
>> > > The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>> >
>> > This should be fixed now [3].
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-15917
>> > [2] https://issues.apache.org/jira/browse/FLINK-15918
>> > [3] https://issues.apache.org/jira/browse/FLINK-8949
>> >
>> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <[hidden email]> wrote:
>> >
>> >> Hi Gary,
>> >>
>> >> Thanks for the reply.
>> >>
>> >> -->
>> >>
>> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <[hidden email]> wrote:
>> >>
>> >> > Hi Thomas,
>> >> >
>> >> > > 2) Was there a change in how job recovery reflects in the uptime
>> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it just
>> >> > > keeps increasing)?
>> >> >
>> >> > The uptime is the difference between the current time and the time when
>> >> > the job transitioned to RUNNING state. By default we no longer
>> >> > transition the job out of the RUNNING state when restarting. This has
>> >> > something to do with the new scheduler which enables pipelined region
>> >> > failover by default [1]. Actually we enabled pipelined region failover
>> >> > already in the binary distribution of Flink 1.9 by setting:
>> >> >
>> >> >     jobmanager.execution.failover-strategy: region
>> >> >
>> >> > in the default flink-conf.yaml. Unless you have removed this config
>> >> > option or you are using a custom yaml, you should be seeing this
>> >> > behavior in Flink 1.9. If you do not want region failover, set
>> >> >
>> >> >     jobmanager.execution.failover-strategy: full
>> >> >
>> >> >
>> >> We are using the default (the jobmanager.execution.failover-strategy
>> >> setting is not present in our flink config).
>> >>
>> >> The change in behavior I see is between the 1.9 based deployment and the
>> >> 1.10 RC.
>> >>
>> >> Our 1.9 branch is here:
>> >> https://github.com/lyft/flink/tree/release-1.9-lyft
>> >>
>> >> I also notice that the exception causing a restart is no longer displayed
>> >> in the UI, which is probably related?
>> >>
>> >>
>> >> >
>> >> > > 1) Is the low watermark display in the UI still broken?
>> >> >
>> >> > I was not aware that this is broken. Is there an issue tracking this bug?
>> >> >
>> >>
>> >> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>> >>
>> >> (I don't have a good way to verify it is fixed at the moment.)
>> >>
>> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
>> >> metric is missing. (I have not been able to investigate this further yet.)
>> >>
>> >>
>> >> >
>> >> > Best,
>> >> > Gary
>> >> >
>> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
>> >> >
>> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <[hidden email]> wrote:
>> >> >
>> >> >> I opened a PR for FLINK-15868
>> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
>> >> >> https://github.com/apache/flink/pull/11006
>> >> >>
>> >> >> With that change, I was able to run an application that consumes from
>> >> >> Kinesis.
>> >> >>
>> >> >> I should have data tomorrow regarding the performance.
>> >> >>
>> >> >> Two questions/observations:
>> >> >>
>> >> >> 1) Is the low watermark display in the UI still broken?
>> >> >> 2) Was there a change in how job recovery reflects in the uptime metric?
>> >> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
>> >> >> increasing)?
>> >> >>
>> >> >> Thanks,
>> >> >> Thomas
>> >> >>
>> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <[hidden email]> wrote:
>> >> >>
>> >> >> > I found another issue with the Kinesis connector:
>> >> >> >
>> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <[hidden email]> wrote:
>> >> >> >
>> >> >> >> Hi everyone,
>> >> >> >>
>> >> >> >> I am hereby canceling the vote due to:
>> >> >> >>
>> >> >> >>     FLINK-15837
>> >> >> >>     FLINK-15840
>> >> >> >>
>> >> >> >> Another RC will be created later today.
>> >> >> >>
>> >> >> >> Best,
>> >> >> >> Gary
>> >> >> >>
>> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <[hidden email]> wrote:
>> >> >> >>
>> >> >> >> > Hi everyone,
>> >> >> >> > Please review and vote on the release candidate #1 for the version
>> >> >> >> > 1.10.0, as follows:
>> >> >> >> > [ ] +1, Approve the release
>> >> >> >> > [ ] -1, Do not approve the release (please provide specific comments)
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > The complete staging area is available for your review, which includes:
>> >> >> >> > * JIRA release notes [1],
>> >> >> >> > * the official Apache source release and binary convenience releases to
>> >> >> >> > be deployed to dist.apache.org [2], which are signed with the key with
>> >> >> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>> >> >> >> > * all artifacts to be deployed to the Maven Central Repository [4],
>> >> >> >> > * source code tag "release-1.10.0-rc1" [5],
>> >> >> >> >
>> >> >> >> > The announcement blog post is in the works. I will update this voting
>> >> >> >> > thread with a link to the pull request soon.
>> >> >> >> >
>> >> >> >> > The vote will be open for at least 72 hours. It is adopted by majority
>> >> >> >> > approval, with at least 3 PMC affirmative votes.
>> >> >> >> >
>> >> >> >> > Thanks,
>> >> >> >> > Yu & Gary
>> >> >> >> >
>> >> >> >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>> >> >> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >> >> >> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
>> >> >> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>

Re: [VOTE] Release 1.10.0, release candidate #1

Arvid Heise-3
Couldn't we treat a missing option as legacy, but set the new scheduler as
the default value in all newly shipped flink-conf.yaml files?

In this way, old users get the old behavior (either implicitly or
explicitly) unless they explicitly upgrade.
New users benefit from the new scheduler.
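
A rough sketch of the lookup rule proposed above, in Python for brevity. This is not Flink code: the helper name resolve_scheduler and the value "new" for the new scheduler are placeholders chosen for illustration, and only the key jobmanager.scheduler and the value "legacy" come from this thread.

```python
from typing import Mapping

SCHEDULER_KEY = "jobmanager.scheduler"  # option name taken from this thread
LEGACY = "legacy"                       # value taken from this thread
NEW_SCHEDULER = "new"                   # placeholder value, an assumption


def resolve_scheduler(conf: Mapping[str, str]) -> str:
    """Return the scheduler to use; a missing option falls back to legacy."""
    return conf.get(SCHEDULER_KEY, LEGACY)


# An existing deployment that carries its own config forward never set the key.
existing_conf = {"jobmanager.execution.failover-strategy": "full"}

# A freshly shipped flink-conf.yaml would set the key explicitly.
shipped_conf = {SCHEDULER_KEY: NEW_SCHEDULER}

print(resolve_scheduler(existing_conf))
print(resolve_scheduler(shipped_conf))
```

With this rule, an upgraded configuration keeps the old behavior until someone adds the key, while a fresh install picks up whatever the shipped file sets.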

On Wed, Feb 5, 2020 at 8:13 PM Gary Yao <[hidden email]> wrote:

> It is indeed unfortunate that these issues are discovered only now. I think
> Thomas has a valid point, and we might be risking the trust of our users
> here.
>
> What are our options?
>
>     1. Document this behavior and how to work around it thoroughly in the
>        release notes [1]
>     2. Try to restore the previous behavior
>     3. Change the default value of jobmanager.scheduler to "legacy" and roll
>        out the feature in 1.11
>     4. Change the default value of jobmanager.scheduler to "legacy" and roll
>        out the feature in 1.10.1 at the earliest
>
> [1]
>
> https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86

Re: [VOTE] Release 1.10.0, release candidate #1

Thomas Weise
I think that's a good idea.

(Opt-in for existing users, until the backward compatibility issues are
resolved.)


On Wed, Feb 5, 2020 at 11:57 AM Arvid Heise <[hidden email]> wrote:

> Couldn't we treat a missing option as legacy, but set the new scheduler as
> the default value in all newly shipped flink-conf.yaml files?
>
> In this way, old users get the old behavior (either implicitly or
> explicitly) unless they explicitly upgrade.
> New users benefit from the new scheduler.