[VOTE] Release Apache Flink 1.1.4 (RC3)

[VOTE] Release Apache Flink 1.1.4 (RC3)

Ufuk Celebi-2
Dear Flink community,

Please vote on releasing the following candidate as Apache Flink version 1.1.4.

The commit to be voted on:
2cd6579 (http://git-wip-us.apache.org/repos/asf/flink/commit/2cd6579)

Branch:
release-1.1.4-rc3
(https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.4-rc3)

The release artifacts to be voted on can be found at:
http://people.apache.org/~uce/flink-1.1.4-rc3/

The release artifacts are signed with the key with fingerprint 9D403309:
http://www.apache.org/dist/flink/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapacheflink-1109

-------------------------------------------------------------

The voting period is at least three days, and the vote passes if a
majority of at least three +1 PMC votes are cast. The vote ends no earlier
than Friday, December 16th, 2016, at 11 PM (CET)/2 PM (PST).

[ ] +1 Release this package as Apache Flink 1.1.4
[ ] -1 Do not release this package, because ...
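
The passing rule stated above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `vote_passes` function and the ballot representation are hypothetical, and "majority" is read here as more binding +1 votes than binding -1 votes.

```python
from typing import Iterable, Tuple


def vote_passes(ballots: Iterable[Tuple[int, bool]],
                min_binding_plus_ones: int = 3) -> bool:
    """Return True if the vote passes under the assumed rule.

    Each ballot is (value, is_pmc): value is +1 or -1, and only votes
    cast by PMC members are counted as binding.
    """
    binding = [value for value, is_pmc in ballots if is_pmc]
    plus = sum(1 for value in binding if value == 1)
    minus = sum(1 for value in binding if value == -1)
    # Assumed reading: at least three binding +1 votes AND more +1 than -1.
    return plus >= min_binding_plus_ones and plus > minus
```

For example, three +1 votes of which only two come from PMC members would not be enough under this reading.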

Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Robert Metzger
I'm not sure we can release this release candidate as is, because I'm
running into two issues that are probably related to the recent RocksDB
version upgrade.

This is my list of points so far:

- Checked the staging repository. Quickstarts and Hadoop 1 / 2 are okay.
- Built a job against the staging repository.
- The binaries deploy on a kerberized HA YARN / HDFS setup. Ran the KMeans and
WordCount batch jobs.
- Executed a heavy, misbehaving streaming job for a few hours. While running
that job, I found that:
  - Not all checkpoint directories are cleaned up in HDFS (I use the async
RocksDB state backend).
  - RocksDB segfaults (8 segfaults in ~3 hrs, but all of them happened in
the last few minutes).
  - YARN killed containers for running "beyond physical memory limits" (I know
we can configure this, I just wonder whether we should change the default
value).
  - The segfaults and memory limits eventually kept the job from running at
all, because it was stuck in a constant retry loop.
  - I found this non-blocking issue during testing:
https://issues.apache.org/jira/browse/FLINK-5345
  - This is also a non-blocking issue for 1.1.4 (fixed for 1.2):
https://issues.apache.org/jira/browse/FLINK-4631


Let me know if we should release anyway or fix these issues first.



Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Flavio Pompermaier
I personally think it is quite important to also include a fix for
the ES connector (https://issues.apache.org/jira/browse/FLINK-5122).

Best,
Flavio


Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Ufuk Celebi-2
In reply to this post by Robert Metzger
If the memory consumption behaviour changed since 1.1.3, I think that
this is a blocker for the release. I would not like users to upgrade
their Flink installation and all of a sudden run into killed
containers. I would also not blindly change the default memory fraction
without understanding the root cause of this.

The other issues you found would be nice to fix, but the fixes can also go
into 1.1.5 imo.

I checked the following so far:
- Files are cleaned up on checkpoint failures (FS backend)
- Streams are eagerly closed on cancellation (FS backend)
- Non-recoverable jobs in ZooKeeper are skipped and a warning is given
- StreamingStateMachine job with HA (killing TMs and JM)
- Manually compared stream consumption behaviour with 1.1.3 and RC2
(comparing the "fairness")

Especially the last fix is very important and should get out as soon
as possible.


Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Gyula Fóra
In reply to this post by Flavio Pompermaier
@Robert

- I am not sure the RocksDB problems are closely related to the version
upgrade; I have been experiencing similar problems for months. I think this
is usually not a huge problem on YARN; it mostly hurts in standalone
clusters.
- Also, the YARN memory limits are tricky to configure well, as that depends
a lot on how RocksDB handles native memory, which seems to grow quite a lot
over time.
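
To make the YARN limit concrete, here is a rough Python sketch of how a container's memory budget splits between the JVM heap and everything else, including RocksDB's native memory. It assumes a cutoff-style scheme in which a fraction of the container (with a lower bound) is withheld from the heap. The function names, parameters, and numbers are illustrative only; they are not taken from this thread and are not claimed to be Flink's defaults.

```python
def heap_and_native_budget_mb(container_mb: int,
                              cutoff_ratio: float,
                              cutoff_min_mb: int) -> tuple:
    """Split a container into (heap_mb, non_heap_mb) under the assumed scheme."""
    # Withhold a fraction of the container from the heap, but never less
    # than the configured minimum.
    cutoff = max(int(container_mb * cutoff_ratio), cutoff_min_mb)
    return container_mb - cutoff, cutoff


def exceeds_container(container_mb: int,
                      heap_used_mb: int,
                      native_used_mb: int) -> bool:
    """True if total usage is over the container size (YARN would kill it)."""
    return heap_used_mb + native_used_mb > container_mb


# Hypothetical numbers: a 4096 MB container with a 25% cutoff.
heap_mb, non_heap_mb = heap_and_native_budget_mb(4096, 0.25, 384)
```

If native memory keeps growing past the non-heap share while the heap is full, `exceeds_container` becomes true, which matches the "beyond physical memory limits" killings described earlier in the thread.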



Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Ufuk Celebi-2
-1 Changing the memory consumption between minor releases should not happen.

The good news: Robert ran a test with the latest 1.1 branch that
contains a fix for the changed RocksDB memory configuration and
reported stable behaviour.

@Flavio: I agree, but since we're already very late with this bugfix
release I would not like to wait for the PR to be merged. We can
include it in 1.1.5, which can follow very soon imo. I hope that's OK
for you.


Re: [VOTE] Release Apache Flink 1.1.4 (RC3)

Flavio Pompermaier
OK Ufuk, it's not so urgent indeed ;)

Thanks anyway
