Dear Flink community,

Please vote on releasing the following candidate as Apache Flink version 1.1.4.

The commit to be voted on:
2cd6579 (http://git-wip-us.apache.org/repos/asf/flink/commit/2cd6579)

Branch:
release-1.1.4-rc3
(https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.4-rc3)

The release artifacts to be voted on can be found at:
http://people.apache.org/~uce/flink-1.1.4-rc3/

The release artifacts are signed with the key with fingerprint 9D403309:
http://www.apache.org/dist/flink/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapacheflink-1109

-------------------------------------------------------------

The voting time is at least three days and the vote passes if a majority of at least three +1 PMC votes are cast.

The vote ends earliest on Friday, December 16th, 2016, at 11 PM (CET)/2 PM (PST).

[ ] +1 Release this package as Apache Flink 1.1.4
[ ] -1 Do not release this package, because ...

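The passing rule in the announcement above can be sketched as a small check. This is only an illustration, not tooling used by the project: the function name `vote_passes` and its parameters are hypothetical, and it assumes that only PMC votes are counted and that "majority" means strictly more +1 than -1 votes.

```python
from datetime import datetime


def vote_passes(plus_one_pmc, minus_one_pmc, earliest_end, now, min_plus_one=3):
    """Return True if the vote could be closed as passed at time `now`.

    Assumptions (not stated in the announcement): only PMC votes count,
    and "majority" means strictly more +1 than -1 votes.
    """
    voting_period_over = now >= earliest_end
    enough_plus_one = plus_one_pmc >= min_plus_one
    majority = plus_one_pmc > minus_one_pmc
    return voting_period_over and enough_plus_one and majority


# Earliest end of the vote as announced: Friday, December 16th, 2016, 11 PM (CET).
# Naive datetimes are used here; both arguments must be in the same time zone.
earliest_end = datetime(2016, 12, 16, 23, 0)
```

With this rule, three +1 PMC votes and no -1 votes would pass once the voting period is over, while the same tally an hour earlier would not.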
I'm not sure if we can release the release candidate like this, because I'm running into two issues probably related to a recent RocksDB version upgrade.

This is my list of points so far:

- Checked the staging repository. Quickstarts and Hadoop 1 / 2 are okay.
- Built a job against the staging repository.
- Binaries deploy on a kerberized HA YARN / HDFS setup. Ran the KMeans and WordCount batch jobs.
- Executed a heavy, misbehaved streaming job for a few hours. While running that job, I found that:
  - Not all checkpoint directories are cleaned up in HDFS (I use the async RocksDB state backend).
  - Segfaults from RocksDB (8 segfaults in ~3 hrs, but they all happened in the last minutes).
  - "beyond physical memory limits" container killings from YARN (I know we can configure this, I just wonder whether we should change the default value).
  - The segfaults and memory limits caused the job to not run anymore in the end because it was in a constant retry loop.
- This is a non-blocking issue I found during the testing: https://issues.apache.org/jira/browse/FLINK-5345
- This is also a non-blocking issue for 1.1.4 (fixed for 1.2): https://issues.apache.org/jira/browse/FLINK-4631

Let me know if we should release anyway or fix these issues first.

On Tue, Dec 13, 2016 at 11:04 PM, Ufuk Celebi <[hidden email]> wrote:
> Dear Flink community,
> [...]

I personally think it would be quite important to also have a fix for the ES connector (https://issues.apache.org/jira/browse/FLINK-5122).

Best,
Flavio

On Fri, Dec 16, 2016 at 10:43 AM, Robert Metzger <[hidden email]> wrote:
> I'm not sure if we can release the release candidate like this, because I'm
> running into two issues probably related to a recent rocksdb version
> upgrade.
> [...]

If the memory consumption behaviour changed since 1.1.3, I think that this is a blocker for the release. I would not like users to upgrade their Flink installation and all of a sudden run into killed containers. I would not blindly change the default memory fraction without understanding the root cause of this.

The other issues you found are nice to fix, but can also be fixed in 1.1.5 imo.

I checked the following so far:

- Files are cleaned up on checkpoint failures (FS backend)
- Streams are eagerly closed on cancellation (FS backend)
- Non-recoverable jobs in ZooKeeper are skipped and a warning is given
- StreamingStateMachine job with HA (killing TMs and JM)
- Manually compared stream consumption behaviour with 1.1.3 and RC2 (comparing the "fairness")

Especially the last fix is very important and should get out as soon as possible.

On Fri, Dec 16, 2016 at 10:43 AM, Robert Metzger <[hidden email]> wrote:
> I'm not sure if we can release the release candidate like this, because I'm
> running into two issues probably related to a recent rocksdb version
> upgrade.
> [...]

@Robert

- I am not sure if the RocksDB problems are closely related to the version upgrade; I have been experiencing similar problems for months. This is usually not a huge problem on YARN I think, it mostly hurts in standalone clusters.
- Also, the YARN memory limits are tricky to configure nicely, as it depends a lot on how RocksDB handles native memory. It seems to grow quite a lot over time.

Flavio Pompermaier <[hidden email]> wrote (on Fri, Dec 16, 2016, 10:56):
> I personally think that it should be quite important to have a fix also for
> the ES connector (https://issues.apache.org/jira/browse/FLINK-5122).
> [...]

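Why native memory growth matters for the YARN limits mentioned above can be shown with a rough sketch of how a container's memory budget is split. It assumes the JVM heap is the container size minus a reserved headroom (the larger of a ratio and a minimum), and that native memory such as RocksDB's has to fit into that headroom. The function names and all numbers are illustrative assumptions, not values from this thread or confirmed Flink defaults.

```python
def split_container_memory(container_mb, cutoff_ratio, cutoff_min_mb):
    """Split a container's memory into JVM heap and reserved headroom.

    Assumption: headroom = max(cutoff_min_mb, container_mb * cutoff_ratio),
    and everything else is given to the JVM heap.
    """
    headroom_mb = max(cutoff_min_mb, int(container_mb * cutoff_ratio))
    heap_mb = container_mb - headroom_mb
    return heap_mb, headroom_mb


def exceeds_container_limit(container_mb, heap_mb, native_mb):
    """True if heap plus native (e.g. RocksDB) memory is over the limit.

    Simplification: JVM overhead other than the heap is ignored here.
    """
    return heap_mb + native_mb > container_mb


# Illustrative numbers only: a hypothetical 4 GB container with a 25% cutoff.
heap_mb, headroom_mb = split_container_memory(4096, 0.25, 600)
```

In this sketch the container would be killed as soon as native memory grows past the reserved headroom, which is why a state backend whose native memory keeps growing makes a fixed default hard to pick.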
-1 Changing the memory consumption between minor releases should not happen.

The good news: Robert ran a test with the latest 1.1 branch that contains a fix for the changed RocksDB memory configuration and reported stable behaviour.

@Flavio: I agree, but since we're already very late with this bugfix release I would not like to wait for the PR to be merged. We can include it in 1.1.5, which can follow very soon imo. I hope that's OK for you.

On Fri, Dec 16, 2016 at 11:07 AM, Gyula Fóra <[hidden email]> wrote:
> @Robert
> [...]

Ok Ufuk, it's not so urgent indeed ;)
Thanks anyway

On Mon, Dec 19, 2016 at 11:18 AM, Ufuk Celebi <[hidden email]> wrote:
> -1 Changing the memory consumption between minor releases should not
> happen.
> [...]