Dear Flink community,
Please vote on releasing the following candidate as Apache Flink version 1.2.0.

The commit to be voted on:
8b5b6a8b (http://git-wip-us.apache.org/repos/asf/flink/commit/8b5b6a8b)

Branch:
release-1.2.0-rc2
(https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.2.0-rc2)

The release artifacts to be voted on can be found at:
http://people.apache.org/~rmetzger/flink-1.2.0-rc2/

The release artifacts are signed with the key with fingerprint D9839159:
http://www.apache.org/dist/flink/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapacheflink-1113

-------------------------------------------------------------

I would like to keep Friday as the target release time. Please let me know if you want me to move the deadline to Monday because you need more time for testing.

The vote ends on Friday, January 27, 2017, 6pm CET.

Please test the release now rather than on Friday morning, so that we can cancel it as early as possible if needed.
To make testing easier, I've created this document to track what has already been tested and what still needs to be tested: https://docs.google.com/document/d/1MX-8l9RrLly3UmZMODHBnuZUrK_n-DGIBLjFKyCrTAs/edit?usp=sharing
Feel free to add more tests or change existing ones.

[ ] +1 Release this package as Apache Flink 1.2.0
[ ] -1 Do not release this package, because ...
|
I ran some tests and found the following issues:
https://issues.apache.org/jira/browse/FLINK-5663: Checkpoint fails because of closed registry
=> This happened a couple of times for the first checkpoints after submitting a job. If it happened on every submission I would definitely make this a blocker, but I ran into it in only about 3 out of 10 job submissions. What do we make of this?

https://issues.apache.org/jira/browse/FLINK-5665: When the failures happened, I also had some lingering 0-byte files.

https://issues.apache.org/jira/browse/FLINK-5664: I also found the logging of the RocksDB backend a little noisy (at least for my local setup with many tasks per TM and a low checkpointing interval).

All in all, I'm not sure whether we want to make these blockers or not. I'm fine either way, with a follow-up 1.2.1 release.

===

- Verified signatures and checksums
- Checked out the Java quickstarts and ran the jobs
- All poms point to 1.2.0
- Migrated multiple jobs via savepoint from 1.1.4 to 1.2.0 with Kryo types, session windows (w/o lateness), operator and keyed state for all three backends
- Rescaled the same jobs from 1.2.0 savepoints with all three backends
- Verified the "migration namespace serializer" fix
- Ran streaming state machine with Kafka source, RocksDB backend and master and worker failures (standalone cluster)

On Wed, Jan 25, 2017 at 9:14 PM, Robert Metzger <[hidden email]> wrote:
> [...]
|
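For readers who want to reproduce the "Verified signatures and checksums" item from the checklist in the message above, the checksum half can be sketched roughly as follows. This is a minimal sketch under assumptions, not the procedure anyone in this thread actually used: the thread does not say which digest algorithm the release candidate's checksum files use, so `sha512` is only a default here, and the helper names (`file_digest`, `parse_expected_digest`, `checksum_matches`) are made up for illustration. It also assumes the checksum file puts the hex digest first, followed by the file name. Signature verification against the KEYS file is a separate step and is not covered.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_expected_digest(checksum_text: str) -> str:
    """Extract the hex digest from the text of a checksum file.

    Assumption: the digest is the first whitespace-separated token, as in the
    "<digest>  <filename>" layout. Other layouts need their own parser.
    """
    return checksum_text.split()[0].lower()


def checksum_matches(artifact: Path, checksum_file: Path, algorithm: str = "sha512") -> bool:
    """Compare the artifact's computed digest with the one stored next to it."""
    expected = parse_expected_digest(checksum_file.read_text())
    return file_digest(artifact, algorithm) == expected
```

A tester would call `checksum_matches` once per downloaded artifact, passing the algorithm that matches the checksum file's extension, and treat any `False` result as a reason to vote -1 until it is explained.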
Robert also found an issue that pending checkpoint files are not properly cleaned up: https://issues.apache.org/jira/browse/FLINK-5660. To my surprise, the issue was already fixed in 1.1.4, so I guess I forgot to forward-port the fix. There is a pending PR to fix it. The fix could also be part of a 1.2.1 release.

Cheers,
Till

On Thu, Jan 26, 2017 at 6:04 PM, Ufuk Celebi <[hidden email]> wrote:
> [...]
|
I have found another problem: Under certain circumstances Flink can lose state data by completing an invalid checkpoint. https://issues.apache.org/jira/browse/FLINK-5667.

Cheers,
Till

On Thu, Jan 26, 2017 at 6:27 PM, Till Rohrmann <[hidden email]> wrote:
> [...]
|
@Till - I think that FLINK-5667 is a blocker
Good catch finding it!

On Thu, Jan 26, 2017 at 7:51 PM, Till Rohrmann <[hidden email]> wrote:
> [...]
|
Damn. I had really hoped that this RC would go through.
I propose to keep RC2 open until we've fixed all issues mentioned here, and to gather some more testing feedback in the meantime.

On Thu, Jan 26, 2017 at 8:06 PM, Stephan Ewen <[hidden email]> wrote:
> [...]
|
Hi,
Aside from the issues mentioned above I have some good news as well.

I have finished porting and started testing one of our major production jobs (RBea) on 1.2 and everything seems to run well so far, with savepoints, rescaling, externalized checkpoints, metrics etc. on YARN.

In this job I use windowing, RocksDB state, iterations, timers, broadcast states, repartitionable operator states etc. and everything seems to be working extremely well under normal circumstances.

So far I mostly ran sunny day tests but I will continue testing with a larger load and some failure scenarios. I will keep you posted.

Great job!
Gyula

Robert Metzger <[hidden email]> wrote (on Thu, Jan 26, 2017, 21:28):
> [...]
|
Thanks Gyula!
The current state of things is:
- Stefan is working on a fix for https://issues.apache.org/jira/browse/FLINK-5663.
- Till is working on https://issues.apache.org/jira/browse/FLINK-5667.

As far as I can tell, these will be fixed today and we are ready to go for RC3.

I resolved the other issues I created.

– Ufuk

On 26 January 2017 at 22:16:26, Gyula Fóra ([hidden email]) wrote:
> [...]
|
I think this issue that Ufuk opened is also a blocker:
https://issues.apache.org/jira/browse/FLINK-5670

As I commented in the issue, at least one bigger user of Flink has run into this problem on their cluster. |
Thank you for testing the RC, Gyula!
Regarding the other reported JIRAs:

These issues are resolved:
https://issues.apache.org/jira/browse/FLINK-5670 (merged)
https://issues.apache.org/jira/browse/FLINK-5667 (merged)
https://issues.apache.org/jira/browse/FLINK-5660 (merged)
https://issues.apache.org/jira/browse/FLINK-5665 (duplicate)
https://issues.apache.org/jira/browse/FLINK-5664 (wontfix)

Unresolved:
https://issues.apache.org/jira/browse/FLINK-5663 (pending PR merge)

I'll create RC3 once FLINK-5663 has been merged to the "release-1.2" branch, so that we can start testing and voting on Monday morning again. |