(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Releasing Flink 1.1.4

Ufuk Celebi-2

[DISCUSS] Releasing Flink 1.1.4

Hey all,

I would like to start the discussion for kicking off the next bug fix
release, Flink 1.1.4. What do you think about aiming for an RC by the end
of this week?

Users reported some instabilities/inconveniences that would be good to fix.

Personally, I would like to backport the following fixes:

(1) https://issues.apache.org/jira/browse/FLINK-4619: Answer client if
savepoint restore fails (Already merged for master, needs minimal
adjustment for 1.1)
(2) https://issues.apache.org/jira/browse/FLINK-4715: Safety net for
stuck task cancellation (Already reviewed for master, waiting for
tests of the backport to finish)
(3) https://issues.apache.org/jira/browse/FLINK-4510: Always create
CheckpointCoordinator (Already merged for master, needs minimal
adjustments for 1.1)

Furthermore, I would like to address the following:

(4) https://issues.apache.org/jira/browse/FLINK-4445: Add option to
ignore unmatched state when restoring from savepoint
(5) https://issues.apache.org/jira/browse/FLINK-4894: Don't block on
buffer request after broadcast event

Strictly speaking, (4) is not a bug fix. But given that it would
only add an optional flag to savepoint restoring and should have been
addressed for 1.1.0 already, I would like to get it in.
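
To illustrate what the optional flag in (4) would mean, here is a minimal sketch in Python. It is not Flink's implementation: the function name `restore_state`, the `ignore_unmatched` parameter, and the dict-based savepoint layout are all assumptions made for illustration. The idea is only that restoring fails by default when the savepoint contains state for an operator that no longer exists in the job, and that the flag turns this failure into a silent skip.

```python
def restore_state(savepoint_state, job_operator_ids, ignore_unmatched=False):
    """Map savepoint state onto the operators of the job being started.

    savepoint_state: dict of operator id -> state (hypothetical layout).
    job_operator_ids: ids of the operators present in the new job.
    ignore_unmatched: hypothetical flag; if True, drop state without an operator.
    """
    job_operator_ids = set(job_operator_ids)
    # State entries whose operator is missing from the new job.
    unmatched = sorted(op for op in savepoint_state if op not in job_operator_ids)
    if unmatched and not ignore_unmatched:
        raise ValueError("savepoint state has no matching operator: %s" % unmatched)
    # Keep only the state that can be assigned to an existing operator.
    return {op: st for op, st in savepoint_state.items() if op in job_operator_ids}
```

Keeping the strict behavior as the default means existing users see no change unless they opt in, which fits the argument that this is only an optional addition.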

Stephan Ewen

Re: [DISCUSS] Releasing Flink 1.1.4

Thanks for starting this, Ufuk.

I would like to add the following issues to 1.1.4:

Build errors due to Storm dependencies *(fix pending)*
- [FLINK-4298] [storm compatibility] Add proper repository for Clojure
dependencies.

Stability on S3 considering eventual consistency *(fix pending)*
- [FLINK-4218] [checkpoints] Do not fail checkpoints when state size
cannot be determined

Avoiding Zombie TaskManagers *(still needs to be done)*
- [FLINK-3347] [akka] TaskManager (or its ActorSystem) need to restart
in case they notice quarantine

Adding a limit to the amount of data spilled during checkpoint alignments
*(fix is work in progress)*
- [FLINK-4904] [checkpoints] Add a limit for how much data may be
spilled in checkpoint alignments

I can push the first two fixes to the 1.1.4 branch in a bit, the fourth one
later today.
The third one (akka) is still pending.

Best,
Stephan

On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <[hidden email]> wrote:

> Hey all,
>
> I would like to start the discussion for kicking off the next bug fix
> release, Flink 1.1.4. What do you think about aiming for an RC by the end
> of this week?
>
> Users reported some instabilities/inconveniences that would be good to fix.
>
> Personally, I would like to backport the following fixes:
>
> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer client if
> savepoint restore fails (Already merged for master, needs minimal
> adjustment for 1.1)
> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety net for
> stuck task cancellation (Already reviewed for master, waiting for
> tests of the backport to finish)
> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always create
> CheckpointCoordinator (Already merged for master, needs minimal
> adjustments for 1.1)
>
> Furthermore, I would like to address the following:
>
> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add option to
> ignore unmatched state when restoring from savepoint
> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't block on
> buffer request after broadcast event
>
> Strictly speaking, (4) is not a bug fix. But given that it would
> only add an optional flag to savepoint restoring and should have been
> addressed for 1.1.0 already, I would like to get it in.
>

Robert Metzger

Re: [DISCUSS] Releasing Flink 1.1.4

+1 for a bugfix release soon.

On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <[hidden email]> wrote:

> Thanks for starting this, Ufuk.
>
> I would like to add the following issues to 1.1.4:
>
> Build errors due to Storm dependencies *(fix pending)*
> - [FLINK-4298] [storm compatibility] Add proper repository for Clojure
> dependencies.
>
> Stability on S3 considering eventual consistency *(fix pending)*
> - [FLINK-4218] [checkpoints] Do not fail checkpoints when state size
> cannot be determined
>
> Avoiding Zombie TaskManagers *(still needs to be done)*
> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need to restart
> in case they notice quarantine
>
> Adding a limit to the amount of data spilled during checkpoint alignments
> *(fix is work in progress)*
> - [FLINK-4904] [checkpoints] Add a limit for how much data may be
> spilled in checkpoint alignments
>
>
> I can push the first two fixes to the 1.1.4 branch in a bit, the fourth one
> later today.
> The third one (akka) is still pending.
>
> Best,
> Stephan

Chesnay Schepler-3

Re: [DISCUSS] Releasing Flink 1.1.4

I've added the following fix to the 1.1 branch

* [FLINK-4875] [metrics] Use correct operator name

It is a crucial fix for streaming topologies that involve multi-chains.
Two users have already run into this.

On 25.10.2016 14:43, Robert Metzger wrote:

> +1 for a bugfix release soon.

Jean-Baptiste Onofré

Re: [DISCUSS] Releasing Flink 1.1.4

In reply to this post by Robert Metzger

+1

Looking forward to this release!

Regards
JB

On Oct 25, 2016, at 14:43, Robert Metzger <[hidden email]> wrote:

>+1 for a bugfix release soon.

Stefan Richter

Re: [DISCUSS] Releasing Flink 1.1.4

+1

I think it could make sense to backport my safety net PR https://github.com/apache/flink/pull/2691 for 1.1.4. The changes are pretty much isolated and it could help a lot with resource leaks and task cancellation times.

Best,
Stefan

> On 26.10.2016 at 07:05, Jean-Baptiste Onofré <[hidden email]> wrote:
>
> +1
>
> Looking forward to this release!
>
> Regards
> JB

mxm

Re: [DISCUSS] Releasing Flink 1.1.4

In reply to this post by Jean-Baptiste Onofré

+1 for a 1.1.4 release

We could backport putting user jars into the system class loader for
per-job Yarn clusters: https://github.com/apache/flink/pull/2692
Arguably, this is somewhat of a new feature, but it gets rid of duplicate
class loading issues users experienced in practice.

We already have the following commits on the release-1.1 branch:

05a5f46 [FLINK-4862] fix Timer register in ContinuousEventTimeTrigger
5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable driver
found for jdbc:calcite"
9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
210230c [FLINK-4829] snapshot accumulators on a best-effort basis
c1d6b24 [FLINK-4829] protect user accumulators against concurrent updates
fe464b4 [FLINK-4709] [core] Fix resource leak in InputStreamFSInputWrapper
9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for InputFormats.
9591d50 [FLINK-4506] [DataSet] Fix documentation of CsvOutputFormat about
incorrect default of allowNullValues
c9433bf [FLINK-3706] Fix YARN test instability
2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI examples.

-Max

On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <[hidden email]>
wrote:

> +1
>
> Looking forward to this release!
>
> Regards
> JB

Stephan Ewen

Re: [DISCUSS] Releasing Flink 1.1.4

Concerning backporting the "I/O streams safety net" - we need to make sure
that this does not change any behavior that users may implicitly expect.

On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <[hidden email]> wrote:

> +1 for a 1.1.4 release
>
> We could backport putting user jars into the system class loader for
> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> Arguably, this is somewhat of a new feature, but it gets rid of duplicate
> class loading issues users experienced in practice.
>
> We already have the following commits on the release-1.1 branch:
>
> 05a5f46 [FLINK-4862] fix Timer register in ContinuousEventTimeTrigger
> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable driver
> found for jdbc:calcite"
> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> c1d6b24 [FLINK-4829] protect user accumulators against concurrent updates
> fe464b4 [FLINK-4709] [core] Fix resource leak in InputStreamFSInputWrapper
> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for InputFormats.
> 9591d50 [FLINK-4506] [DataSet] Fix documentation of CsvOutputFormat about
> incorrect default of allowNullValues
> c9433bf [FLINK-3706] Fix YARN test instability
> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI examples.
>
> -Max

Stefan Richter

Re: [DISCUSS] Releasing Flink 1.1.4

I think changes in behaviour should be limited to the case of streams that are obtained from FileSystem within a task's main thread (or any of its child threads) and are still used after that task has finished.
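
As a rough illustration of that behaviour, here is a language-agnostic sketch written in Python. It is not the code from the PR: the class `StreamSafetyNet` and its methods are hypothetical names, and it assumes the safety net works by tracking streams opened on behalf of a task and closing whatever is still open when the task ends. Under that assumption, only streams that outlive their task are affected.

```python
import io
import threading


class StreamSafetyNet:
    """Tracks streams opened for one task and closes leftovers at task end."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open_streams = set()
        self._closed = False

    def register(self, stream):
        # Called when the task (or one of its child threads) opens a stream.
        with self._lock:
            if self._closed:
                stream.close()
                raise OSError("task already finished")
            self._open_streams.add(stream)
        return stream

    def unregister(self, stream):
        # Called when user code closes or hands off the stream properly.
        with self._lock:
            self._open_streams.discard(stream)

    def close_all(self):
        # Called once when the task finishes or is cancelled.
        with self._lock:
            self._closed = True
            leftovers, self._open_streams = self._open_streams, set()
        for stream in leftovers:
            stream.close()


net = StreamSafetyNet()
leaked = net.register(io.BytesIO(b"data"))  # never closed by user code
net.close_all()                             # task finished: leftover is closed
print(leaked.closed)
```

The behaviour change Stephan asks about shows up in exactly one place here: a stream that is still registered when `close_all()` runs gets closed underneath any code that keeps using it.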

> On 26.10.2016 at 13:02, Stephan Ewen <[hidden email]> wrote:
>
> Concerning backporting the "I/O streams safety net" - we need to make sure
> that this does not change any behavior that users may implicitly expect.

Till Rohrmann

Re: [DISCUSS] Releasing Flink 1.1.4

In reply to this post by Stephan Ewen

I'll work on FLINK-3347. Additionally I would like to get in

- https://issues.apache.org/jira/browse/FLINK-4932: Don't let
ExecutionGraph fail when in state Restarting
- https://issues.apache.org/jira/browse/FLINK-4933:
ExecutionGraph.scheduleOrUpdateConsumers can fail the ExecutionGraph

Cheers,
Till

On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <[hidden email]> wrote:

> Concerning backporting the "I/O streams safety net" - we need to make sure
> that this does not change any behavior that users may implicitly expect.

dan bress

Re: [DISCUSS] Releasing Flink 1.1.4

+1 for this release,
also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
Use correct operator name

Dan

Ufuk Celebi-2

Re: [DISCUSS] Releasing Flink 1.1.4

Thanks for all your feedback.

If there are no objections, I would like to stick to the issues
mentioned in this thread and create RC1 as soon as they are all
addressed. That will probably not happen this week, but next week
looks good.

DONE
=====
- FLINK-4619: Answer client if savepoint restore fails
- FLINK-4715: Safety net for stuck task cancellation
- FLINK-4510: Always create CheckpointCoordinator
- FLINK-4894: Don't block on buffer request after broadcast event
- FLINK-4298: Add proper repository for Closure dependencies
- FLINK-4218: Do not fail checkpoints when state size cannot be determined
- FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
they notice quarantine
- FLINK-4875: Use correct operator name
- FLINK-4913: Include user jars in system class loader

PENDING REVIEW
===============
- FLINK-4445: Add option to ignore unmatched state when restoring from
savepoint => https://github.com/apache/flink/pull/2713
- FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
=> https://github.com/apache/flink/pull/2711
- FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
ExecutionGraph => https://github.com/apache/flink/pull/2701

OPEN
=====
- FLINK-4904: Add a limit for how much data may be spilled in
checkpoint alignments => fix pending
- FLINK-4910: Introduce safety net for closing file system streams =>
@Stephan, Stefan: What's the conclusion of your discussion on whether
to backport this?

Stefan Richter

Re: [DISCUSS] Releasing Flink 1.1.4

The benefit of a backport, as I see it, is increased stability. The danger is that it could break user code that casts FileSystem instances to subtypes such as LocalFileSystem. I don’t know how common that is in user code.
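
To illustrate the casting concern, here is a minimal sketch. It assumes the safety net is implemented by wrapping the file system in another object. All classes below are hypothetical stand-ins written for this example, not Flink's real classes.

```java
// Hypothetical stand-ins for illustration only; these are NOT Flink's classes.
abstract class FileSystem {
    abstract String scheme();
}

class LocalFileSystem extends FileSystem {
    @Override
    String scheme() { return "file"; }

    // A method that only the local subtype offers.
    String workingDirectory() { return "/tmp"; }
}

// Assumed wrapper: forwards calls to a delegate so that it could track
// open streams and close them later.
class SafetyNetFileSystem extends FileSystem {
    private final FileSystem delegate;

    SafetyNetFileSystem(FileSystem delegate) { this.delegate = delegate; }

    @Override
    String scheme() { return delegate.scheme(); }
}

class DowncastDemo {
    // Mimics user code that assumes it was handed the concrete subtype.
    static String workingDirOf(FileSystem fs) {
        return ((LocalFileSystem) fs).workingDirectory();
    }

    public static void main(String[] args) {
        FileSystem plain = new LocalFileSystem();
        FileSystem wrapped = new SafetyNetFileSystem(plain);

        // Works: the object really is a LocalFileSystem.
        System.out.println(workingDirOf(plain));

        try {
            // Fails: the wrapper is a FileSystem, but not a LocalFileSystem.
            workingDirOf(wrapped);
        } catch (ClassCastException e) {
            System.out.println("downcast failed on wrapped file system");
        }
    }
}
```

Under that assumption, code that only uses the FileSystem base type keeps working, while code that downcasts fails at runtime with a ClassCastException.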

Ufuk Celebi-2

Re: [DISCUSS] Releasing Flink 1.1.4

As a quick update: the "pending review" issues have all been resolved.

The following issues are still open:

- FLINK-4904: Add a limit for how much data may be spilled in
checkpoint alignments => fix pending
- FLINK-4910: Introduce safety net for closing file system streams

Any updates here?

– Ufuk

till.rohrmann

Re: [DISCUSS] Releasing Flink 1.1.4

It might make sense to backport

- [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
side: https://github.com/apache/flink/pull/2742

as well. This will allow us to activate quarantine monitoring by
default in 1.1.4 without the risk of killing all TMs if the JM fails.
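
As a rough illustration of the idea, a TM-side heartbeat check could be tracked along the following lines. This is only a sketch under my own assumptions: the class and method names are hypothetical, and it is not the code from the pull request or anything from Akka.

```java
// Hypothetical sketch; not the implementation from the pull request.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    // nowMillis is passed in explicitly so the logic is easy to test.
    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    // Called whenever a heartbeat response from the JM arrives.
    void heartbeatReceived(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    // True once no heartbeat has arrived for longer than the timeout.
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}

class HeartbeatDemo {
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(5_000L, 0L);
        monitor.heartbeatReceived(3_000L);

        // 4 seconds since the last heartbeat: still within the timeout.
        System.out.println(monitor.isTimedOut(7_000L));
        // 6 seconds since the last heartbeat: the JM is considered lost.
        System.out.println(monitor.isTimedOut(9_000L));
    }
}
```

With a check like this, the TM decides on its own when the JM is unreachable and can react by trying to reconnect, instead of relying on a death watch notification.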

Cheers,
Till

On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <[hidden email]> wrote:

> As a quick update: the "pending review" issues have all been resolved.
>
> The open issues are still open:
>
> - FLINK-4904: Add a limit for how much data may be spilled in
> checkpoint alignments => fix pending
> - FLINK-4910: Introduce safety net for closing file system streams
>
> Any updates here?
>
> – Ufuk
>
>
> On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> <[hidden email]> wrote:
> > Benefit of a backport, as I see it, is increased stability. The danger
> is potentially breaking some code that was casting FileSystems to subtypes
> like LocalFileSystem. I don’t know how common that would be in user code.
> >
> >> On 28.10.2016 at 14:27, Ufuk Celebi <[hidden email]> wrote:
> >>
> >> Thanks for all your feedback.
> >>
> >> If there are no objections, I would like to stick to the mentioned
> >> issues in this thread and create RC1 as soon as they are all
> >> addressed. This will probably not be this week though, but it looks
> >> good for next week.
> >>
> >> DONE
> >> =====
> >> - FLINK-4619: Answer client if savepoint restore fails
> >> - FLINK-4715: Safety net for stuck task cancellation
> >> - FLINK-4510: Always create CheckpointCoordinator
> >> - FLINK-4894: Don't block on buffer request after broadcast event
> >> - FLINK-4298: Add proper repository for Closure dependencies
> >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> determined
> >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
> >> they notice quarantine
> >> - FLINK-4875: Use correct operator name
> >> - FLINK-4913: Include user jars in system class loader
> >>
> >> PENDING REVIEW
> >> ===============
> >> - FLINK-4445: Add option to ignore unmatched state when restoring from
> >> savepoint => https://github.com/apache/flink/pull/2713
> >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> >> => https://github.com/apache/flink/pull/2711
> >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> >>
> >> OPEN
> >> =====
> >> - FLINK-4904: Add a limit for how much data may be spilled in
> >> checkpoint alignments => fix pending
> >> - FLINK-4910: Introduce safety net for closing file system streams =>
> >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> >> backport this or not?
> >>
> >>
> >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <[hidden email]> wrote:
> >>> +1 for this release,
> >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
> >>> Use correct operator name
> >>>
> >>> Dan
> >>>
> >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <[hidden email]>
> wrote:
> >>>
> >>>> I'll work on FLINK-3347. Additionally I would like to get in
> >>>>
> >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> >>>> ExecutionGraph fail when in state Restarting
> >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> >>>> ExecutionGraph.scheduleOrUpdateConsumers
> >>>> can fail the ExecutionGraph
> >>>>
> >>>> Cheers,
> >>>> Till
> >>>>
> >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <[hidden email]>
> wrote:
> >>>>
> >>>>> Concerning backporting the "I/O streams safety net" - we need to make
> >>>> sure
> >>>>> that this does not change any behavior that users may implicitly
> expect.
> >>>>>
> >>>>>
> >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <[hidden email]
> >
> >>>>> wrote:
> >>>>>
> >>>>>> +1 for a 1.1.4 release
> >>>>>>
> >>>>>> We could backport putting user jars into the system class loader for
> >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> >>>>>> Arguably, this is somewhat a new feature but it gets rid of
> duplicate
> >>>>>> class loading issues users experienced in practice.
> >>>>>>
> >>>>>> We already have the following commits on the release-1.1 branch:
> >>>>>>
> >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
> ContinuousEventTimeTrigger
> >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
> driver
> >>>>>> found for jdbc:calcite"
> >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
> >>>> updates
> >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
> >>>>> InputStreamFSInputWrapper
> >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
> >>>>> InputFormats.
> >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of CsvOutputFormat
> >>>> about
> >>>>>> incorrect default of allowNullValues
> >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
> examples.
> >>>>>>
> >>>>>> -Max
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <
> [hidden email]
> >>>>>
> >>>>>> wrote:
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Looking forward to this release!
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <[hidden email]> wrote:
> >>>>>>>> +1 for a bugfix release soon.
> >>>>>>>>
> >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <[hidden email]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for starting this, Ufuk.
> >>>>>>>>>
> >>>>>>>>> I would like to add the following issues to 1.1.4:
> >>>>>>>>>
> >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository for
> >>>>>>>> Closure
> >>>>>>>>> dependencies.
> >>>>>>>>>
> >>>>>>>>> Stability on S3 considering eventual consistency *(fix pending)*
> >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when
> state
> >>>>>>>> size
> >>>>>>>>> cannot be determined
> >>>>>>>>>
> >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need to
> >>>>>>>> restart
> >>>>>>>>> in case they notice quarantine
> >>>>>>>>>
> >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
> >>>>>>>> alignments
> >>>>>>>>> *(fix
> >>>>>>>>> is work in progress)*
> >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data may
> >>>> be
> >>>>>>>>> spilled in checkpoint alignments
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit, the
> >>>>>>>> fourth one
> >>>>>>>>> later today.
> >>>>>>>>> The third one (akka) is still pending.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Stephan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <[hidden email]>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey all,
> >>>>>>>>>>
> >>>>>>>>>> I would like to start the discussion for kicking off the next
> bug
> >>>>>>>> fix
> >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a RC by
> >>>>>>>> end
> >>>>>>>>>> of this week?
> >>>>>>>>>>
> >>>>>>>>>> Users reported some instabilities/inconveniences that would be
> >>>> good
> >>>>>>>> to
> >>>>>>>>> fix.
> >>>>>>>>>>
> >>>>>>>>>> Personally, I would like to backport the following fixes:
> >>>>>>>>>>
> >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
> >>>>> client
> >>>>>>>> if
> >>>>>>>>>> savepoint restore fails (Already merged for master, needs
> minimal
> >>>>>>>>>> adjustment for 1.1)
> >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
> net
> >>>>>>>> for
> >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting
> for
> >>>>>>>>>> tests to finish of backport)
> >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
> >>>>> create
> >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs minimal
> >>>>>>>>>> adjustments for 1.1)
> >>>>>>>>>>
> >>>>>>>>>> Furthermore, I would like to address the following:
> >>>>>>>>>>
> >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
> option
> >>>>> to
> >>>>>>>>>> ignore unmatched state when restoring from savepoint
> >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
> >>>> block
> >>>>>>>> on
> >>>>>>>>>> buffer request after broadcast event
> >>>>>>>>>>
> >>>>>>>>>> Strictly speaking, the (4) is not a bug fix. But given that it
> >>>>>>>> would
> >>>>>>>>>> only add an optional flag to savepoint restoring and should have
> >>>>>>>> been
> >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >
>

Stephan Ewen

Re: [DISCUSS] Releasing Flink 1.1.4

The issue FLINK-4904 (Add a limit for how much data may be spilled in
checkpoint alignments) is done for master and I am currently backporting
it. Hope to finish that this week...

Stephan

On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann <[hidden email]>
wrote:

> It might make sense to backport
>
> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
> side: https://github.com/apache/flink/pull/2742
>
> as well. This will allow us to activate the quarantine monitoring by
> default in 1.1.4 without risking killing all TMs in case of a JM failure.
>
> Cheers,
> Till
>
> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <[hidden email]> wrote:
>
> > As a quick update: the "pending review" issues have all been resolved.
> >
> > The open issues are still open:
> >
> > - FLINK-4904: Add a limit for how much data may be spilled in
> > checkpoint alignments => fix pending
> > - FLINK-4910: Introduce safety net for closing file system streams
> >
> > Any updates here?
> >
> > – Ufuk
> >
> >
> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> > <[hidden email]> wrote:
> > > Benefit of a backport, as I see it, is increased stability. The danger
> > > is potentially breaking some code that was casting FileSystems to subtypes
> > > like LocalFileSystem. I don’t know how common that would be in user code.
> > >
> > >> On 28.10.2016 at 14:27, Ufuk Celebi <[hidden email]> wrote:
> > >>
> > >> Thanks for all your feedback.
> > >>
> > >> If there are no objections, I would like to stick to the mentioned
> > >> issues in this thread and create RC1 as soon as they are all
> > >> addressed. This will probably not be this week though, but it looks
> > >> good for next week.
> > >>
> > >> DONE
> > >> =====
> > >> - FLINK-4619: Answer client if savepoint restore fails
> > >> - FLINK-4715: Safety net for stuck task cancellation
> > >> - FLINK-4510: Always create CheckpointCoordinator
> > >> - FLINK-4894: Don't block on buffer request after broadcast event
> > >> - FLINK-4298: Add proper repository for Closure dependencies
> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> > determined
> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
> > >> they notice quarantine
> > >> - FLINK-4875: Use correct operator name
> > >> - FLINK-4913: Include user jars in system class loader
> > >>
> > >> PENDING REVIEW
> > >> ===============
> > >> - FLINK-4445: Add option to ignore unmatched state when restoring from
> > >> savepoint => https://github.com/apache/flink/pull/2713
> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> > >> => https://github.com/apache/flink/pull/2711
> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> > >>
> > >> OPEN
> > >> =====
> > >> - FLINK-4904: Add a limit for how much data may be spilled in
> > >> checkpoint alignments => fix pending
> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> > >> backport this or not?
> > >>
> > >>
> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <[hidden email]>
> wrote:
> > >>> +1 for this release,
> > >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
> > >>> Use correct operator name
> > >>>
> > >>> Dan
> > >>>
> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <[hidden email]>
> > wrote:
> > >>>
> > >>>> I'll work on FLINK-3347. Additionally I would like to get in
> > >>>>
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> > >>>> ExecutionGraph fail when in state Restarting
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
> > >>>> can fail the ExecutionGraph
> > >>>>
> > >>>> Cheers,
> > >>>> Till
> > >>>>
> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <[hidden email]>
> > wrote:
> > >>>>
> > >>>>> Concerning backporting the "I/O streams safety net" - we need to
> make
> > >>>> sure
> > >>>>> that this does not change any behavior that users may implicitly
> > expect.
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <
> [hidden email]
> > >
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1 for a 1.1.4 release
> > >>>>>>
> > >>>>>> We could backport putting user jars into the system class loader
> for
> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of
> > duplicate
> > >>>>>> class loading issues users experienced in practice.
> > >>>>>>
> > >>>>>> We already have the following commits on the release-1.1 branch:
> > >>>>>>
> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
> > ContinuousEventTimeTrigger
> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
> > driver
> > >>>>>> found for jdbc:calcite"
> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
> > >>>> updates
> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
> > >>>>> InputStreamFSInputWrapper
> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
> > >>>>> InputFormats.
> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of
> CsvOutputFormat
> > >>>> about
> > >>>>>> incorrect default of allowNullValues
> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
> > examples.
> > >>>>>>
> > >>>>>> -Max
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <
> > [hidden email]
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>> +1
> > >>>>>>>
> > >>>>>>> Looking forward to this release!
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <[hidden email]> wrote:
> > >>>>>>>> +1 for a bugfix release soon.
> > >>>>>>>>
> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <
> [hidden email]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks for starting this, Ufuk.
> > >>>>>>>>>
> > >>>>>>>>> I would like to add the following issues to 1.1.4:
> > >>>>>>>>>
> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> > >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository
> for
> > >>>>>>>> Closure
> > >>>>>>>>> dependencies.
> > >>>>>>>>>
> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix
> pending)*
> > >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when
> > state
> > >>>>>>>> size
> > >>>>>>>>> cannot be determined
> > >>>>>>>>>
> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> > >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need
> to
> > >>>>>>>> restart
> > >>>>>>>>> in case they notice quarantine
> > >>>>>>>>>
> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
> > >>>>>>>> alignments
> > >>>>>>>>> *(fix
> > >>>>>>>>> is work in progress)*
> > >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data
> may
> > >>>> be
> > >>>>>>>>> spilled in checkpoint alignments
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit,
> the
> > >>>>>>>> fourth one
> > >>>>>>>>> later today.
> > >>>>>>>>> The third one (akka) is still pending.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Stephan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <[hidden email]>
> > >>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hey all,
> > >>>>>>>>>>
> > >>>>>>>>>> I would like to start the discussion for kicking off the next
> > bug
> > >>>>>>>> fix
> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a RC
> by
> > >>>>>>>> end
> > >>>>>>>>>> of this week?
> > >>>>>>>>>>
> > >>>>>>>>>> Users reported some instabilities/inconveniences that would be
> > >>>> good
> > >>>>>>>> to
> > >>>>>>>>> fix.
> > >>>>>>>>>>
> > >>>>>>>>>> Personally, I would like to backport the following fixes:
> > >>>>>>>>>>
> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
> > >>>>> client
> > >>>>>>>> if
> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs
> > minimal
> > >>>>>>>>>> adjustment for 1.1)
> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
> > net
> > >>>>>>>> for
> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting
> > for
> > >>>>>>>>>> tests to finish of backport)
> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
> > >>>>> create
> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs
> minimal
> > >>>>>>>>>> adjustments for 1.1)
> > >>>>>>>>>>
> > >>>>>>>>>> Furthermore, I would like to address the following:
> > >>>>>>>>>>
> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
> > option
> > >>>>> to
> > >>>>>>>>>> ignore unmatched state when restoring from savepoint
> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
> > >>>> block
> > >>>>>>>> on
> > >>>>>>>>>> buffer request after broadcast event
> > >>>>>>>>>>
> > >>>>>>>>>> Strictly speaking, the (4) is not a bug fix. But given that it
> > >>>>>>>> would
> > >>>>>>>>>> only add an optional flag to savepoint restoring and should
> have
> > >>>>>>>> been
> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >
> >
>

Stephan Ewen

Re: [DISCUSS] Releasing Flink 1.1.4

I opened a pull request for the backport of [FLINK-4904]
<https://issues.apache.org/jira/browse/FLINK-4904>

https://github.com/apache/flink/pull/2773

On Tue, Nov 8, 2016 at 2:00 PM, Stephan Ewen <[hidden email]> wrote:

> The issue FLINK-4904 (Add a limit for how much data may be spilled in
> checkpoint alignments) is done for master and I am currently backporting
> it. Hope to finish that this week...
>
> Stephan
>
>
> On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann <[hidden email]>
> wrote:
>
>> It might make sense to backport
>>
>> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
>> side: https://github.com/apache/flink/pull/2742
>>
>> as well. This will allow us to activate the quarantine monitoring by
>> default in 1.1.4 without risking killing all TMs in case of a JM failure.
>>
>> Cheers,
>> Till
>>
>> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <[hidden email]> wrote:
>>
>> > As a quick update: the "pending review" issues have all been resolved.
>> >
>> > The open issues are still open:
>> >
>> > - FLINK-4904: Add a limit for how much data may be spilled in
>> > checkpoint alignments => fix pending
>> > - FLINK-4910: Introduce safety net for closing file system streams
>> >
>> > Any updates here?
>> >
>> > – Ufuk
>> >
>> >
>> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
>> > <[hidden email]> wrote:
>> > > Benefit of a backport, as I see it, is increased stability. The danger
>> > > is potentially breaking some code that was casting FileSystems to subtypes
>> > > like LocalFileSystem. I don’t know how common that would be in user code.
>> > >
>> > >> On 28.10.2016 at 14:27, Ufuk Celebi <[hidden email]> wrote:
>> > >>
>> > >> Thanks for all your feedback.
>> > >>
>> > >> If there are no objections, I would like to stick to the mentioned
>> > >> issues in this thread and create RC1 as soon as they are all
>> > >> addressed. This will probably not be this week though, but it looks
>> > >> good for next week.
>> > >>
>> > >> DONE
>> > >> =====
>> > >> - FLINK-4619: Answer client if savepoint restore fails
>> > >> - FLINK-4715: Safety net for stuck task cancellation
>> > >> - FLINK-4510: Always create CheckpointCoordinator
>> > >> - FLINK-4894: Don't block on buffer request after broadcast event
>> > >> - FLINK-4298: Add proper repository for Closure dependencies
>> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
>> > determined
>> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in
>> case
>> > >> they notice quarantine
>> > >> - FLINK-4875: Use correct operator name
>> > >> - FLINK-4913: Include user jars in system class loader
>> > >>
>> > >> PENDING REVIEW
>> > >> ===============
>> > >> - FLINK-4445: Add option to ignore unmatched state when restoring
>> from
>> > >> savepoint => https://github.com/apache/flink/pull/2713
>> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
>> > >> => https://github.com/apache/flink/pull/2711
>> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
>> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
>> > >>
>> > >> OPEN
>> > >> =====
>> > >> - FLINK-4904: Add a limit for how much data may be spilled in
>> > >> checkpoint alignments => fix pending
>> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
>> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
>> > >> backport this or not?
>> > >>
>> > >>
>> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <[hidden email]>
>> wrote:
>> > >>> +1 for this release,
>> > >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
>> > >>> Use correct operator name
>> > >>>
>> > >>> Dan
>> > >>>
>> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <[hidden email]> wrote:
>> > >>>
>> > >>>> I'll work on FLINK-3347. Additionally I would like to get in
>> > >>>>
>> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
>> > >>>> ExecutionGraph fail when in state Restarting
>> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
>> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
>> > >>>> can fail the ExecutionGraph
>> > >>>>
>> > >>>> Cheers,
>> > >>>> Till
>> > >>>>
>> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <[hidden email]>
>> > wrote:
>> > >>>>
>> > >>>>> Concerning backporting the "I/O streams safety net" - we need to
>> make
>> > >>>> sure
>> > >>>>> that this does not change any behavior that users may implicitly
>> > expect.
>> > >>>>>
>> > >>>>>
>> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <
>> [hidden email]
>> > >
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>>> +1 for a 1.1.4 release
>> > >>>>>>
>> > >>>>>> We could backport putting user jars into the system class loader
>> for
>> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
>> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of
>> > duplicate
>> > >>>>>> class loading issues users experienced in practice.
>> > >>>>>>
>> > >>>>>> We already have the following commits on the release-1.1 branch:
>> > >>>>>>
>> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
>> > ContinuousEventTimeTrigger
>> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
>> > driver
>> > >>>>>> found for jdbc:calcite"
>> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
>> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
>> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
>> > >>>> updates
>> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
>> > >>>>> InputStreamFSInputWrapper
>> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
>> > >>>>> InputFormats.
>> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of
>> CsvOutputFormat
>> > >>>> about
>> > >>>>>> incorrect default of allowNullValues
>> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
>> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
>> > examples.
>> > >>>>>>
>> > >>>>>> -Max
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <
>> > [hidden email]
>> > >>>>>
>> > >>>>>> wrote:
>> > >>>>>>> +1
>> > >>>>>>>
>> > >>>>>>> Looking forward to this release!
>> > >>>>>>>
>> > >>>>>>> Regards
>> > >>>>>>> JB
>> > >>>>>>>
>> > >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <[hidden email]> wrote:
>> > >>>>>>>> +1 for a bugfix release soon.
>> > >>>>>>>>
>> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <
>> [hidden email]>
>> > >>>>>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> Thanks for starting this, Ufuk.
>> > >>>>>>>>>
>> > >>>>>>>>> I would like to add the following issues to 1.1.4:
>> > >>>>>>>>>
>> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
>> > >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository
>> for
>> > >>>>>>>> Closure
>> > >>>>>>>>> dependencies.
>> > >>>>>>>>>
>> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix
>> pending)*
>> > >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when
>> > state
>> > >>>>>>>> size
>> > >>>>>>>>> cannot be determined
>> > >>>>>>>>>
>> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
>> > >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem)
>> need to
>> > >>>>>>>> restart
>> > >>>>>>>>> in case they notice quarantine
>> > >>>>>>>>>
>> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
>> > >>>>>>>> alignments
>> > >>>>>>>>> *(fix
>> > >>>>>>>>> is work in progress)*
>> > >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data
>> may
>> > >>>> be
>> > >>>>>>>>> spilled in checkpoint alignments
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit,
>> the
>> > >>>>>>>> fourth one
>> > >>>>>>>>> later today.
>> > >>>>>>>>> The third one (akka) is still pending.
>> > >>>>>>>>>
>> > >>>>>>>>> Best,
>> > >>>>>>>>> Stephan
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <[hidden email]>
>> > >>>> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>>> Hey all,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I would like to start the discussion for kicking off the next
>> > bug
>> > >>>>>>>> fix
>> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a
>> RC by
>> > >>>>>>>> end
>> > >>>>>>>>>> of this week?
>> > >>>>>>>>>>
>> > >>>>>>>>>> Users reported some instabilities/inconveniences that would
>> be
>> > >>>> good
>> > >>>>>>>> to
>> > >>>>>>>>> fix.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Personally, I would like to backport the following fixes:
>> > >>>>>>>>>>
>> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
>> > >>>>> client
>> > >>>>>>>> if
>> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs
>> > minimal
>> > >>>>>>>>>> adjustment for 1.1)
>> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
>> > net
>> > >>>>>>>> for
>> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting
>> > for
>> > >>>>>>>>>> tests to finish of backport)
>> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
>> > >>>>> create
>> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs
>> minimal
>> > >>>>>>>>>> adjustments for 1.1)
>> > >>>>>>>>>>
>> > >>>>>>>>>> Furthermore, I would like to address the following:
>> > >>>>>>>>>>
>> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
>> > option
>> > >>>>> to
>> > >>>>>>>>>> ignore unmatched state when restoring from savepoint
>> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
>> > >>>> block
>> > >>>>>>>> on
>> > >>>>>>>>>> buffer request after broadcast event
>> > >>>>>>>>>>
>> > >>>>>>>>>> Strictly speaking, the (4) is not a bug fix. But given that
>> it
>> > >>>>>>>> would
>> > >>>>>>>>>> only add an optional flag to savepoint restoring and should
>> have
>> > >>>>>>>> been
>> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
>> > >>>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>
>> > >
>> >
>>
>
>

Ufuk Celebi-2

Re: [DISCUSS] Releasing Flink 1.1.4

The last fixes are finally in. Thanks to everyone who participated in the discussion.

I will now create the release artifacts and start the vote tomorrow (CET).

– Ufuk

On 8 November 2016 at 19:02:46, Stephan Ewen ([hidden email]) wrote:

> I opened a pull request for the backport of [FLINK-4904]
>
>
> https://github.com/apache/flink/pull/2773
>
>
> On Tue, Nov 8, 2016 at 2:00 PM, Stephan Ewen wrote:
>
> > The issue FLINK-4904 (Add a limit for how much data may be spilled in
> > checkpoint alignments) is done for master and I am currently backporting
> > it. Hope to finish that this week...
> >
> > Stephan
> >
> >
> > On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann
> > wrote:
> >
> >> It might make sense to backport
> >>
> >> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
> >> side: https://github.com/apache/flink/pull/2742
> >>
> >> as well. This will allow us to activate the quarantine monitoring by
> >> default in 1.1.4 without risking killing all TMs in case of a JM failure.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi wrote:
> >>
> >> > As a quick update: the "pending review" issues have all been resolved.
> >> >
> >> > The open issues are still open:
> >> >
> >> > - FLINK-4904: Add a limit for how much data may be spilled in
> >> > checkpoint alignments => fix pending
> >> > - FLINK-4910: Introduce safety net for closing file system streams
> >> >
> >> > Any updates here?
> >> >
> >> > – Ufuk
> >> >
> >> >
> >> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> >> > wrote:
> >> > > Benefit of a backport, as I see it, is increased stability. The danger
> >> > > is potentially breaking some code that was casting FileSystems to subtypes
> >> > > like LocalFileSystem. I don’t know how common that would be in user code.
> >> > >
> >> > >> On 28.10.2016 at 14:27, Ufuk Celebi wrote:
> >> > >>
> >> > >> Thanks for all your feedback.
> >> > >>
> >> > >> If there are no objections, I would like to stick to the mentioned
> >> > >> issues in this thread and create RC1 as soon as they are all
> >> > >> addressed. This will probably not be this week though, but it looks
> >> > >> good for next week.
> >> > >>
> >> > >> DONE
> >> > >> =====
> >> > >> - FLINK-4619: Answer client if savepoint restore fails
> >> > >> - FLINK-4715: Safety net for stuck task cancellation
> >> > >> - FLINK-4510: Always create CheckpointCoordinator
> >> > >> - FLINK-4894: Don't block on buffer request after broadcast event
> >> > >> - FLINK-4298: Add proper repository for Closure dependencies
> >> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> >> > determined
> >> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in
> >> case
> >> > >> they notice quarantine
> >> > >> - FLINK-4875: Use correct operator name
> >> > >> - FLINK-4913: Include user jars in system class loader
> >> > >>
> >> > >> PENDING REVIEW
> >> > >> ===============
> >> > >> - FLINK-4445: Add option to ignore unmatched state when restoring
> >> from
> >> > >> savepoint => https://github.com/apache/flink/pull/2713
> >> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> >> > >> => https://github.com/apache/flink/pull/2711
> >> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> >> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> >> > >>
> >> > >> OPEN
> >> > >> =====
> >> > >> - FLINK-4904: Add a limit for how much data may be spilled in
> >> > >> checkpoint alignments => fix pending
> >> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
> >> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> >> > >> backport this or not?
> >> > >>
> >> > >>
> >> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress
> >> wrote:
> >> > >>> +1 for this release,
> >> > >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
> >> > >>> Use correct operator name
> >> > >>>
> >> > >>> Dan
> >> > >>>
> >> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann wrote:
> >> > >>>
> >> > >>>> I'll work on FLINK-3347. Additionally I would like to get in
> >> > >>>>
> >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> >> > >>>> ExecutionGraph fail when in state Restarting
> >> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> >> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
> >> > >>>> can fail the ExecutionGraph
> >> > >>>>
> >> > >>>> Cheers,
> >> > >>>> Till
> >> > >>>>
> >> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen
> >> > wrote:
> >> > >>>>
> >> > >>>>> Concerning backporting the "I/O streams safety net" - we need to
> >> make
> >> > >>>> sure
> >> > >>>>> that this does not change any behavior that users may implicitly
> >> > expect.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <
> >> [hidden email]
> >> > >
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> +1 for a 1.1.4 release
> >> > >>>>>>
> >> > >>>>>> We could backport putting user jars into the system class loader
> >> for
> >> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> >> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of
> >> > duplicate
> >> > >>>>>> class loading issues users experienced in practice.
> >> > >>>>>>
> >> > >>>>>> We already have the following commits on the release-1.1 branch:
> >> > >>>>>>
> >> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in
> >> > ContinuousEventTimeTrigger
> >> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable
> >> > driver
> >> > >>>>>> found for jdbc:calcite"
> >> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> >> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> >> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent
> >> > >>>> updates
> >> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in
> >> > >>>>> InputStreamFSInputWrapper
> >> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for
> >> > >>>>> InputFormats.
> >> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of
> >> CsvOutputFormat
> >> > >>>> about
> >> > >>>>>> incorrect default of allowNullValues
> >> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> >> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI
> >> > examples.
> >> > >>>>>>
> >> > >>>>>> -Max
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <
> >> > [hidden email]
> >> > >>>>>
> >> > >>>>>> wrote:
> >> > >>>>>>> +1
> >> > >>>>>>>
> >> > >>>>>>> Looking forward to this release!
> >> > >>>>>>>
> >> > >>>>>>> Regards
> >> > >>>>>>> JB
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <
> >> > >>>> [hidden email]>
> >> > >>>>>> wrote:
> >> > >>>>>>>> +1 for a bugfix release soon.
> >> > >>>>>>>>
> >> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <
> >> [hidden email]>
> >> > >>>>>>>> wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Thanks for starting this, Ufuk.
> >> > >>>>>>>>>
> >> > >>>>>>>>> I would like to add the following issues to 1.1.4:
> >> > >>>>>>>>>
> >> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> >> > >>>>>>>>> - [FLINK-4298] [storm compatibility] Add proper repository
> >> for
> >> > >>>>>>>> Closure
> >> > >>>>>>>>> dependencies.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix
> >> pending)*
> >> > >>>>>>>>> - [FLINK-4218] [checkpoints] Do not fail checkpoints when
> >> > state
> >> > >>>>>>>> size
> >> > >>>>>>>>> cannot be determined
> >> > >>>>>>>>>
> >> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> >> > >>>>>>>>> - [FLINK-3347] [akka] TaskManager (or its ActorSystem)
> >> need to
> >> > >>>>>>>> restart
> >> > >>>>>>>>> in case they notice quarantine
> >> > >>>>>>>>>
> >> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint
> >> > >>>>>>>> alignments
> >> > >>>>>>>>> *(fix
> >> > >>>>>>>>> is work in progress)*
> >> > >>>>>>>>> - [FLINK-4904] [checkpoints] Add a limit for how much data
> >> may
> >> > >>>> be
> >> > >>>>>>>>> spilled in checkpoint alignments
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit,
> >> the
> >> > >>>>>>>> fourth one
> >> > >>>>>>>>> later today.
> >> > >>>>>>>>> The third one (akka) is still pending.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Best,
> >> > >>>>>>>>> Stephan
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi
> >> > >>>> wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hey all,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I would like to start the discussion for kicking off the next
> >> > bug
> >> > >>>>>>>> fix
> >> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for a
> >> RC by
> >> > >>>>>>>> end
> >> > >>>>>>>>>> of this week?
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Users reported some instabilities/inconveniences that would
> >> be
> >> > >>>> good
> >> > >>>>>>>> to
> >> > >>>>>>>>> fix.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Personally, I would like to backport the following fixes:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer
> >> > >>>>> client
> >> > >>>>>>>> if
> >> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs
> >> > minimal
> >> > >>>>>>>>>> adjustment for 1.1)
> >> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety
> >> > net
> >> > >>>>>>>> for
> >> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting
> >> > for
> >> > >>>>>>>>>> tests of the backport to finish)
> >> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always
> >> > >>>>> create
> >> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs
> >> minimal
> >> > >>>>>>>>>> adjustments for 1.1)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Furthermore, I would like to address the following:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add
> >> > option
> >> > >>>>> to
> >> > >>>>>>>>>> ignore unmatched state when restoring from savepoint
> >> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't
> >> > >>>> block
> >> > >>>>>>>> on
> >> > >>>>>>>>>> buffer request after broadcast event
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Strictly speaking, (4) is not a bug fix. But given that
> >> it
> >> > >>>>>>>> would
> >> > >>>>>>>>>> only add an optional flag to savepoint restoring and should
> >> have
> >> > >>>>>>>> been
> >> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.
> >> > >>>>>>>>>>