(DEPRECATED) Apache Flink Mailing List archive.

Master test stability poor

Till Rohrmann

Master test stability poor

Hi Flink community,

I just wanted to raise awareness that in the last 16 days there was just a
single Travis build of master which passed all tests. This indicates that
we have some serious problems with our test stability or even worse a
problem with the master itself. Having an unstable master makes it really
hard to assess whether new changes actually broke something or whether the
failing test was unrelated.

We currently have 37 open issues labeled test-stability, and most of
them have critical priority. Therefore, I propose that we try to
tackle them as soon as possible in order to improve our test stability.

Cheers,
Till

Ufuk Celebi-2

Re: Master test stability poor

Hi Till,

thank you for bringing this up. We really need to fix this.

Filing JIRAs with critical priority was how we tried to solve it in
the past, but obviously it did not work. There seems to be a mismatch
between assigned and actual priorities.

As a first step, I would volunteer to gather a list of tests that
have failed in recent weeks and make sure that we have JIRAs for
them.

As a next step, we should coordinate how to resolve those issues
(maybe prioritized by failure frequency) to get master stable again.

– Ufuk
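
A minimal sketch of the prioritization idea above (ranking failing tests by failure frequency), assuming the failed test names have already been collected per build. The function name and the test names in the example are hypothetical and do not come from the Flink build.

```python
from collections import Counter


def rank_by_failure_frequency(build_failures):
    """Rank tests by the number of builds in which they failed.

    build_failures: iterable of lists of failed test names, one list per build.
    Returns (test_name, failing_build_count) pairs, most frequent first.
    """
    counts = Counter()
    for failed_tests in build_failures:
        # Count each test at most once per build, even if it is listed twice.
        counts.update(set(failed_tests))
    # Sort by descending count; break ties alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical input: failed tests of three builds.
    builds = [
        ["ExampleITCase", "OtherITCase"],
        ["ExampleITCase"],
        ["OtherITCase", "ExampleITCase", "ExampleITCase"],
    ]
    for name, count in rank_by_failure_frequency(builds):
        print(f"{count} failing builds: {name}")
```

Counting per build rather than per log line keeps a test that is reported several times in one build from being overweighted.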

In reply to Till Rohrmann's post of Wed, Apr 27, 2016 at 12:12 PM.

Greg Hogan

Re: Master test stability poor

We have also started running over Travis' 2-hour limit for the longest build.

Greg

In reply to Ufuk Celebi's post of Apr 27, 2016 at 7:53 AM.

Flavio Pompermaier

Re: Master test stability poor

We just opened a PR for this (FLINK-1827 -
https://github.com/apache/flink/pull/1915) that improves test stability
(and allows skipping test compilation entirely when it is not required),
except for the ml library, which still has one error to solve (in the
hadoop-1 build and in the ml-library). I think that should not be
difficult to fix; it is probably caused by a missing compile dependency
that was introduced by hadoop2.

In reply to Greg Hogan's post of Wed, Apr 27, 2016 at 2:01 PM.

Ufuk Celebi-2

Re: Master test stability poor

In reply to this post by Greg Hogan

Along the lines of what Greg already mentioned, I would like to
reiterate that Travis is often a problem too:
- long build times and we are reaching the time limit
- unreliable I/O
- unreliable resolving of build dependencies

@Max: I think you wanted to look into whether we can use Apache's
Jenkins server for our builds instead of Travis. Did you ever get
around to looking into it? If yes: What's your opinion on replacing
Travis with Jenkins? Is it a viable option? Would it improve the
Travis-specific problems?

On the other hand, the very slow Travis machines have also helped us
discover some hard-to-catch race conditions.

– Ufuk

Robert Metzger

Re: Master test stability poor

I'm not sure the issue is as big as it seems at first sight.
The reason why all the builds of master are red on Travis is that the cache
of the 5th build is invalid. We have to ask infra to delete the caches and
then they'll be green again.

In reply to Ufuk Celebi's post of Wed, Apr 27, 2016 at 2:13 PM.

Till Rohrmann

Re: Master test stability poor

It is good to hear that we can solve most of the failing builds so
easily. We should then iterate over the open test-stability issues to see
whether they are still valid after we've merged PR 1915.

In reply to Robert Metzger's post of Wed, Apr 27, 2016 at 2:25 PM.

Ufuk Celebi-2

Re: Master test stability poor

Filed an issue with INFRA: https://issues.apache.org/jira/browse/INFRA-11773

@Robert: I agree, but still we see failing builds over and over again.
At best it is annoying, at worst it "hides" new bugs being introduced.

In reply to Till Rohrmann's post of Wed, Apr 27, 2016 at 2:41 PM.

mxm

Re: Master test stability poor

+1 for making an effort to tackle test stability problems and
any bugs potentially involved.

On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
> @Max: I think you wanted to look into whether we can use Apache's
> Jenkins server for our builds instead of Travis. Did you ever get
> around at looking into it? If yes: What's your opinion on replacing
> Travis with Jenkins? Is it a viable option? Would it improve the
> Travis-specific problems?

I've experimented with the ASF Jenkins installation while setting up
our nightly snapshot builds. I've observed that the build servers are
pretty busy. I don't know how busy they are compared to the Travis
servers and whether we could have more stable builds using Jenkins. I
guess we would have to try over a period of time.

I was hesitant to enable Jenkins for pull requests because I didn't
want to spam the ASF servers with builds. Also, there are some
remaining steps for a good integration like making the Yarn logs
available (not hard to do though).

What do you think about enabling Jenkins builds for the master and
seeing how that goes?

In reply to Ufuk Celebi's post of Wed, Apr 27, 2016 at 2:54 PM.

Ufuk Celebi-2

Re: Master test stability poor

Caches have been cleared again (see
https://issues.apache.org/jira/browse/INFRA-11773). The first time did
not help. This second request was more of an act of desperation. :-(
Let's see what happens now.

In reply to Maximilian Michels' post of Wed, Apr 27, 2016 at 3:24 PM.

Chesnay Schepler-3

Re: Master test stability poor

If this doesn't work we may want to think about disabling the
problematic profile temporarily.

On 23.05.2016 09:53, Ufuk Celebi wrote:

> Caches have been cleared again (see
> https://issues.apache.org/jira/browse/INFRA-11773) The first time did
> not help. This second request was more an act of desparation. :-(
> Let's see what happens now.
>
> On Wed, Apr 27, 2016 at 3:24 PM, Maximilian Michels <[hidden email]> wrote:
>> +1 for making an effort to tackle test stability problems and
>> potential involved bugs.
>>
>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>> @Max: I think you wanted to look into whether we can use Apache's
>>> Jenkins server for our builds instead of Travis. Did you ever get
>>> around at looking into it? If yes: What's your opinion on replacing
>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>> Travis-specific problems?
>> I've experimented with the ASF Jenkins installation while setting up
>> our nightly snapshot builds. I've observed that the build servers are
>> pretty busy. I don't know how busy they are compared to the Travis
>> servers and whether we could have more stable builds using Jenkins. I
>> guess we would have to try over a period of time.
>>
>> I was hesitant to enable Jenkins for pull requests because I didn't
>> want to spam the ASF servers with builds. Also, there are some
>> remaining steps for a good integration like making the Yarn logs
>> available (not hard to do though).
>>
>> What do you think about enabling Jenkins builds for the master and see
>> how that goes?
>>
>> On Wed, Apr 27, 2016 at 2:54 PM, Ufuk Celebi <[hidden email]> wrote:
>>> Filed an issue with INFRA: https://issues.apache.org/jira/browse/INFRA-11773
>>>
>>> @Robert: I agree, but still we see failing builds over and over again.
>>> At best it is annoying, at worst it "hides" new bugs being introduced.
>>>
>>> On Wed, Apr 27, 2016 at 2:41 PM, Till Rohrmann <[hidden email]> wrote:
>>>> That is good to hear that we can so easily solve most of the failing
>>>> builds. We should then iterate over the open test-stability issues to see
>>>> whether they are still valid after we've merged PR 1915.
>>>>
>>>> On Wed, Apr 27, 2016 at 2:25 PM, Robert Metzger <[hidden email]> wrote:
>>>>
>>>>> I'm not sure if the issues is as big as it seems on a first sight.
>>>>> The reason why all the builds of master are red on travis is that the cache
>>>>> of the 5th build is invalid. We have to ask infra to delete the caches and
>>>>> then they'll be green again.
>>>>>
>>>>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>>>>
>>>>>> Along the lines of what Greg already mentioned, I would like to
>>>>>> re-iterate that Travis is often a problem too:
>>>>>> - long build times and we are reaching the time limit
>>>>>> - unreliable I/O
>>>>>> - unreliable resolving of build dependencies
>>>>>>
>>>>>> @Max: I think you wanted to look into whether we can use Apache's
>>>>>> Jenkins server for our builds instead of Travis. Did you ever get
>>>>>> around at looking into it? If yes: What's your opinion on replacing
>>>>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>>>>> Travis-specific problems?
>>>>>>
>>>>>> On the other hand, the very slow Travis machines also helped
>>>>>> discovering some hard-to-catch race conditions.
>>>>>>
>>>>>> – Ufuk
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 27, 2016 at 2:01 PM, Greg Hogan <[hidden email]> wrote:
>>>>>>> We have also started running over Travis' 2 hour limit for the longest
>>>>>> build.
>>>>>>> Greg
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 27, 2016, at 7:53 AM, Ufuk Celebi <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> Hi Till,
>>>>>>>>
>>>>>>>> thank you for bringing this up. We really need to fix this.
>>>>>>>>
>>>>>>>> Filing JIRAs with critical priority was how we tried to solve it in
>>>>>>>> the past, but obviously it did not work. There seems to be a mismatch
>>>>>>>> between assigned and actual priorities.
>>>>>>>>
>>>>>>>> As a first step, I would volunteer to gather a list of tests, which
>>>>>>>> have failed in the last weeks and make sure that we have JIRAs for
>>>>>>>> them.
>>>>>>>>
>>>>>>>> As a next step, we should coordinate how to resolve those issues
>>>>>>>> (maybe prioritized by failure frequency) to get master stable again.
>>>>>>>>
>>>>>>>> – Ufuk
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Wed, Apr 27, 2016 at 12:12 PM, Till Rohrmann <[hidden email]> wrote:
>>>>>>>>> Hi Flink community,
>>>>>>>>>
>>>>>>>>> I just wanted to raise awareness that in the last 16 days there was just a
>>>>>>>>> single Travis build of master which passed all tests. This indicates that
>>>>>>>>> we have some serious problems with our test stability or even worse a
>>>>>>>>> problem with the master itself. Having an unstable master makes it really
>>>>>>>>> hard to assess whether new changes actually broke something or whether the
>>>>>>>>> failing test was unrelated.
>>>>>>>>>
>>>>>>>>> We have currently 37 open issues labeled with test-stability and most of
>>>>>>>>> them have a critical priority. Therefore, I would propose that we try to
>>>>>>>>> tackle them as soon as possible in order to improve our testing stability.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till

Robert Metzger

Re: Master test stability poor

We could also try to disable the caching of the .m2 directory (I suspect
that it contains broken jar files). The problem is that this will make
the builds slower on Travis because we need to download more.
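
One way to check the suspicion about broken jar files before giving up the cache would be a small scan along these lines. This is only a sketch under assumptions: the function name `find_broken_jars` is made up for illustration, and `~/.m2/repository` is assumed to be the local Maven repository location; neither comes from the Flink build setup. It relies on jar files being ZIP archives and uses Python's standard `zipfile` integrity check.

```python
import zipfile
import zlib
from pathlib import Path


def find_broken_jars(repo_dir):
    """Return paths of .jar files under repo_dir that fail a ZIP integrity check."""
    broken = []
    for jar in sorted(Path(repo_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                # testzip() returns the name of the first corrupt member, or None.
                if archive.testzip() is not None:
                    broken.append(jar)
        except (zipfile.BadZipFile, zlib.error, OSError):
            # The file is not a readable ZIP archive (e.g. a truncated download).
            broken.append(jar)
    return broken


if __name__ == "__main__":
    # Assumed default location of the local Maven repository.
    for path in find_broken_jars(Path.home() / ".m2" / "repository"):
        print(path)
```

If such a scan reported any files, deleting just those artifacts would let Maven download them again on the next build, which might be cheaper than disabling the cache entirely.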

On Mon, May 23, 2016 at 10:18 AM, Chesnay Schepler <[hidden email]>
wrote:

> If this doesn't work we may want to think about disabling the problematic
> profile temporarily.
>
>
> On 23.05.2016 09:53, Ufuk Celebi wrote:
>
>> Caches have been cleared again (see
>> https://issues.apache.org/jira/browse/INFRA-11773) The first time did
>> not help. This second request was more an act of desperation. :-(
>> Let's see what happens now.
>>
>> On Wed, Apr 27, 2016 at 3:24 PM, Maximilian Michels <[hidden email]>
>> wrote:
>>
>>> +1 for making an effort to tackle test stability problems and
>>> potential involved bugs.
>>>
>>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>>
>>>> @Max: I think you wanted to look into whether we can use Apache's
>>>> Jenkins server for our builds instead of Travis. Did you ever get
>>>> around at looking into it? If yes: What's your opinion on replacing
>>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>>> Travis-specific problems?
>>>>
>>> I've experimented with the ASF Jenkins installation while setting up
>>> our nightly snapshot builds. I've observed that the build servers are
>>> pretty busy. I don't know how busy they are compared to the Travis
>>> servers and whether we could have more stable builds using Jenkins. I
>>> guess we would have to try over a period of time.
>>>
>>> I was hesitant to enable Jenkins for pull requests because I didn't
>>> want to spam the ASF servers with builds. Also, there are some
>>> remaining steps for a good integration like making the Yarn logs
>>> available (not hard to do though).
>>>
>>> What do you think about enabling Jenkins builds for the master and see
>>> how that goes?
>>>
>>> On Wed, Apr 27, 2016 at 2:54 PM, Ufuk Celebi <[hidden email]> wrote:
>>>
>>>> Filed an issue with INFRA:
>>>> https://issues.apache.org/jira/browse/INFRA-11773
>>>>
>>>> @Robert: I agree, but still we see failing builds over and over again.
>>>> At best it is annoying, at worst it "hides" new bugs being introduced.
>>>>
>>>> On Wed, Apr 27, 2016 at 2:41 PM, Till Rohrmann <[hidden email]>
>>>> wrote:
>>>>
>>>>> That is good to hear that we can so easily solve most of the failing
>>>>> builds. We should then iterate over the open test-stability issues to
>>>>> see
>>>>> whether they are still valid after we've merged PR 1915.
>>>>>
>>>>> On Wed, Apr 27, 2016 at 2:25 PM, Robert Metzger <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> I'm not sure if the issues is as big as it seems on a first sight.
>>>>>> The reason why all the builds of master are red on travis is that the
>>>>>> cache
>>>>>> of the 5th build is invalid. We have to ask infra to delete the
>>>>>> caches and
>>>>>> then they'll be green again.
>>>>>>
>>>>>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>>>>>
>>>>>>> Along the lines of what Greg already mentioned, I would like to
>>>>>>> re-iterate that Travis is often a problem too:
>>>>>>> - long build times and we are reaching the time limit
>>>>>>> - unreliable I/O
>>>>>>> - unreliable resolving of build dependencies
>>>>>>>
>>>>>>> @Max: I think you wanted to look into whether we can use Apache's
>>>>>>> Jenkins server for our builds instead of Travis. Did you ever get
>>>>>>> around at looking into it? If yes: What's your opinion on replacing
>>>>>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>>>>>> Travis-specific problems?
>>>>>>>
>>>>>>> On the other hand, the very slow Travis machines also helped
>>>>>>> discovering some hard-to-catch race conditions.
>>>>>>>
>>>>>>> – Ufuk
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 27, 2016 at 2:01 PM, Greg Hogan <[hidden email]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We have also started running over Travis' 2 hour limit for the
>>>>>>>> longest build.
>>>>>>>> Greg
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 27, 2016, at 7:53 AM, Ufuk Celebi <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Till,
>>>>>>>>>
>>>>>>>>> thank you for bringing this up. We really need to fix this.
>>>>>>>>>
>>>>>>>>> Filing JIRAs with critical priority was how we tried to solve it in
>>>>>>>>> the past, but obviously it did not work. There seems to be a
>>>>>>>>> mismatch
>>>>>>>>> between assigned and actual priorities.
>>>>>>>>>
>>>>>>>>> As a first step, I would volunteer to gather a list of tests, which
>>>>>>>>> have failed in the last weeks and make sure that we have JIRAs for
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> As a next step, we should coordinate how to resolve those issues
>>>>>>>>> (maybe prioritized by failure frequency) to get master stable
>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>> – Ufuk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Apr 27, 2016 at 12:12 PM, Till Rohrmann <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Flink community,
>>>>>>>>>>
>>>>>>>>>> I just wanted to raise awareness that in the last 16 days there was just a
>>>>>>>>>> single Travis build of master which passed all tests. This indicates that
>>>>>>>>>> we have some serious problems with our test stability or even worse a
>>>>>>>>>> problem with the master itself. Having an unstable master makes it really
>>>>>>>>>> hard to assess whether new changes actually broke something or whether the
>>>>>>>>>> failing test was unrelated.
>>>>>>>>>>
>>>>>>>>>> We have currently 37 open issues labeled with test-stability and most of
>>>>>>>>>> them have a critical priority. Therefore, I would propose that we try to
>>>>>>>>>> tackle them as soon as possible in order to improve our testing stability.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>

Chesnay Schepler-3

Re: Master test stability poor

If we disable caching, let it run for one build, and enable it again, will
that effectively clear the .m2 cache?

On 23.05.2016 12:00, Robert Metzger wrote:

> We could also try to disable the caching of the .m2 directory (I suspect
> that it contains broken jar files). The problem is that this will make
> the builds slower on Travis because we need to download more.
>
> On Mon, May 23, 2016 at 10:18 AM, Chesnay Schepler <[hidden email]>
> wrote:
>
>> If this doesn't work we may want to think about disabling the problematic
>> profile temporarily.
>>
>>
>> On 23.05.2016 09:53, Ufuk Celebi wrote:
>>
>>> Caches have been cleared again (see
>>> https://issues.apache.org/jira/browse/INFRA-11773) The first time did
>>> not help. This second request was more an act of desperation. :-(
>>> Let's see what happens now.
>>>
>>> On Wed, Apr 27, 2016 at 3:24 PM, Maximilian Michels <[hidden email]>
>>> wrote:
>>>
>>>> +1 for making an effort to tackle test stability problems and
>>>> potential involved bugs.
>>>>
>>>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>>>
>>>>> @Max: I think you wanted to look into whether we can use Apache's
>>>>> Jenkins server for our builds instead of Travis. Did you ever get
>>>>> around at looking into it? If yes: What's your opinion on replacing
>>>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>>>> Travis-specific problems?
>>>>>
>>>> I've experimented with the ASF Jenkins installation while setting up
>>>> our nightly snapshot builds. I've observed that the build servers are
>>>> pretty busy. I don't know how busy they are compared to the Travis
>>>> servers and whether we could have more stable builds using Jenkins. I
>>>> guess we would have to try over a period of time.
>>>>
>>>> I was hesitant to enable Jenkins for pull requests because I didn't
>>>> want to spam the ASF servers with builds. Also, there are some
>>>> remaining steps for a good integration like making the Yarn logs
>>>> available (not hard to do though).
>>>>
>>>> What do you think about enabling Jenkins builds for the master and see
>>>> how that goes?
>>>>
>>>> On Wed, Apr 27, 2016 at 2:54 PM, Ufuk Celebi <[hidden email]> wrote:
>>>>
>>>>> Filed an issue with INFRA:
>>>>> https://issues.apache.org/jira/browse/INFRA-11773
>>>>>
>>>>> @Robert: I agree, but still we see failing builds over and over again.
>>>>> At best it is annoying, at worst it "hides" new bugs being introduced.
>>>>>
>>>>> On Wed, Apr 27, 2016 at 2:41 PM, Till Rohrmann <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> That is good to hear that we can so easily solve most of the failing
>>>>>> builds. We should then iterate over the open test-stability issues to
>>>>>> see
>>>>>> whether they are still valid after we've merged PR 1915.
>>>>>>
>>>>>> On Wed, Apr 27, 2016 at 2:25 PM, Robert Metzger <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm not sure if the issues is as big as it seems on a first sight.
>>>>>>> The reason why all the builds of master are red on travis is that the
>>>>>>> cache
>>>>>>> of the 5th build is invalid. We have to ask infra to delete the
>>>>>>> caches and
>>>>>>> then they'll be green again.
>>>>>>>
>>>>>>> On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Along the lines of what Greg already mentioned, I would like to
>>>>>>>> re-iterate that Travis is often a problem too:
>>>>>>>> - long build times and we are reaching the time limit
>>>>>>>> - unreliable I/O
>>>>>>>> - unreliable resolving of build dependencies
>>>>>>>>
>>>>>>>> @Max: I think you wanted to look into whether we can use Apache's
>>>>>>>> Jenkins server for our builds instead of Travis. Did you ever get
>>>>>>>> around at looking into it? If yes: What's your opinion on replacing
>>>>>>>> Travis with Jenkins? Is it a viable option? Would it improve the
>>>>>>>> Travis-specific problems?
>>>>>>>>
>>>>>>>> On the other hand, the very slow Travis machines also helped
>>>>>>>> discovering some hard-to-catch race conditions.
>>>>>>>>
>>>>>>>> – Ufuk
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 27, 2016 at 2:01 PM, Greg Hogan <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We have also started running over Travis' 2 hour limit for the
>>>>>>>>> longest build.
>>>>>>>>> Greg
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 27, 2016, at 7:53 AM, Ufuk Celebi <[hidden email]> wrote:
>>>>>>>>>> Hi Till,
>>>>>>>>>>
>>>>>>>>>> thank you for bringing this up. We really need to fix this.
>>>>>>>>>>
>>>>>>>>>> Filing JIRAs with critical priority was how we tried to solve it in
>>>>>>>>>> the past, but obviously it did not work. There seems to be a
>>>>>>>>>> mismatch
>>>>>>>>>> between assigned and actual priorities.
>>>>>>>>>>
>>>>>>>>>> As a first step, I would volunteer to gather a list of tests, which
>>>>>>>>>> have failed in the last weeks and make sure that we have JIRAs for
>>>>>>>>>> them.
>>>>>>>>>>
>>>>>>>>>> As a next step, we should coordinate how to resolve those issues
>>>>>>>>>> (maybe prioritized by failure frequency) to get master stable
>>>>>>>>>> again.
>>>>>>>>>>
>>>>>>>>>> – Ufuk
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 27, 2016 at 12:12 PM, Till Rohrmann <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Flink community,
>>>>>>>>>>>
>>>>>>>>>>> I just wanted to raise awareness that in the last 16 days there was just a
>>>>>>>>>>> single Travis build of master which passed all tests. This indicates that
>>>>>>>>>>> we have some serious problems with our test stability or even worse a
>>>>>>>>>>> problem with the master itself. Having an unstable master makes it really
>>>>>>>>>>> hard to assess whether new changes actually broke something or whether the
>>>>>>>>>>> failing test was unrelated.
>>>>>>>>>>>
>>>>>>>>>>> We have currently 37 open issues labeled with test-stability and most of
>>>>>>>>>>> them have a critical priority. Therefore, I would propose that we try to
>>>>>>>>>>> tackle them as soon as possible in order to improve our testing stability.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till