Hi Flink community,
I just wanted to raise awareness that in the last 16 days only a single Travis build of master passed all tests. This indicates that we have serious problems with our test stability or, even worse, a problem with master itself. An unstable master makes it really hard to assess whether a new change actually broke something or whether the failing test was unrelated.

We currently have 37 open issues labeled test-stability, and most of them have critical priority. Therefore, I propose that we tackle them as soon as possible in order to improve our test stability.

Cheers,
Till |
Hi Till,
thank you for bringing this up. We really need to fix this.

Filing JIRAs with critical priority was how we tried to solve this in the past, but it obviously did not work. There seems to be a mismatch between assigned and actual priorities.

As a first step, I volunteer to gather a list of the tests that have failed in the last few weeks and to make sure that we have JIRAs for them.

As a next step, we should coordinate how to resolve those issues (maybe prioritized by failure frequency) to get master stable again.

– Ufuk

In reply to Till Rohrmann's post of Wed, Apr 27, 2016 at 12:12 PM |
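A minimal sketch of the "prioritized by failure frequency" idea above, assuming the failures have already been collected as (build, test) pairs. The build numbers and test names below are hypothetical placeholders, not entries from the real Travis history, and `rank_by_failure_frequency` is an illustrative helper rather than an existing tool.

```python
from collections import Counter

# Hypothetical records: (build number, name of a test that failed in it).
# Neither the build numbers nor the test names come from the real build history.
failures = [
    (101, "ExampleCheckpointITCase"),
    (101, "ExampleYarnSessionTest"),
    (102, "ExampleCheckpointITCase"),
    (104, "ExampleKafkaSourceITCase"),
    (105, "ExampleCheckpointITCase"),
    (105, "ExampleYarnSessionTest"),
]


def rank_by_failure_frequency(records):
    """Return (test, failed build count) pairs, most frequent first."""
    # Deduplicate so a test is counted at most once per build.
    unique = set(records)
    counts = Counter(test for _build, test in unique)
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_failure_frequency(failures)
for test, count in ranking:
    print(f"{count:3d}  {test}")
```

The resulting order could then be used to decide which test-stability JIRAs to look at first.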
We have also started running over Travis' 2-hour limit for the longest build.

Greg

In reply to Ufuk Celebi's post of Apr 27, 2016 at 7:53 AM |
We just opened a PR for this (FLINK-1827 - https://github.com/apache/flink/pull/1915) that improves test stability (and allows skipping the compilation of the tests entirely when it is not required). One error still remains to be solved (in the hadoop-1 build and in the ml library), but I think it should not be too difficult to fix. It is probably caused by a missing compile dependency that was introduced by hadoop2.

In reply to Greg Hogan's post of Wed, Apr 27, 2016 at 2:01 PM |
In reply to this post by Greg Hogan
Along the lines of what Greg already mentioned, I would like to reiterate that Travis itself is often a problem too:
- long build times, and we are reaching the time limit
- unreliable I/O
- unreliable resolution of build dependencies

@Max: I think you wanted to look into whether we can use Apache's Jenkins server for our builds instead of Travis. Did you ever get around to looking into it? If yes: What's your opinion on replacing Travis with Jenkins? Is it a viable option? Would it improve the Travis-specific problems?

On the other hand, the very slow Travis machines have also helped us discover some hard-to-catch race conditions.

– Ufuk |
I'm not sure the issue is as big as it seems at first sight.
The reason why all the builds of master are red on Travis is that the cache of the 5th build is invalid. We have to ask infra to delete the caches, and then they'll be green again.

In reply to Ufuk Celebi's post of Wed, Apr 27, 2016 at 2:13 PM |
It is good to hear that we can solve most of the failing builds so easily. We should then go through the open test-stability issues to see whether they are still valid after we've merged PR 1915.

In reply to Robert Metzger's post of Wed, Apr 27, 2016 at 2:25 PM |
Filed an issue with INFRA: https://issues.apache.org/jira/browse/INFRA-11773

@Robert: I agree, but we still see failing builds over and over again. At best it is annoying; at worst it "hides" new bugs being introduced.

In reply to Till Rohrmann's post of Wed, Apr 27, 2016 at 2:41 PM |
+1 for making an effort to tackle the test stability problems and any bugs potentially involved.

On Wed, Apr 27, 2016 at 2:13 PM, Ufuk Celebi wrote:
> @Max: I think you wanted to look into whether we can use Apache's
> Jenkins server for our builds instead of Travis. Did you ever get
> around to looking into it? If yes: What's your opinion on replacing
> Travis with Jenkins? Is it a viable option? Would it improve the
> Travis-specific problems?

I've experimented with the ASF Jenkins installation while setting up our nightly snapshot builds. I've observed that the build servers are pretty busy. I don't know how busy they are compared to the Travis servers or whether we could have more stable builds using Jenkins. I guess we would have to try it over a period of time.

I was hesitant to enable Jenkins for pull requests because I didn't want to spam the ASF servers with builds. Also, there are some remaining steps for a good integration, like making the YARN logs available (not hard to do, though).

What do you think about enabling Jenkins builds for master and seeing how that goes?

In reply to Ufuk Celebi's post of Wed, Apr 27, 2016 at 2:54 PM |
Caches have been cleared again (see https://issues.apache.org/jira/browse/INFRA-11773). The first time did not help. This second request was more an act of desperation. :-( Let's see what happens now.

In reply to Maximilian Michels's post of Wed, Apr 27, 2016 at 3:24 PM |
If this doesn't work, we may want to think about temporarily disabling the problematic profile.

In reply to Ufuk Celebi's post of 23.05.2016 09:53 |
We could also try to disable the caching of the .m2 directory (I suspect
that it contains broken jar files). The problem is that this will make the builds slower on Travis because we need to download more.

On Mon, May 23, 2016 at 10:18 AM, Chesnay Schepler <[hidden email]> wrote:
> If this doesn't work we may want to think about disabling the problematic
> profile temporarily. |
If we disable caching, let it run for one build, and enable it again, will
that effectively clear the .m2 cache?

On 23.05.2016 12:00, Robert Metzger wrote:
> We could also try to disable the caching of the .m2 directory (I suspect
> that it contains broken jar files). The problem is that this will make
> the builds slower on Travis because we need to download more. |