Hi all,
As a follow-up to our discussion on reducing the build time [1], I would like to propose migrating our build infrastructure from Travis to Azure Pipelines.

I believe that we have reached the limits of what Travis can provide the Flink community, and I don't want the build system to limit or influence the project's growth.

*Benefits:*
1. Free Travis accounts are limited to 5 parallel builds, with a timeout of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts* for free for open source projects.
2. Azure Pipelines allows us to *add custom build machines* to the pool of 10 free parallel builders.
This will allow the Flink community to scale the available build capacity as the project grows. We are dependent on donations from supporting companies, but I believe that it is easier for companies to donate machines than money.
Alibaba is willing to provide 10 machines with 32 cores each to the Flink project for this purpose.
In addition, Xiyuan, who is working on adding ARM support for Flink, provided me with 2 ARM machines (16 cores each).
I want to use the custom, more efficient build machines for building Flink's pull requests and master pushes.
3. *Azure Pipelines is a more feature-rich tool*. For example, it can transfer intermediate build artifacts between pipeline stages. This will allow us to make the build more reliable (we are currently abusing the caching mechanism in Travis for this).
It also has some basic analytics on test results, flaky tests etc.

*Known problems:*
- Initially, we might see different build instabilities than before.
- There's a higher maintenance overhead for the custom build machines (keeping them up to date etc.).
- We cannot use the build status integration of AZP, because it requires write access to the repository's source. The foundation does not allow that [2]. I propose to extend flinkbot / the flink-ci repository instead.

*Current Status:*
- I'm able [3] to execute [4] the current custom build scripts on Azure Pipelines. This means that we will have one compile stage and N testing jobs in the second stage. Currently, we have N=10 testing jobs.
The time from the start of a build until all tests have completed is 1h 22min.
- I'm working on getting the nightly end-to-end tests to run on the new infrastructure.
- I'm working on getting the build to work on our pool of custom machines as well.
- I'm working on setting up the full matrix of builds (different Scala, Hadoop etc. versions) for the nightlies.

*Next Steps:*
- I propose to document the entire build system in the Flink Wiki.
- Once Azure can cover the same pull request tests as Travis, I would set it up to run in parallel (including Flinkbot posting links to Azure). I hope that this phase lasts for only 1-2 weeks, so that we do not have to maintain both setups concurrently. I will monitor the build stability closely, but would expect some support from the contributors with debugging potential issues.
- Once there are no problems with the new setup, we remove the Travis setup.
- Independently, I will work on triggering builds from master and release branch pushes, as well as cron builds from the master branch. All this will be described in the Wiki.

*Timeline:*
- Once I have the feeling that people are supportive of the idea, I will start documenting in the Wiki. The first pull requests should show up after a few more days.
I will take one month of parental leave starting some time later in December, which will probably delay things a bit. I hope to have everything finished by the end of January.

I'm happy to hear your thoughts on this work.
If nobody objects, I will start documenting the system and prepare everything for the migration.

Best,
Robert

[1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-17030
[3] https://github.com/rmetzger/flink/tree/azure_playground
[4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
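
As a rough illustration of the two-stage layout described under *Current Status* (one compile stage followed by N parallel testing jobs), here is a minimal Python sketch. It is not the actual Azure Pipelines configuration: the names `Job` and `plan_build`, the module names, and the round-robin split are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    """One unit of work scheduled on a build machine (hypothetical model)."""
    name: str
    modules: list[str] = field(default_factory=list)


def plan_build(test_modules: list[str], n_test_jobs: int = 10) -> dict[str, list[Job]]:
    """Model a two-stage build: one compile job, then N parallel test jobs."""
    if n_test_jobs < 1:
        raise ValueError("need at least one test job")

    test_jobs = [Job(name=f"test_{i}") for i in range(n_test_jobs)]

    # Round-robin assignment is an assumption; a real split would more
    # likely balance the jobs by expected test duration.
    for index, module in enumerate(test_modules):
        test_jobs[index % n_test_jobs].modules.append(module)

    return {
        "compile": [Job(name="compile")],  # stage 1: build once
        "test": test_jobs,                 # stage 2: jobs reuse the stage 1 output
    }


if __name__ == "__main__":
    # Hypothetical module names, only to show the shape of the plan.
    plan = plan_build([f"module-{i}" for i in range(25)], n_test_jobs=10)
    for job in plan["test"]:
        print(job.name, job.modules)
```

The point of the split is that the compile output is produced once and handed to every testing job, which is what transferring intermediate build artifacts between stages makes possible.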

Thanks Robert for the updates! And thanks a lot for all the effort to investigate, experiment with and tune Azure Pipelines for building Flink.
Big +1 for it.

It would be great if the community's build capacity could be extended with custom machines, so that tests are not queued for long as the number of PRs grows daily.

The increased timeout would also be very helpful.
The 50-minute timeout for free Travis accounts is currently a pain, especially when we'd like to run e2e tests in our own Travis. I had to manually split the jobs to make them pass.

Thanks,
Zhu Zhu

On Wed, Dec 4, 2019 at 6:36 PM Robert Metzger <[hidden email]> wrote:
> Hi all,
>
> as a follow up from our discussion on reducing the build time [1], I would like to propose migrating our build infrastructure to Azure Pipelines (away from Travis).
> [...]

Thanks Robert for driving this. There is another big pain point with Travis at the moment: its cache mechanism fails from time to time. Roughly 50% of the build failures are caused by cache problems. I reported this issue to Travis but have not received a response yet. So big +1 from my side.

Just one comment: we are close to the 1.10 feature freeze, and we will spend some time making the tests stable before the release. I would prefer this replacement to happen after the 1.10 release; otherwise it will be an unstable factor during release testing.

Best,
Kurt

On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <[hidden email]> wrote:
> Thanks Robert for the updates! And thanks a lot for all the efforts to investigate, experiment and tune Azure Pipelines for Flink building.
> Big +1 for it.
> [...]

From what I've seen so far, Azure will provide us with a better experience, so I'd say +1 for the transition as a whole.

I'd delay the merge at least until the feature branch is cut.
Given the parental leave, it may even make sense to only start merging in January afterwards, to reduce the total time taken for the transition.

Reviews could maybe be done earlier, but I'm wondering whether anyone would even have the time to do so at the moment.

On 04/12/2019 12:35, Kurt Young wrote:
> Thanks Robert for driving this. There is another big pain point of current travis, which is its cache mechanism will fail from time to time.
> [...]

@robert Can you expand on how the Azure setup interacts with CiBot? Do we have to continue mirroring builds into flink-ci? How will the cronjob configuration work? We should have a general idea of how to implement this before proceeding.
Additionally, moving /all/ jobs into flink-ci requires setting up the environment variables we have; can we set these up via files, or will we have to give all committers permissions for flink-ci/flink?

On 04/12/2019 12:55, Chesnay Schepler wrote:
> From what I've seen so far Azure will provide us a better experience, so I'd say +1 for the transition as a whole.
> [...]

+1 for moving to Azure Pipelines as it promises better scalability and tooling. Looking forward to having faster builds and hence shorter feedback cycles :-)

Cheers,
Till

On Wed, Dec 4, 2019 at 1:24 PM Chesnay Schepler <[hidden email]> wrote:
> @robert Can you expand how the azure setup interacts with CiBot?
> [...]

+1
--
Best Regards
Jeff Zhang
+1 for Azure pipeline because it promises better performance.
However, I have two concerns:

1) Travis provides a free personal service for testing personal branches. Contributors usually use this to test PoCs or to run cron jobs for pull requests. Running the tests on a local machine takes a lot of time. Does AZP provide the same free service?
2) Currently, we have deployed a webhook [1] that receives Travis CI build notifications [2] and forwards them to the [hidden email] mailing list. We need to figure out how to send Azure build results to the mailing list. Azure DevOps service hooks [3] might be the way to go.

Best,
Jark

[1]: https://github.com/wuchong/flink-notification-bot
[2]: https://docs.travis-ci.com/user/notifications/#configuring-webhook-notifications
[3]: https://docs.microsoft.com/en-us/azure/devops/service-hooks/overview?view=azure-devops
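To make the second concern a bit more concrete, here is a rough sketch of how a service-hook payload could be turned into a mail for the builds list. This is not how flink-notification-bot works today and not a confirmed Azure payload schema: the `resource`, `buildNumber`, `result`, `status` and `url` field names are assumptions about what a build-completed event might carry, the function name is made up for illustration, and the addresses are placeholders. The sketch only builds the message; sending it (for example via SMTP) is left out.

```python
from email.message import EmailMessage


def build_notification_email(payload, sender, recipient):
    """Turn a (hypothetical) build-completed webhook payload into an email.

    The payload layout is an assumption for illustration, not a documented schema.
    """
    resource = payload.get("resource", {})
    build_number = resource.get("buildNumber", "unknown")
    # Prefer an explicit result; fall back to a status field if that is all we get.
    result = resource.get("result") or resource.get("status", "unknown")
    url = resource.get("url", "")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[Azure Pipelines] Build {build_number}: {result}"
    msg.set_content(
        f"Build {build_number} finished with result: {result}\n"
        f"Details: {url}\n"
    )
    return msg


# Example with a made-up payload and placeholder addresses.
sample_payload = {
    "eventType": "build.complete",
    "resource": {
        "buildNumber": "20191205.1",
        "result": "failed",
        "url": "https://example.invalid/build/1",
    },
}
mail = build_notification_email(
    sample_payload, "ci-bot@example.invalid", "builds@example.invalid"
)
```

A small receiver in the style of the existing Travis webhook could call such a function for each incoming event and hand the message to the mail server, so the list format would stay close to what subscribers see now.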
+1
Thanks for the effort! The tooling seems to be quite a bit nicer and I like that we can grow by adding more machines.

Best,
Aljoscha
+1 I had good experiences with Azure pipelines in the past.
> >>>>>>>> - I'm working on getting the build to work on our pool of custom > >>>>>>>> machines > >>>>>>>> as well > >>>>>>>> - I'm working on setting up the full matrix of builds (different > >>>>>>>> scala, > >>>>>>>> hadoop etc. versions) for the nightlies > >>>>>>>> > >>>>>>>> *Next Steps:* > >>>>>>>> - I propose to document the entire build system in the Flink Wiki > >>>>>>>> - Once Azure can cover the same pull request tests as Travis, I > >>>>>>>> would set > >>>>>>>> it up to run in parallel (including Flinkbot posting links to > >>>>>>>> Azure). I > >>>>>>>> hope that this phase lasts for 1-2 weeks only, so that we do not > >>>>>>>> have to > >>>>>>>> maintain things concurrently. I will monitor the build stability > >>>>>>>> closely, > >>>>>>>> but would expect some support with debugging potential issues from > >>> the > >>>>>>>> contributors. > >>>>>>>> - Once there are no problems with the new setup, we remove the > >>> Travis > >>>>>>>> setup. > >>>>>>>> - Independently, I will work on triggering builds from master / > >>>>>>>> release - > >>>>>>>> branch pushes, as well as cron builds from the master branch ... > >>>>>>>> all this > >>>>>>>> will be described in the Wiki. > >>>>>>>> > >>>>>>>> > >>>>>>>> *Timeline:*- Once I have the feeling that people are supportive of > >>> the > >>>>>>>> idea, I will start documenting in the Wiki. The first pull > >> requests > >>>>>>> should > >>>>>>>> show up after a few more days. > >>>>>>>> I will do a one month parental leave starting some time later in > >>>>>>> December, > >>>>>>>> which will probably delay things a bit. I hope to have everything > >>>>>>> finished > >>>>>>>> by end of January. > >>>>>>>> > >>>>>>>> I'm happy to hear your thoughts on this work. > >>>>>>>> If nobody objects, I will start documenting the system and prepare > >>>>>>>> everything for the migration. 
> >>>>>>>> > >>>>>>>> Best, > >>>>>>>> Robert > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> [1] > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>> > >>> > >> > https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E > >>>>>>> > >>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17030 > >>>>>>>> [3] https://github.com/rmetzger/flink/tree/azure_playground > >>>>>>>> [4] > >>>>>>> > >>> https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>> > >> > >> > >> -- > >> Best Regards > >> > >> Jeff Zhang > >> > > |
Thank you all for the positive feedback. I will start putting together a
page in the wiki.

@Jark: Azure Pipelines provides a free service that is even better than
what Travis provides for free: 10 parallel builds with 6-hour timeouts.

@Chesnay: I will answer your questions in the yet-to-be-written
documentation in the wiki.

On Thu, Dec 5, 2019 at 11:58 AM Arvid Heise <[hidden email]> wrote:
> +1 I had good experiences with Azure pipelines in the past.
I've created a first draft of my plans in the wiki:
https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines
I'm looking forward to your comments.

On Thu, Dec 5, 2019 at 12:37 PM Robert Metzger <[hidden email]> wrote:
> Thank you all for the positive feedback. I will start putting together a
> page in the wiki.
>
> @Jark: Azure Pipelines provides a free service that is even better than
> what Travis provides for free: 10 parallel builds with 6-hour timeouts.
>
> @Chesnay: I will answer your questions in the yet-to-be-written
> documentation in the wiki.
Hi Robert

It is really exciting to see this new, more powerful CI tool get rid of the 50-minute limit of the free Travis CI account. After reading the wiki, I support idea 2 of AZP-setup version-2.

However, after digging into some failing builds at https://dev.azure.com/rmetzger/Flink/_build , I found that we cannot view the logs of some IT cases, which were previously uploaded to transfer.sh by travis_watchdog. I think this feature is also easy to implement in AZP, right?

Best
Yun Tang

On 12/6/19, 12:19 AM, "Robert Metzger" <[hidden email]> wrote:

    I've created a first draft of my plans in the wiki:
    https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines
    I'm looking forward to your comments.
>> > >>>>>>>> In addition, Xiyuan, who's working on adding ARM support for >> Flink >> > >>>>>>> provided >> > >>>>>>>> me with 2 ARM machines (16 cores each). >> > >>>>>>>> I want to use the custom, more efficient build machines for >> > >> building >> > >>>>>>>> Flink's pull requests and master-pushes. >> > >>>>>>>> 3. *Azure Pipelines is a more feature-rich tool*, allowing for >> > >>>>>>>> example to >> > >>>>>>>> transfer intermediate build artifacts between pipeline stages. >> > >> This >> > >>>>>>>> will >> > >>>>>>>> allow us to make the build more reliable (we are currently >> abusing >> > >>> the >> > >>>>>>>> caching mechanism in Travis for this). >> > >>>>>>>> It also has some basic analytics on test results / flaky tests >> > >> etc. >> > >>>>>>>> >> > >>>>>>>> *Known problems:* >> > >>>>>>>> - Initially, we might see different build instabilities than >> > >> before >> > >>>>>>>> - There's a higher maintenance overhead for the custom build >> > >>> machines >> > >>>>>>>> (keeping them up to date etc.) >> > >>>>>>>> - We can not use the build status integration of AZP, because >> they >> > >>>>>>> require >> > >>>>>>>> write access to the repository's source. The foundation does >> not >> > >>> allow >> > >>>>>>> that >> > >>>>>>>> [2]. >> > >>>>>>>> I propose to extend flinkbot / the flink-ci repository. >> > >>>>>>>> >> > >>>>>>>> *Current Status:* >> > >>>>>>>> - I'm able [3] to execute [4] the current custom build scripts >> on >> > >>>>>>>> Azure >> > >>>>>>>> Pipelines: This means that we will have one compile stage, and >> N >> > >>>>>>>> testing >> > >>>>>>>> jobs in the 2nd stage. Currently, we have N=10 testing jobs. >> > >>>>>>>> The time from the start of a build till all tests have >> completed >> > >> is >> > >>>>>>>> 1h22 >> > >>>>>>>> minutes. >> > >>>>>>>> - I'm working on getting the nightly end to end tests to run on >> > >> the >> > >>>>>>>> new >> > >>>>>>>> infrastructure. 
>> > >>>>>>>> - I'm working on getting the build to work on our pool of >> custom >> > >>>>>>>> machines >> > >>>>>>>> as well >> > >>>>>>>> - I'm working on setting up the full matrix of builds >> (different >> > >>>>>>>> scala, >> > >>>>>>>> hadoop etc. versions) for the nightlies >> > >>>>>>>> >> > >>>>>>>> *Next Steps:* >> > >>>>>>>> - I propose to document the entire build system in the Flink >> Wiki >> > >>>>>>>> - Once Azure can cover the same pull request tests as Travis, I >> > >>>>>>>> would set >> > >>>>>>>> it up to run in parallel (including Flinkbot posting links to >> > >>>>>>>> Azure). I >> > >>>>>>>> hope that this phase lasts for 1-2 weeks only, so that we do >> not >> > >>>>>>>> have to >> > >>>>>>>> maintain things concurrently. I will monitor the build >> stability >> > >>>>>>>> closely, >> > >>>>>>>> but would expect some support with debugging potential issues >> from >> > >>> the >> > >>>>>>>> contributors. >> > >>>>>>>> - Once there are no problems with the new setup, we remove the >> > >>> Travis >> > >>>>>>>> setup. >> > >>>>>>>> - Independently, I will work on triggering builds from master / >> > >>>>>>>> release - >> > >>>>>>>> branch pushes, as well as cron builds from the master branch >> ... >> > >>>>>>>> all this >> > >>>>>>>> will be described in the Wiki. >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> *Timeline:*- Once I have the feeling that people are >> supportive of >> > >>> the >> > >>>>>>>> idea, I will start documenting in the Wiki. The first pull >> > >> requests >> > >>>>>>> should >> > >>>>>>>> show up after a few more days. >> > >>>>>>>> I will do a one month parental leave starting some time later >> in >> > >>>>>>> December, >> > >>>>>>>> which will probably delay things a bit. I hope to have >> everything >> > >>>>>>> finished >> > >>>>>>>> by end of January. >> > >>>>>>>> >> > >>>>>>>> I'm happy to hear your thoughts on this work. 
>> > >>>>>>>> If nobody objects, I will start documenting the system and >> prepare >> > >>>>>>>> everything for the migration. >> > >>>>>>>> >> > >>>>>>>> Best, >> > >>>>>>>> Robert >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>>> [1] >> > >>>>>>>> >> > >>>>>>>> >> > >>>>>>> >> > >>>> >> > >>> >> > >> >> > >> https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E >> > >>>>>>> >> > >>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17030 >> > >>>>>>>> [3] https://github.com/rmetzger/flink/tree/azure_playground >> > >>>>>>>> [4] >> > >>>>>>> >> > >>> >> https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>> >> > >>>> >> > >>> >> > >> >> > >> >> > >> -- >> > >> Best Regards >> > >> >> > >> Jeff Zhang >> > >> >> > >> > >> > |
Thanks for your comments Yun.
If there's strong support for idea 2, it would actually make my life easier: the migration would be easier to do.

I also noticed that the uploads to transfer.sh were broken, but this should be fixed in the "rmetzger.flink" builds (coming from rmetzger/flink). The builds in "flink-ci.flink" (coming from flink-ci/flink) might have trouble with transfer.sh.
Hi Robert,
Thanks for bringing up this topic. The 2 ARM machines (16 cores each) that I donated are just for POC testing. We (Huawei) can donate more once Flink moves to the official Azure pipeline. :)
> > >> > >>>>>>>> > > >> > >>>>>>>> Best, > > >> > >>>>>>>> Robert > > >> > >>>>>>>> > > >> > >>>>>>>> > > >> > >>>>>>>> > > >> > >>>>>>>> [1] > > >> > >>>>>>>> > > >> > >>>>>>>> > > >> > >>>>>>> > > >> > >>>> > > >> > >>> > > >> > >> > > >> > > > >> > > > https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E > > >> > >>>>>>> > > >> > >>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17030 > > >> > >>>>>>>> [3] > > https://github.com/rmetzger/flink/tree/azure_playground > > >> > >>>>>>>> [4] > > >> > >>>>>>> > > >> > >>> > > >> > > https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary > > >> > >>>>> > > >> > >>>>> > > >> > >>>>> > > >> > >>>> > > >> > >>>> > > >> > >>> > > >> > >> > > >> > >> > > >> > >> -- > > >> > >> Best Regards > > >> > >> > > >> > >> Jeff Zhang > > >> > >> > > >> > > > >> > > > >> > > > > > > > > > > |
+1 for migrating to Azure Pipelines, as this can give us shorter build times and faster responses.

Best,
Congxian
|
+1 for the migration.

*10 parallel builds with 300 minute timeouts* is very useful for long-running tasks such as the e2e tests.

Also, in Travis it looks like we compile the entire project for every cron task, even when the tasks use the same profile, e.g.:

`name: e2e - misc - hadoop 2.8
name: e2e - ha - hadoop 2.8
name: e2e - sticky - hadoop 2.8
name: e2e - checkpoints - hadoop 2.8
name: e2e - container - hadoop 2.8
name: e2e - heavy - hadoop 2.8
name: e2e - tpcds - hadoop 2.8`

We compile the entire project with the `hadoop 2.8` profile 7 times, and every task takes about 25 minutes.

@robert @chesnay Should we consider compiling once for multiple cron tasks that share the same profile in the new Azure Pipelines setup?

Best,
Leonard Xu
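To make the "compile once per profile" idea concrete, here is a minimal Python sketch that counts how many full compilations the seven `hadoop 2.8` cron jobs from Leonard's message would need with and without a shared compile step. It is only an illustration: the helper names `group_by_profile` and `compile_runs` are hypothetical and not part of the Flink build scripts, and it assumes the profile is the last ` - `-separated field of each job name.

```python
from collections import defaultdict

# Cron job names copied from Leonard's message.
CRON_JOBS = [
    "e2e - misc - hadoop 2.8",
    "e2e - ha - hadoop 2.8",
    "e2e - sticky - hadoop 2.8",
    "e2e - checkpoints - hadoop 2.8",
    "e2e - container - hadoop 2.8",
    "e2e - heavy - hadoop 2.8",
    "e2e - tpcds - hadoop 2.8",
]


def group_by_profile(job_names):
    """Group job names by build profile.

    Assumption: the profile is the last ' - '-separated field of the name.
    """
    groups = defaultdict(list)
    for name in job_names:
        profile = name.rsplit(" - ", 1)[-1]
        groups[profile].append(name)
    return dict(groups)


def compile_runs(job_names, share_compile):
    """Return how many full project compilations are needed.

    Without sharing, every job compiles on its own; with sharing, one
    compilation per distinct profile is reused by all jobs of that profile.
    """
    if share_compile:
        return len(group_by_profile(job_names))
    return len(job_names)


if __name__ == "__main__":
    print("compilations today:", compile_runs(CRON_JOBS, share_compile=False))
    print("compilations when shared:", compile_runs(CRON_JOBS, share_compile=True))
```

For the jobs listed, this drops the count from seven compilations to one; how the compiled output would actually be handed to the test jobs (for example via the artifact transfer between pipeline stages mentioned in the proposal) is left open here.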
|
@Leonard: On Azure, I no longer split the execution of the end-to-end tests, so we avoid the overhead of compiling the same profile multiple times.

@all: We have recently merged a first version of the Azure configuration files into Flink [1]. This allows us to build pull requests with all the additional checks we had in place for Travis.
In the next few days, I'm going to set up builds for pushes and the nightly cron jobs on Azure as well.

From now on, you can also set up Azure Pipelines for your own Flink fork and run the end-to-end tests there quite easily [2].
I'll be closely monitoring the new setup in the coming days. Expect some smaller issues as long as not all pull requests contain my changes (at some point, I will change a configuration in Azure, which will break builds that do not have my changes).
Once Azure is stable and has the same features as the Travis build, we'll stop processing builds on Travis.

[1] https://github.com/apache/flink/pull/10976
[2] https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines#id-[preview]AzurePipelines-Runningendtoendtests:

On Mon, Dec 9, 2019 at 2:16 PM Leonard Xu <[hidden email]> wrote:

> +1 for the migration.
> *10 parallel builds with 300 minute timeouts * is very useful for tasks
> that takes long time like e2e tests.
> And in Travis, looks like we compile entire project for every cron task
> even if they use same profile, eg:
> `name: e2e - misc - hadoop 2.8
> name: e2e - ha - hadoop 2.8
> name: e2e - sticky - hadoop 2.8
> name: e2e - checkpoints - hadoop 2.8
> name: e2e - container - hadoop 2.8
> name: e2e - heavy - hadoop 2.8
> name: e2e - tpcds - hadoop 2.8`
> We will compile entire project with profile `hadoop 2.8` 7 times, and
> every task will take about 25 minutes.
> @robert @chesnay Should we consider to compile once for multi cron task
> which have same profile in the new Azure Pipelines?
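To put a rough number on the compile-once point raised in this exchange, here is a minimal sketch of the arithmetic. It assumes that the "about 25 minutes" in Leonard's mail refers to the compile step of each of the seven `hadoop 2.8` e2e cron tasks; the function name is made up for illustration and is not part of any Flink or Azure tooling.

```python
def redundant_compile_minutes(tasks: int, compile_minutes: float) -> float:
    """Compile time avoided when `tasks` jobs share one compile step
    instead of each compiling the project on its own."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    # Every task compiles for itself (the Travis cron setup described above).
    per_task_total = tasks * compile_minutes
    # One shared compile whose output is reused by all tasks.
    compile_once_total = compile_minutes
    return per_task_total - compile_once_total


# Assumed inputs taken from the mail: 7 e2e tasks, roughly 25 minutes each.
saved = redundant_compile_minutes(tasks=7, compile_minutes=25)
print(f"compile time avoided per profile: {saved:.0f} minutes")
```

Under these assumptions, roughly 150 of the 175 compile minutes per profile are redundant. The real saving depends on how long the compile actually takes on the Azure machines.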
Quick update on this effort: Since yesterday, I have been experimenting with running the end-to-end tests on each pull request and each push to "master". I hope that this helps to uncover issues earlier (without waiting for the nightly test execution).
The tests run for almost 3 hours, so the overall build status will remain "PENDING" for quite a while. You should still get the regular compile / test results sooner (depending on the time of day).
We might run into capacity issues when running the end-to-end tests for each PR. I'll monitor this closely and report back.

In general, please let me know if you have any problems with the new CI setup. For test failures, I'm happy to fix any issues caused by the build system; just file a ticket for the new "Build System / Azure Pipelines <https://issues.apache.org/jira/issues/?jql=project+%3D+FLINK+AND+component+%3D+%22Build+System+%2F+Azure+Pipelines%22>" component.

On Mon, Feb 17, 2020 at 12:23 PM Robert Metzger <[hidden email]> wrote:

> @Leonard: On Azure, I'm not splitting the execution of the end to end
> tests anymore. We won't have the overhead of compiling the same profile
> multiple times anymore.
>
>
> @all: We have recently merged a first version of the Azure configuration
> files to Flink [1]. This will allow us to build pull requests with all the
> additional checks we had in place for Travis as well.
> In the next few days, I'm going to build pushes and the nightly crons on
> Azure as well.
>
> From now on, you can set up Azure Pipelines for your own Flink fork as
> well, and execute end to end tests there quite easily [2].
> I'll be closely monitoring the new setup in the coming days. Expect some
> smaller issues while not all pull requests have my changes (at some point,
> I will change a configuration in Azure, which will break builds that do not
> have my changes)
> Once Azure is stable, and we have the same features as the Travis build,
> we'll stop processing builds on Travis.
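As a back-of-the-envelope view of the capacity concern in the update above, the following sketch estimates an upper bound on end-to-end runs per day. It assumes that each run occupies exactly one build slot for its full duration and that all 10 free parallel slots from the original proposal are available for e2e runs. Neither is guaranteed, since the slots are shared with the compile and test jobs. The function name is hypothetical.

```python
def max_runs_per_day(parallel_slots: int, run_hours: float) -> float:
    """Upper bound on runs per day if every slot is busy around the clock."""
    if parallel_slots < 0:
        raise ValueError("parallel_slots must not be negative")
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    return parallel_slots * 24 / run_hours


# Assumed inputs: 10 free parallel builds, about 3 hours per e2e run.
print(max_runs_per_day(parallel_slots=10, run_hours=3))
```

With these assumed numbers the ceiling is about 80 e2e runs per day. The practical limit is lower, because pull request and master builds compete for the same slots.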
> > > [1] https://github.com/apache/flink/pull/10976 > [2] > https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines#id-[preview]AzurePipelines-Runningendtoendtests: > > On Mon, Dec 9, 2019 at 2:16 PM Leonard Xu <[hidden email]> wrote: > >> +1 for the migration. >> *10 parallel builds with 300 minute timeouts * is very useful for tasks >> that takes long time like e2e tests. >> And in Travis, looks like we compile entire project for every cron task >> even if they use same profile, eg: >> `name: e2e - misc - hadoop 2.8 >> name: e2e - ha - hadoop 2.8 >> name: e2e - sticky - hadoop 2.8 >> name: e2e - checkpoints - hadoop 2.8 >> name: e2e - container - hadoop 2.8 >> name: e2e - heavy - hadoop 2.8 >> name: e2e - tpcds - hadoop 2.8` >> We will compile entire project with profile `hadoop 2.8` 7 times, and >> every task will take about 25 minutes. >> @robert @chesnay Should we consider to compile once for multi cron task >> which have same profile in the new Azure Pipelines? >> >> Best, >> Leonard Xu >> >> > On Dec 9, 2019, at 11:57, Congxian Qiu <[hidden email]> wrote: >> > >> > +1 for migrating to Azure pipelines as this can have shorter build time, >> > and faster response. >> > >> > Best, >> > Congxian >> > >> > >> > Xiyuan Wang <[hidden email]> 于2019年12月9日周一 上午10:13写道: >> > >> >> Hi Robert, >> >> Thanks for bring up this topic. The 2 ARM machines(16cores) which I >> >> donated is just for POC test. We(Huawei) can donate more once moving to >> >> official Azure pipeline. :) >> >> >> >> Robert Metzger <[hidden email]> 于2019年12月6日周五 上午3:25写道: >> >> >> >>> Thanks for your comments Yun. >> >>> If there's strong support for idea 2, it would actually make my >> >>> life easier: the migration would be easier to do. >> >>> >> >>> I also noticed that the uploads to transfer.sh were broken, but this >> >> should >> >>> be fixed in the "rmetzger.flink" builds (coming from rmetzger/flink). 
>> The >> >>> builds in "flink-ci.flink" (coming from flink-ci/flink) might have >> >> troubles >> >>> with transfer.sh. >> >>> >> >>> >> >>> On Thu, Dec 5, 2019 at 5:50 PM Yun Tang <[hidden email]> wrote: >> >>> >> >>>> Hi Robert >> >>>> >> >>>> Really exciting to see this new more powerful CI tool to get rid of >> the >> >>> 50 >> >>>> minutes limit of traivs-CI free account. >> >>>> >> >>>> After reading the wiki, I support idea 2 of AZP-setup version-2. >> >>>> >> >>>> However, after I dig into some failing builds at >> >>>> https://dev.azure.com/rmetzger/Flink/_build , I found we cannot view >> >> the >> >>>> logs of some IT cases which would be uploaded by traivs_watchdog to >> >>>> transfer.sh previously. >> >>>> I think this feature is also easy to implement in AZP, right? >> >>>> >> >>>> Best >> >>>> Yun Tang >> >>>> >> >>>> On 12/6/19, 12:19 AM, "Robert Metzger" <[hidden email]> wrote: >> >>>> >> >>>> I've created a first draft of my plans in the wiki: >> >>>> >> >>>> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/%5Bpreview%5D+Azure+Pipelines >> >>>> . >> >>>> I'm looking forward to your comments. >> >>>> >> >>>> On Thu, Dec 5, 2019 at 12:37 PM Robert Metzger < >> >> [hidden email]> >> >>>> wrote: >> >>>> >> >>>>> Thank you all for the positive feedback. I will start putting >> >>>> together a >> >>>>> page in the wiki. >> >>>>> >> >>>>> @Jark: Azure Pipelines provides a free services, that is even >> >>> better >> >>>> than >> >>>>> what Travis provides for free: 10 parallel builds with 6 hours >> >>>> timeouts. >> >>>>> >> >>>>> @Chesnay: I will answer your questions in the yet-to-be-written >> >>>>> documentation in the wiki. >> >>>>> >> >>>>> >> >>>>> On Thu, Dec 5, 2019 at 11:58 AM Arvid Heise <[hidden email] >> >>> >> >>>> wrote: >> >>>>> >> >>>>>> +1 I had good experiences with Azure pipelines in the past. 
>> >>>>>> >> >>>>>> On Thu, Dec 5, 2019 at 11:35 AM Aljoscha Krettek < >> >>>> [hidden email]> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> +1 >> >>>>>>> >> >>>>>>> Thanks for the effort! The tooling seems to be quite a bit >> >> nicer >> >>>> and I >> >>>>>>> like that we can grow by adding more machines. >> >>>>>>> >> >>>>>>> Best, >> >>>>>>> Aljoscha >> >>>>>>> >> >>>>>>>> On 5. Dec 2019, at 03:18, Jark Wu <[hidden email]> wrote: >> >>>>>>>> >> >>>>>>>> +1 for Azure pipeline because it promises better >> >> performance. >> >>>>>>>> >> >>>>>>>> However, I have 2 concerns: >> >>>>>>>> >> >>>>>>>> 1) Travis provides personal free service for testing >> >> personal >> >>>>>> branches. >> >>>>>>>> Usually, contributors use this feature to test PoC or run >> >> CRON >> >>>> jobs >> >>>>>> for >> >>>>>>>> pull requests. >> >>>>>>>> Using local machine will cost a lot of time. Does AZP >> >>>> provides the >> >>>>>>> same >> >>>>>>>> free service? >> >>>>>>>> 2) Currently, we deployed a webhook [1] to receive Travis CI >> >>>> build >> >>>>>>>> notifications [2] and send to [hidden email] >> >> mailing >> >>>> list. >> >>>>>>>> We need to figure out a way how to send Azure build >> >> results >> >>>> to the >> >>>>>>>> mailing list. And this [3] might be the way to go. 
>> >>>>>>>> >> >>>>>>>> [hidden email] mailing list >> >>>>>>>> >> >>>>>>>> Best, >> >>>>>>>> Jark >> >>>>>>>> >> >>>>>>>> [1]: https://github.com/wuchong/flink-notification-bot >> >>>>>>>> [2]: >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://docs.travis-ci.com/user/notifications/#configuring-webhook-notifications >> >>>>>>>> [3]: >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://docs.microsoft.com/en-us/azure/devops/service-hooks/overview?view=azure-devops >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> On Wed, 4 Dec 2019 at 22:48, Jeff Zhang <[hidden email]> >> >>>> wrote: >> >>>>>>>> >> >>>>>>>>> +1 >> >>>>>>>>> >> >>>>>>>>> Till Rohrmann <[hidden email]> 于2019年12月4日周三 >> >>> 下午10:43写道: >> >>>>>>>>> >> >>>>>>>>>> +1 for moving to Azure pipelines as it promises better >> >>>> scalability >> >>>>>> and >> >>>>>>>>>> tooling. Looking forward to having faster builds and hence >> >>>> shorter >> >>>>>>>>> feedback >> >>>>>>>>>> cycles :-) >> >>>>>>>>>> >> >>>>>>>>>> Cheers, >> >>>>>>>>>> Till >> >>>>>>>>>> >> >>>>>>>>>> On Wed, Dec 4, 2019 at 1:24 PM Chesnay Schepler < >> >>>> [hidden email] >> >>>>>>> >> >>>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> @robert Can you expand how the azure setup interacts with >> >>>> CiBot? >> >>>>>> Do we >> >>>>>>>>>>> have to continue mirroring builds into flink-ci? How will >> >>> the >> >>>>>> cronjob >> >>>>>>>>>>> configuration work? We should have a general idea on how >> >> to >> >>>>>> implement >> >>>>>>>>>>> this before proceeding. >> >>>>>>>>>>> Additionally, moving /all /jobs into flink-ci requires >> >>>> setting up >> >>>>>> the >> >>>>>>>>>>> environment variables we have; can we set these up via >> >>> files >> >>>> or >> >>>>>> will >> >>>>>>> we >> >>>>>>>>>>> have to give all committers permissions for >> >> flink-ci/flink? 
>> >>>>>>>>>>> >> >>>>>>>>>>> On 04/12/2019 12:55, Chesnay Schepler wrote: >> >>>>>>>>>>>> From what I've seen so far Azure will provide us a >> >> better >> >>>>>> experience, >> >>>>>>>>>>>> so I'd say +1 for the transition as a whole. >> >>>>>>>>>>>> >> >>>>>>>>>>>> I'd delay merge at least until the feature branch is >> >> cut. >> >>>>>>>>>>>> Given the parental leave it may even make sense to only >> >>>> start >> >>>>>> merging >> >>>>>>>>>>>> in January afterwards, to reduce the total time taken >> >> for >> >>>> the >> >>>>>>>>>> transition. >> >>>>>>>>>>>> >> >>>>>>>>>>>> Reviews could maybe be made earlier, but I'm wondering >> >>>> whether >> >>>>>> anyone >> >>>>>>>>>>>> would even have the time at the moment to do so. >> >>>>>>>>>>>> >> >>>>>>>>>>>> On 04/12/2019 12:35, Kurt Young wrote: >> >>>>>>>>>>>>> Thanks Robert for driving this. There is another big >> >> pain >> >>>> point >> >>>>>> of >> >>>>>>>>>>>>> current >> >>>>>>>>>>>>> travis, >> >>>>>>>>>>>>> which is its cache mechanism will fail from time to >> >> time. >> >>>> Almost >> >>>>>>>>>>>>> around 50% >> >>>>>>>>>>>>> of >> >>>>>>>>>>>>> the build fails are caused by cache problem. I opened >> >>> this >> >>>> issue >> >>>>>> to >> >>>>>>>>>>>>> travis >> >>>>>>>>>>>>> but >> >>>>>>>>>>>>> got no response yet. So big +1 from my side. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Just one comment, it's close to 1.10 feature freeze and >> >>> we >> >>>> will >> >>>>>>>>> spend >> >>>>>>>>>>>>> some >> >>>>>>>>>>>>> time >> >>>>>>>>>>>>> to make tests stable before release. I wish this >> >>>> replacement can >> >>>>>>>>>> happen >> >>>>>>>>>>>>> after >> >>>>>>>>>>>>> 1.10 release, otherwise it will be a unstable factor >> >>> during >> >>>>>> release >> >>>>>>>>>>>>> testing. 
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>> On Wed, Dec 4, 2019 at 7:16 PM Zhu Zhu <[hidden email]> wrote:
>>>>>
>>>>>> Thanks Robert for the updates! And thanks a lot for all the efforts to investigate, experiment with and tune Azure Pipelines for Flink builds. Big +1 for it.
>>>>>>
>>>>>> It would be great if the community build infrastructure could be extended with custom machines, so that tests are not queued for long as the number of PRs grows daily.
>>>>>>
>>>>>> The increased timeout would also be very helpful. The 50 minute timeout for free Travis accounts is currently a pain, especially when we'd like to run e2e tests in our own Travis. I had to manually split the jobs to make it possible for them to pass.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Robert Metzger <[hidden email]> wrote on Wed, Dec 4, 2019 at 6:36 PM:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> as a follow up from our discussion on reducing the build time [1], I would like to propose migrating our build infrastructure to Azure Pipelines (away from Travis).
>>>>>>>
>>>>>>> I believe that we have reached the limits of what Travis can provide the Flink community, and I don't want the build system to limit or influence the project's growth.
>>>>>>>
>>>>>>> *Benefits:*
>>>>>>> 1. Free Travis accounts are limited to 5 parallel builds, with a timeout of 50 minutes. Azure offers *10 parallel builds with 300 minute timeouts* for free for open source projects.
>>>>>>> 2. Azure Pipelines allows us to *add custom build machines* to the pool of 10 free parallel builders. This will allow the Flink community to scale the available build capacity as the project grows. We are dependent on donations from supporting companies, but I believe that it is easier for companies to donate machines than money. Alibaba is willing to provide 10 machines, with 32 cores each, to the Flink project for this purpose. In addition, Xiyuan, who's working on adding ARM support for Flink, provided me with 2 ARM machines (16 cores each). I want to use the custom, more efficient build machines for building Flink's pull requests and master-pushes.
>>>>>>> 3. *Azure Pipelines is a more feature-rich tool*, allowing us, for example, to transfer intermediate build artifacts between pipeline stages. This will allow us to make the build more reliable (we are currently abusing the caching mechanism in Travis for this). It also has some basic analytics on test results / flaky tests etc.
>>>>>>>
>>>>>>> *Known problems:*
>>>>>>> - Initially, we might see different build instabilities than before
>>>>>>> - There's a higher maintenance overhead for the custom build machines (keeping them up to date etc.)
>>>>>>> - We cannot use the build status integration of AZP, because it requires write access to the repository's source. The foundation does not allow that [2]. I propose to extend flinkbot / the flink-ci repository.
>>>>>>>
>>>>>>> *Current Status:*
>>>>>>> - I'm able [3] to execute [4] the current custom build scripts on Azure Pipelines: This means that we will have one compile stage, and N testing jobs in the 2nd stage. Currently, we have N=10 testing jobs. The time from the start of a build until all tests have completed is 1h 22min.
>>>>>>> - I'm working on getting the nightly end to end tests to run on the new infrastructure.
>>>>>>> - I'm working on getting the build to work on our pool of custom machines as well
>>>>>>> - I'm working on setting up the full matrix of builds (different scala, hadoop etc. versions) for the nightlies
>>>>>>>
>>>>>>> *Next Steps:*
>>>>>>> - I propose to document the entire build system in the Flink Wiki
>>>>>>> - Once Azure can cover the same pull request tests as Travis, I would set it up to run in parallel (including Flinkbot posting links to Azure). I hope that this phase lasts for 1-2 weeks only, so that we do not have to maintain things concurrently. I will monitor the build stability closely, but would expect some support with debugging potential issues from the contributors.
>>>>>>> - Once there are no problems with the new setup, we remove the Travis setup.
>>>>>>> - Independently, I will work on triggering builds from master / release-branch pushes, as well as cron builds from the master branch ... all this will be described in the Wiki.
>>>>>>>
>>>>>>> *Timeline:*
>>>>>>> - Once I have the feeling that people are supportive of the idea, I will start documenting in the Wiki. The first pull requests should show up after a few more days. I will take one month of parental leave starting some time later in December, which will probably delay things a bit. I hope to have everything finished by end of January.
>>>>>>>
>>>>>>> I'm happy to hear your thoughts on this work. If nobody objects, I will start documenting the system and prepare everything for the migration.
>>>>>>>
>>>>>>> Best,
>>>>>>> Robert
>>>>>>>
>>>>>>> [1] https://lists.apache.org/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E
>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17030
>>>>>>> [3] https://github.com/rmetzger/flink/tree/azure_playground
>>>>>>> [4] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary
>
> --
> Best Regards
>
> Jeff Zhang