Hey devs,

I need your opinion on something: As part of our migration from Travis to Azure, I'm revisiting the build system of Flink. I currently see two different ways of proceeding, and I would like to know your opinion on the two options.

A) We build and test Flink in one "mvn clean verify" call on the CI system.
B) We migrate the two-stage build of one compile job and N test jobs to Azure.

Option A) is what we are currently running as part of testing the Azure-based system.

Pro/Cons for A)
+ for "apache/flink" pushes and pull requests, the big testing machines need 1:30 hours to complete (this might go up by a few minutes because the Python tests and some auxiliary tests are not executed yet)
+ our build will be easier to maintain and understand, because we rely on fewer scripts
- builds on Flink forks using the free Azure plan currently take 3:30 hours to complete

Pro/Cons for B)
+ builds on Flink forks using the free Azure plan take 1:20 hours
+ builds take 1:20 hours on the big testing machines
- maintenance and complexity of the build scripts
- the build times are a lot less predictable, because they depend on the availability of workers. For the free plan builds, they are currently fast, because the test stage has 10 jobs, and Azure offers 10 parallel workers. We currently only have a total of 8 big machines, so there will always be some queueing. In practice, for the "apache/flink" repo, build times will be less favorable, because of the scheduling.

In my opinion, the question is mostly: Are you okay to wait 3.5 hours for a build to finish on your private CI, in favor of a less complex build system?
Ideally, we'll be able to reduce these 3.5 hours by using a more modern build tool ("gradle") in the future.

I'm happy to hear your thoughts!

Best,
Robert

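The queueing point for option B can be illustrated with a small back-of-the-envelope sketch. It assumes, purely for illustration, that every test job takes about the same time, that each worker runs one job at a time, and that no other build competes for the workers; the function name `waves_needed` is made up for this example.

```python
import math


def waves_needed(test_jobs: int, workers: int) -> int:
    """Sequential rounds needed when each worker runs one job at a time.

    Simplifying assumption: all jobs are equally long and no other
    build is using the workers at the same time.
    """
    return math.ceil(test_jobs / workers)


# Numbers taken from the message above: the test stage has 10 jobs,
# the free Azure plan offers 10 parallel workers, and there are
# 8 big machines for the "apache/flink" repo.
print("free plan waves:", waves_needed(10, 10))
print("big machine waves:", waves_needed(10, 8))
```

Under these assumptions the free plan finishes the test stage in a single round, while 8 machines need a second round for the remaining two jobs, which is one way to read "there will always be some queueing".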
Hi Robert,

thank you very much for raising this issue and improving the build system.

For now, I'd like to stick to a lean solution (= option A).

While option B can greatly reduce build times, it also has the habit of clogging up the build machines. Just some arbitrary numbers, but it currently feels like B cuts down latency by half but also uses 10 machines for 30 minutes, decreasing the overall throughput significantly. Thus, when many folks want to see their commits tested, resources quickly run out and this in turn significantly increases latency.
I'd like to have some more predictable build times and sacrifice some latency for now.

It would be interesting to see if we could rearrange the project execution in Maven, such that fast projects are executed first. E2E tests should be executed last, which they somewhat are already, because of the project dependencies.

Of course, I'm very interested in improving the overall build experience by exploring alternatives to Maven.

Best,

Arvid

On Wed, Dec 11, 2019 at 2:32 PM Robert Metzger <[hidden email]> wrote:
> Hey devs,
> [...]

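Arvid's latency-versus-throughput argument can be made concrete with a rough calculation. The sketch below is only an idealized model: it assumes perfect packing of builds onto machines, ignores the compile stage of option B, and uses Arvid's self-described "arbitrary numbers" (10 machines for 30 minutes) next to Robert's 1:30 hours on one machine for option A. The function name `builds_per_hour` is hypothetical.

```python
def builds_per_hour(machines: int, machines_per_build: int,
                    minutes_per_build: float) -> float:
    """Upper bound on completed builds per hour for a machine pool.

    Idealized: machines are perfectly packed and never idle.
    """
    machine_minutes_per_build = machines_per_build * minutes_per_build
    return machines * 60 / machine_minutes_per_build


POOL = 8  # "a total of 8 big machines" from the first message

# Option A: one machine busy for 1:30 hours per build.
option_a = builds_per_hour(POOL, machines_per_build=1, minutes_per_build=90)

# Option B with the illustrative numbers: 10 machines for 30 minutes.
option_b = builds_per_hour(POOL, machines_per_build=10, minutes_per_build=30)

print(f"A: {option_a:.2f} builds/hour, B: {option_b:.2f} builds/hour")
```

With these inputs a single option-B build consumes 300 machine-minutes against 90 for option A, so the same pool completes fewer builds per hour even though each individual build returns sooner.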
Note that for B it's not strictly necessary to maintain the current number of splits; 2 might already be enough to bring contributor builds to a more reasonable level.

I don't think that a contributor build taking 3.5h is a viable option; people will start disregarding their own instance and just open a PR without having run the tests, which will naturally mean that PR quality will drop. Committers will probably start working around this and push branches into the flink repo for running tests; we have seen that in the past and currently see it for e2e tests.

This will increase the number of builds being run on the Flink machines by quite a bit, obviously affecting throughput and latency.

On 11/12/2019 14:59, Arvid Heise wrote:
> Hi Robert,
> [...]

Some comments on Chesnay's message:
- Changing the number of splits will not reduce the complexity.
- One can also use the Flink build machines by opening a PR to the "flink-ci/flink" repo, no need to open crappy PRs :)
- On the number of builds being run: We currently use 4 out of 10 machines offered by Alibaba, and we are not yet hitting any limits. In addition to that, another big cloud provider has reached out to us, offering build capacity.

But generally, I agree that solely relying on the build infrastructure of Flink is not a good option. The free Azure builds should provide a reasonable experience.

On Wed, Dec 11, 2019 at 3:22 PM Chesnay Schepler <[hidden email]> wrote:
> Note that for B it's not strictly necessary to maintain the current
> number of splits;
> [...]

It depends on how you define "split"; if you split by module (as we do currently) you have the same complexity as we have right now: caching of artifacts and brittle definition of splits.

But there are other ways to split builds, for example into unit and integration tests; one could also add end-to-end tests to this list.
At that point we're basically talking about multiple parallel builds that are fully independent.
Let's also remember that caching of the build artifact is only useful when the compile times are large enough to warrant it; if we only go with 2 splits, in the grand scheme of things the caching wouldn't even be required.
We added the caching to Travis since at 5+ builds (and the guarantee for this number to go up) the compilation time was a much larger factor.

As for the current split setup we have (as in by modules), it isn't just about faster feedback times; the splits can also be used to isolate components from each other.
I know that quite a few people appreciate the kafka/python module being in its own split, for example.

On 11/12/2019 16:44, Robert Metzger wrote:
> Some comments on Chesnay's message:
> [...]

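Chesnay's remark that artifact caching only pays off with many splits can be sketched numerically. Everything in the example is an assumption made for illustration: the 20-minute compile time, the 5-minute per-split overhead for uploading and downloading the cached artifact, and the function name `total_compile_minutes` do not come from the thread.

```python
def total_compile_minutes(splits: int, compile_minutes: float,
                          cached: bool,
                          cache_overhead_minutes: float = 0.0) -> float:
    """Machine time spent on compilation across all splits.

    Without caching every split compiles on its own; with caching
    the code is compiled once and each split pays a transfer overhead.
    """
    if cached:
        return compile_minutes + splits * cache_overhead_minutes
    return splits * compile_minutes


COMPILE = 20.0   # hypothetical compile time in minutes
OVERHEAD = 5.0   # hypothetical cache transfer overhead per split

for splits in (2, 10):
    uncached = total_compile_minutes(splits, COMPILE, cached=False)
    cached = total_compile_minutes(splits, COMPILE, cached=True,
                                   cache_overhead_minutes=OVERHEAD)
    print(f"{splits} splits: saves {uncached - cached:.0f} machine-minutes")
```

With these made-up numbers the saving from caching is small at 2 splits and much larger at 10, which matches the reasoning that the extra script complexity is hard to justify for a two-way split.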
It's a tough question. On the one hand, I like less complexity in the build system. But one of the most important things for developers is fast iteration cycles.

So I would prefer the solution that keeps the iteration time low.

Best,
Aljoscha

> On 13. Dec 2019, at 14:41, Chesnay Schepler <[hidden email]> wrote:
>
> It depends on how to define "split";
> [...]

Thanks for starting this discussion, Robert.

I can see benefits for both options, as already mentioned in this thread. However, given that we already have the profile splits and that they would considerably decrease feedback times for developers on their personal Azure accounts, I'd be in favour of option B for the time being. If we see that we can keep the build time for the local Azure setups down differently, then one could start simplifying the build.

Cheers,
Till

On Fri, Dec 13, 2019 at 2:42 PM Aljoscha Krettek <[hidden email]> wrote:
> It's a tough question.
> [...]

Thanks for your feedback.
I will then go for option B.

On Fri, Dec 13, 2019 at 2:51 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for starting this discussion, Robert.
>
> I can see benefits for both options, as already mentioned in this thread.
> However, given that we already have the profile splits and that it would
> considerably decrease feedback times for developers on their personal Azure
> accounts, I'd be in favour of option B for the time being. If we see that
> we can keep the build time for the local Azure setups down differently,
> then one could start simplifying the build.
>
> Cheers,
> Till
>
> On Fri, Dec 13, 2019 at 2:42 PM Aljoscha Krettek <[hidden email]> wrote:
>
>> It’s a tough question. On the one hand, I like less complexity in the
>> build system. But one of the most important things for developers is fast
>> iteration cycles.
>>
>> So I would prefer the solution that keeps the iteration time low.
>>
>> Best,
>> Aljoscha
>>
>>> On 13. Dec 2019, at 14:41, Chesnay Schepler <[hidden email]> wrote:
>>>
>>> It depends on how you define "split"; if you split by module (as we do
>>> currently) you have the same complexity as we have right now:
>>> caching of artifacts and brittle definition of splits.
>>>
>>> But there are other ways to split builds, for example into unit and
>>> integration tests; we could also add end-to-end tests to this list.
>>> At that point we're basically talking about multiple parallel builds
>>> that are fully independent.
>>> Let's also remember that caching of the build artifact is only useful
>>> when the compile times are large enough to warrant it;
>>> if we only go with 2 splits, in the grand scheme of things the caching
>>> wouldn't even be required.
>>> We added the caching to Travis since at 5+ builds (and the guarantee
>>> for this number to go up) the compilation time was a much larger factor.
>>>
>>> As for the current split setup we have (as in by modules), it isn't
>>> just about faster feedback times; the splits can also be used to isolate
>>> components from each other.
>>> I know that quite a few people appreciate the kafka/python module
>>> being in its own split, for example.
>>>
>>> On 11/12/2019 16:44, Robert Metzger wrote:
>>>> Some comments on Chesnay's message:
>>>> - Changing the number of splits will not reduce the complexity.
>>>> - One can also use the Flink build machines by opening a PR to the
>>>> "flink-ci/flink" repo, no need to open crappy PRs :)
>>>> - On the number of builds being run: We currently use 4 out of 10 machines
>>>> offered by Alibaba, and we are not yet hitting any limits. In addition to
>>>> that, another big cloud provider has reached out to us, offering build
>>>> capacity.
>>>>
>>>> But generally, I agree that solely relying on the build infrastructure of
>>>> Flink is not a good option. The free Azure builds should provide a
>>>> reasonable experience.
>>>>
>>>> On Wed, Dec 11, 2019 at 3:22 PM Chesnay Schepler <[hidden email]> wrote:
>>>>
>>>>> Note that for B it's not strictly necessary to maintain the current
>>>>> number of splits; 2 might already be enough to bring contributor builds
>>>>> to a more reasonable level.
>>>>>
>>>>> I don't think that a contributor build taking 3.5h is a viable option;
>>>>> people will start disregarding their own instance and just open a PR
>>>>> without having run the tests, which will naturally mean that PR quality
>>>>> will drop. Committers will probably start working around this and push
>>>>> branches into the flink repo for running tests; we have seen that in the
>>>>> past and see this currently for e2e tests.
>>>>>
>>>>> This will increase the number of builds being run on the Flink machines
>>>>> by quite a bit, obviously affecting throughput and latency.
|