How long do we need to run all e2e tests? They are not included in the 3.5
hours, I assume.

Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger <[hidden email]> wrote:

> Yes, we can ensure the same (or better) experience for contributors.
>
> On the powerful machines, builds finish in 1.5 hours (without any caching
> enabled).
>
> Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours per
> build for open source projects. Flink needs 3.5 hours on that infra (not
> parallelized at all, no caching). These free machines are very similar to
> those of Travis, so I expect no build time regressions if we set it up
> similarly.
>
> On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <[hidden email]> wrote:
>
>> Will using more powerful machines for the project make it more difficult
>> to ensure that contributor builds are still running in a reasonable time?
>>
>> As an example of this happening on Travis: contributors currently cannot
>> run all e2e tests since they time out, but on apache we have a larger
>> timeout.
>>
>> On 03/09/2019 18:57, Robert Metzger wrote:
>>> Hi all,
>>>
>>> I wanted to give a short update on this:
>>> - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
>>> working on making all modules compile and test with Gradle. We've also
>>> identified some problematic areas (shading being the most obvious one)
>>> which we will analyse as part of the PoC.
>>> The goal is to see how much Gradle helps to parallelise our build, and
>>> to avoid duplicate work (incremental builds).
>>>
>>> - I am working on setting up a Flink testing infrastructure based on
>>> Azure Pipelines, using more powerful hardware. Alibaba kindly provided
>>> me with two 32 core machines (temporarily), and another company reached
>>> out to me privately, looking into options for cheap, fast machines :)
>>> If nobody in the community disagrees, I am going to set up Azure
>>> Pipelines with our apache/flink GitHub as a build infrastructure that
>>> exists next to Flinkbot and flink-ci. I would like to make sure that
>>> Azure Pipelines is equally or even more reliable than Travis, and I
>>> want to see what the required maintenance work is.
>>> On top of that, Azure Pipelines is a very feature-rich tool with a lot
>>> of nice options for us to improve the build experience (statistics
>>> about tests (flaky tests etc.), nice docker support, plenty of free
>>> build resources for open source projects, ...)
>>>
>>> Best,
>>> Robert
>>>
>>> On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <[hidden email]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have summarized all arguments mentioned so far + some additional
>>>> research into a Wiki page here:
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
>>>>
>>>> I'm happy to hear further comments on my summary! I'm pretty sure we
>>>> can find more pros and cons for the different options.
>>>>
>>>> My opinion after looking at the options:
>>>>
>>>> - Flink relies on an outdated build tool (Maven), while a good
>>>> alternative is well-established (gradle), and will likely provide a
>>>> much better CI and local build experience through incremental builds
>>>> and cached intermediates.
>>>> Scripting around Maven, or splitting modules / test execution /
>>>> repositories, won't solve this problem. We should rather spend the
>>>> effort on migrating to a modern build tool which will provide us
>>>> benefits in the long run.
>>>> - Flink relies on a fairly slow build service (Travis CI), while
>>>> simply putting more money onto the problem could cut the build time
>>>> at least in half.
>>>> We should consider using a build service that provides bigger
>>>> machines to solve our build time problem.
>>>>
>>>> My opinion is based on many assumptions (gradle is actually as fast
>>>> as promised (I haven't used it before), we can build Flink with
>>>> gradle, we find sponsors for bigger build machines) that we need to
>>>> test first through PoCs.
>>>>
>>>> Best,
>>>> Robert
>>>>
>>>> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <[hidden email]> wrote:
>>>>
>>>>> I did a quick test: a normal "mvn clean install -DskipTests
>>>>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
>>>>> machine takes about 14 minutes. After removing all mentions of
>>>>> maven-shade-plugin the build time goes down to roughly 11.5 minutes.
>>>>> (Obviously the resulting Flink won't work, because some expected
>>>>> stuff is not packaged, and most of the end-to-end tests use the shade
>>>>> plugin to package the jars for testing.)
>>>>>
>>>>> Aljoscha
>>>>>
>>>>>> On 18. Aug 2019, at 19:52, Robert Metzger <[hidden email]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I wanted to understand the impact of the hardware we are using for
>>>>>> running our tests. Each travis worker has 2 virtual cores and 7.5 GB
>>>>>> memory [1]. They are using Google Cloud Compute Engine
>>>>>> *n1-standard-2* instances.
>>>>>> Running a full "mvn clean verify" takes *03:32 h* on such a machine
>>>>>> type. Running the same workload on a 32 virtual cores, 64 GB machine
>>>>>> takes *1:21 h*.
>>>>>>
>>>>>> What is interesting are the per-module build time differences.
>>>>>> Modules which are parallelizing tests well greatly benefit from the
>>>>>> additional cores:
>>>>>> "flink-tests" 36:51 min vs 4:33 min
>>>>>> "flink-runtime" 23:41 min vs 3:47 min
>>>>>> "flink-table-planner" 15:54 min vs 3:13 min
>>>>>>
>>>>>> On the other hand, we have modules which are not parallel at all:
>>>>>> "flink-connector-kafka": 16:32 min vs 15:19 min
>>>>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
>>>>>> Also, the checkstyle plugin is not scaling at all.
>>>>>>
>>>>>> Chesnay reported some significant speedups by reusing forks.
>>>>>> I don't know how much effort it would be to make the Kafka tests
>>>>>> parallelizable. In total, they currently use 30 minutes on the big
>>>>>> machine (while 31 CPUs are idling :) )
>>>>>>
>>>>>> Let me know what you think about these results. If the community is
>>>>>> generally interested in investigating further in that direction, I
>>>>>> could look into software to orchestrate this, as well as sponsors
>>>>>> for such an infrastructure.
>>>>>>
>>>>>> [1] https://docs.travis-ci.com/user/reference/overview/
>>>>>>
>>>>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]> wrote:
>>>>>>> @Aljoscha Shading takes a few minutes for a full build; you can see
>>>>>>> this quite easily by looking at the compile step in the misc
>>>>>>> profile <https://api.travis-ci.org/v3/job/572560060/log.txt>; all
>>>>>>> modules that take longer than a fraction of a second are usually
>>>>>>> caused by shading lots of classes. Note that I cannot tell you how
>>>>>>> much of this is spent on relocations, and how much on writing the
>>>>>>> jar.
>>>>>>>
>>>>>>> Personally, I'd very much like us to move all shading to
>>>>>>> flink-shaded; this would finally allow us to use newer maven
>>>>>>> versions without needing cumbersome workarounds for flink-dist.
>>>>>>> However, this isn't a trivial affair in some cases; IIRC calcite
>>>>>>> could be difficult to handle.
>>>>>>>
>>>>>>> On another note, this would also simplify switching the main repo
>>>>>>> to another build system, since you would no longer have to deal
>>>>>>> with relocations, just packaging + merging NOTICE files.
>>>>>>>
>>>>>>> @BowenLi I disagree, flink-shaded does not include any tests, API
>>>>>>> compatibility checks, checkstyle, layered shading (e.g.,
>>>>>>> flink-runtime and flink-dist, where both relocate dependencies and
>>>>>>> one is bundled by the other), and, most importantly, CI (and
>>>>>>> really, without CI being covered in a PoC there's nothing to
>>>>>>> discuss).
>>>>>>>
>>>>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>>>>>>> Speaking of flink-shaded, do we have any idea what the impact of
>>>>>>>> shading is on the build time? We could get rid of shading
>>>>>>>> completely in the Flink main repository by moving everything that
>>>>>>>> we shade to flink-shaded.
>>>>>>>>
>>>>>>>> Aljoscha
>>>>>>>>
>>>>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> +1 to Till's points on #2 and #5, especially the potential
>>>>>>>>> non-disruptive, gradual migration approach if we decide to go
>>>>>>>>> that route.
>>>>>>>>>
>>>>>>>>> To add on, I want to point out that we can actually start with
>>>>>>>>> the flink-shaded project [1], which is a perfect candidate for a
>>>>>>>>> PoC. It's of much smaller size, totally isolated from and not
>>>>>>>>> interfering with the flink project [2], and it actually covers
>>>>>>>>> most of our practical feature requirements for a build tool -
>>>>>>>>> all making it an ideal experimental field.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/flink-shaded
>>>>>>>>> [2] https://github.com/apache/flink
>>>>>>>>>
>>>>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]> wrote:
>>>>>>>>>> For the sake of keeping the discussion focused and not
>>>>>>>>>> cluttering the discussion thread, I would suggest splitting the
>>>>>>>>>> detailed reporting for reusing JVMs into a separate thread and
>>>>>>>>>> cross-linking it from here.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Update:
>>>>>>>>>>>
>>>>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork
>>>>>>>>>>> reuse right away, while flink-tests has the potential for huge
>>>>>>>>>>> savings, but we have to figure out some issues first.
>>>>>>>>>>>
>>>>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>>>>>>>>
>>>>>>>>>>> 4/8 profiles failed.
>>>>>>>>>>>
>>>>>>>>>>> No speedup in libraries, python, blink_planner; 7 minutes
>>>>>>>>>>> saved in libraries (table-planner).
>>>>>>>>>>>
>>>>>>>>>>> The kafka and connectors profiles both fail in kafka tests due
>>>>>>>>>>> to producer leaks, and no speedup could be confirmed so far:
>>>>>>>>>>>
>>>>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>>>>>>>>>> kafka-producer-network-thread | producer-239
>>>>>>>>>>>     at org.junit.Assert.fail(Assert.java:88)
>>>>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>>>>>>>>>     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>>>>>>>>>
>>>>>>>>>>> The tests profile failed due to various errors in migration
>>>>>>>>>>> tests:
>>>>>>>>>>>
>>>>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected
>>>>>>>>>>> accumulator results within time limit.
>>>>>>>>>>>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>>>>>>>>>
>>>>>>>>>>> *However*, a normal tests run takes 40 minutes, while this one
>>>>>>>>>>> above failed after 19 minutes and is only missing the
>>>>>>>>>>> migration tests (which currently need 6-7 minutes). So we
>>>>>>>>>>> could save somewhere between 15 to 20 minutes here.
>>>>>>>>>>>
>>>>>>>>>>> Finally, the misc profile fails in YARN:
>>>>>>>>>>>
>>>>>>>>>>> java.lang.AssertionError
>>>>>>>>>>>     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>>>>>>>>>
>>>>>>>>>>> No significant speedup could be observed in other modules; for
>>>>>>>>>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>>>>>>>>>>
>>>>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>>>>>>>>>> There appears to be general agreement that 1) should be
>>>>>>>>>>>> looked into; I've set up a branch with fork reuse enabled for
>>>>>>>>>>>> all tests and will report back the results.
>>>>>>>>>>>>
>>>>>>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> improving our build times is a hot topic at the moment, so
>>>>>>>>>>>>> let's discuss the different ways they could be reduced.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Current state:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First up, let's look at some numbers:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1 full build currently consumes 5h of build time total
>>>>>>>>>>>>> ("total time"), and in the ideal case takes about 1h20m
>>>>>>>>>>>>> ("run time") to complete from start to finish. The run time
>>>>>>>>>>>>> may of course fluctuate depending on the current Travis
>>>>>>>>>>>>> load. This applies both to builds on the Apache and
>>>>>>>>>>>>> flink-ci Travis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the time of writing, the current queue time for PR jobs
>>>>>>>>>>>>> (reminder: running on flink-ci) is about 30 minutes (which
>>>>>>>>>>>>> basically means that we are processing builds at the rate
>>>>>>>>>>>>> that they come in); however, we are in an admittedly quiet
>>>>>>>>>>>>> period right now.
>>>>>>>>>>>>> 2 weeks ago the queue times on flink-ci peaked at around
>>>>>>>>>>>>> 5-6h as everyone was scrambling to get their changes merged
>>>>>>>>>>>>> in time for the feature freeze.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (Note: Recently, optimizations were added to ci-bot where
>>>>>>>>>>>>> pending builds are canceled if a new commit was pushed to
>>>>>>>>>>>>> the PR or the PR was closed, which should prove especially
>>>>>>>>>>>>> useful during the rush hours we see before feature
>>>>>>>>>>>>> freezes.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Past approaches
>>>>>>>>>>>>>
>>>>>>>>>>>>> Over the years we have done rather few things to improve
>>>>>>>>>>>>> this situation (hence our current predicament).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Beyond the sporadic speedup of some tests, the only notable
>>>>>>>>>>>>> reduction in total build times was the introduction of cron
>>>>>>>>>>>>> jobs, which consolidated the per-commit matrix from 4
>>>>>>>>>>>>> configurations (different scala/hadoop versions) to 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The separation into multiple build profiles was only a
>>>>>>>>>>>>> work-around for the 50m limit on Travis. Running tests in
>>>>>>>>>>>>> parallel has the obvious potential of reducing run time,
>>>>>>>>>>>>> but we're currently hitting a hard limit since a few
>>>>>>>>>>>>> modules (flink-tests, flink-runtime,
>>>>>>>>>>>>> flink-table-planner-blink) are so loaded with tests that
>>>>>>>>>>>>> they nearly consume an entire profile by themselves (and
>>>>>>>>>>>>> thus no further splitting is possible).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The rework that introduced stages did not, at the time of
>>>>>>>>>>>>> introduction, provide a speedup either, although this
>>>>>>>>>>>>> changed slightly once more profiles were added and some
>>>>>>>>>>>>> optimizations to the caching were made.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Very recently we modified the surefire-plugin
>>>>>>>>>>>>> configuration for flink-table-planner-blink to reuse JVM
>>>>>>>>>>>>> forks for IT cases, providing a significant speedup (18
>>>>>>>>>>>>> minutes!). So far we have not seen any negative
>>>>>>>>>>>>> consequences.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Suggestions
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a list of /all/ suggestions for reducing run/total
>>>>>>>>>>>>> times that I have seen recently (in other words, they
>>>>>>>>>>>>> aren't necessarily mine, nor may I agree with all of them).
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>>>>>>>>>     * We've seen significant speedups in the blink planner,
>>>>>>>>>>>>>       and this should be applicable to all modules.
>>>>>>>>>>>>>       However, I presume there's a reason why we disabled
>>>>>>>>>>>>>       JVM reuse (information on this would be appreciated;
>>>>>>>>>>>>>       a configuration sketch follows below this thread).
>>>>>>>>>>>>> 2. Custom differential build scripts
>>>>>>>>>>>>>     * Set up custom scripts for determining which modules
>>>>>>>>>>>>>       might be affected by a change, and manipulate the
>>>>>>>>>>>>>       splits accordingly (a script sketch follows below
>>>>>>>>>>>>>       this thread). This approach is conceptually quite
>>>>>>>>>>>>>       straight-forward, but has limits since it has to be
>>>>>>>>>>>>>       pessimistic; i.e. a change in flink-core _must_
>>>>>>>>>>>>>       result in testing all modules.
>>>>>>>>>>>>> 3. Only run smoke tests when a PR is opened, run heavy
>>>>>>>>>>>>>    tests on demand.
>>>>>>>>>>>>>     * With the introduction of the ci-bot we now have
>>>>>>>>>>>>>       significantly more options on how to handle PR
>>>>>>>>>>>>>       builds. One option could be to only run basic tests
>>>>>>>>>>>>>       when the PR is created (which may be only modified
>>>>>>>>>>>>>       modules, or all unit tests, or another low-cost
>>>>>>>>>>>>>       scheme), and then have a committer trigger other
>>>>>>>>>>>>>       builds (full test run, e2e tests, etc...) on demand.
>>>>>>>>>>>>> 4. Move more tests into cron builds
>>>>>>>>>>>>>     * The budget version of 3); move certain tests that are
>>>>>>>>>>>>>       either expensive (like some runtime tests that take
>>>>>>>>>>>>>       minutes) or in rarely modified modules (like gelly)
>>>>>>>>>>>>>       into cron jobs.
>>>>>>>>>>>>> 5. Gradle
>>>>>>>>>>>>>     * Gradle was brought up a few times for its built-in
>>>>>>>>>>>>>       support for differential builds; basically providing
>>>>>>>>>>>>>       2) without the overhead of maintaining additional
>>>>>>>>>>>>>       scripts.
>>>>>>>>>>>>>     * To date no PoC was provided that shows it working in
>>>>>>>>>>>>>       our CI environment (i.e., handling splits & caching
>>>>>>>>>>>>>       etc).
>>>>>>>>>>>>>     * This is the most disruptive change by a fair margin,
>>>>>>>>>>>>>       as it would affect the entire project, developers and
>>>>>>>>>>>>>       potentially users (if they build from source).
>>>>>>>>>>>>> 6. CI service
>>>>>>>>>>>>>     * Our current artifact caching setup on Travis is
>>>>>>>>>>>>>       basically a hack; we're abusing the Travis cache,
>>>>>>>>>>>>>       which is meant for long-term caching, to ship build
>>>>>>>>>>>>>       artifacts across jobs. It's brittle at times due to
>>>>>>>>>>>>>       timing/visibility issues, and on branches the cleanup
>>>>>>>>>>>>>       processes can interfere with running builds. It is
>>>>>>>>>>>>>       also not as effective as it could be.
>>>>>>>>>>>>>     * There are CI services that provide build artifact
>>>>>>>>>>>>>       caching out of the box, which could be useful for us.
>>>>>>>>>>>>>     * To date, no PoC for using another CI service has been
>>>>>>>>>>>>>       provided.
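For reference, the JVM fork reuse discussed in suggestion 1) is a setting of
the Maven Surefire plugin. A minimal sketch of such a configuration; the
forkCount value is illustrative, not Flink's actual setting:

    <!-- pom.xml of a test-heavy module (sketch). -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- Run tests in several parallel JVMs, e.g. one per CPU core
             ("1C" is a placeholder value for illustration). -->
        <forkCount>1C</forkCount>
        <!-- Keep each forked JVM alive across test classes instead of
             spawning a fresh JVM per class; this is the change behind the
             ~18 minute speedup in the blink planner mentioned above. -->
        <reuseForks>true</reuseForks>
      </configuration>
    </plugin>

Reusing forks amortizes JVM startup and JIT warm-up across many test
classes; the trade-off is that state can leak between tests in the same JVM,
which is presumably the reason reuse was originally disabled.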
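And a sketch of the differential build script from suggestion 2), assuming a
layout where top-level flink-* directories are Maven modules; the module
detection and branch names are illustrative only:

    # Build only the modules touched by a change, plus their dependents.
    # A change to the root pom matches no module and falls back to a full
    # build; a change in flink-core pulls in nearly all modules via -amd,
    # which is the pessimism noted above.
    CHANGED_MODULES=$(git diff --name-only origin/master...HEAD \
        | sed -n 's#^\(flink-[^/]*\)/.*#\1#p' \
        | sort -u \
        | paste -sd, -)

    if [ -z "${CHANGED_MODULES}" ]; then
        mvn verify                                  # test everything
    else
        # -pl: the changed modules; -amd: also build their dependents
        mvn verify -pl "${CHANGED_MODULES}" -amd
    fi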
e2e tests on Travis add another 4-5 hours, but we never optimized these
to make use of the cached Flink artifact.

On 04/09/2019 13:26, Till Rohrmann wrote:
> How long do we need to run all e2e tests? They are not included in the
> 3.5 hours, I assume.
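To illustrate the point about the cached artifact, a sketch of what an e2e
wrapper could do; the cache location, tarball name, and entry point are
placeholders, not the actual flink-ci setup:

    # Build flink-dist only if no cached distribution exists (sketch).
    CACHE_DIR="${HOME}/flink-build-cache"     # placeholder location
    DIST_TGZ="${CACHE_DIR}/flink-dist.tgz"

    if [ -f "${DIST_TGZ}" ]; then
        # Reuse the distribution produced by an earlier compile stage.
        mkdir -p build-target
        tar -xzf "${DIST_TGZ}" -C build-target
    else
        # Fall back to a full (slow) build, then populate the cache.
        mvn clean install -DskipTests -Drat.skip=true
        mkdir -p "${CACHE_DIR}"
        tar -czf "${DIST_TGZ}" -C build-target .
    fi

    # ...e2e test runner invocation would follow here.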
I assume you already have a working (and verified) azure setup?
Once we're running things on azure on the apache repo, people will
invariably use that as a source of truth, because fancy check marks will yet
again appear on commits. Hence I'm wary of running experiments here; I would
prefer if we only activate it once things are confirmed to be working.

For observation purposes, we could also add it to flink-ci, with
notifications to people who are interested in this experiment.
This wouldn't impact CiBot.

On 03/09/2019 18:57, Robert Metzger wrote:
> - I am working on setting up a Flink testing infrastructure based on
> Azure Pipelines, using more powerful hardware. [...]
> If nobody in the community disagrees, I am going to set up Azure
> Pipelines with our apache/flink GitHub as a build infrastructure that
> exists next to Flinkbot and flink-ci.
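For context, those "check marks" are commit statuses that a CI system pushes
to GitHub, for example via the statuses API; this is also where the write
access discussed below comes in. A sketch of the kind of call a CI service
makes, with placeholder repo, token, and URL values:

    # Report a commit status (the "check mark") for a commit SHA.
    # GITHUB_TOKEN, COMMIT_SHA and the target_url are placeholders.
    curl -X POST \
      -H "Authorization: token ${GITHUB_TOKEN}" \
      -H "Content-Type: application/json" \
      -d '{
            "state": "success",
            "context": "azure-pipelines",
            "description": "Build finished",
            "target_url": "https://dev.azure.com/example/build/1"
          }' \
      "https://api.github.com/repos/apache/flink/statuses/${COMMIT_SHA}"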
I do have a working Azure setup, yes. E2E tests are not included in the
3.5hrs. Yesterday, I became aware of a major blocker with Azure pipelines: Apache Infra does not allow it to be integrated with Apache GitHub repositories, because it requires write access (for a simple usability feature) [1]. This means that we "have" to use CiBot for the time being. I've also reached out to Microsoft to see if they can do anything about it. +1 For setting it up with CiBot immediately. [1]https://issues.apache.org/jira/browse/INFRA-17030 On Thu, Sep 5, 2019 at 11:04 AM Chesnay Schepler <[hidden email]> wrote: > I assume you already have a working (and verified) azure setup? > > Once we're running things on azure on the apache repo people will > invariably use that as a source of truth because fancy check marks will > yet again appear on commits. Hence I'm wary of running experiments here; > I would prefer if we only activate it once things are confirmed to be > working. > > For observation purposes, we could also add it to flink-ci with > notifications to people who are interested in this experiment. > This wouldn't impact CiBot. > > On 03/09/2019 18:57, Robert Metzger wrote: > > Hi all, > > > > I wanted to give a short update on this: > > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently > > working on making all modules compile and test with Gradle. We've also > > identified some problematic areas (shading being the most obvious one) > > which we will analyse as part of the PoC. > > The goal is to see how much Gradle helps to parallelise our build, and to > > avoid duplicate work (incremental builds). > > > > - I am working on setting up a Flink testing infrastructure based on > Azure > > Pipelines, using more powerful hardware. Alibaba kindly provided me with > > two 32 core machines (temporarily), and another company reached out to > > privately, looking into options for cheap, fast machines :) > > If nobody in the community disagrees, I am going to set up Azure > Pipelines > > with our apache/flink GitHub as a build infrastructure that exists next > to > > Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is > > equally or even more reliable than Travis, and I want to see what the > > required maintenance work is. > > On top of that, Azure Pipelines is a very feature-rich tool with a lot of > > nice options for us to improve the build experience (statistics about > tests > > (flaky tests etc.), nice docker support, plenty of free build resources > for > > open source projects, ...) > > > > Best, > > Robert > > > > > > > > > > > > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <[hidden email]> > wrote: > > > >> Hi all, > >> > >> I have summarized all arguments mentioned so far + some additional > >> research into a Wiki page here: > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279 > >> > >> I'm happy to hear further comments on my summary! I'm pretty sure we can > >> find more pro's and con's for the different options. > >> > >> My opinion after looking at the options: > >> > >> - Flink relies on an outdated build tool (Maven), while a good > >> alternative is well-established (gradle), and will likely provide a > much > >> better CI and local build experience through incremental build and > cached > >> intermediates. > >> Scripting around Maven, or splitting modules / test execution / > >> repositories won't solve this problem. We should rather spend the > effort in > >> migrating to a modern build tool which will provide us benefits in > the long > >> run. 
>
> [...]
>
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]> wrote:
> > @Aljoscha Shading takes a few minutes for a full build; you can see
> > this quite easily by looking at the compile step in the misc profile
> > <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that
> > take longer than a fraction of a second are usually caused by shading
> > lots of classes. Note that I cannot tell you how much of this is spent
> > on relocations, and how much on writing the jar.
> >
> > Personally, I'd very much like us to move all shading to flink-shaded;
> > this would finally allow us to use newer Maven versions without needing
> > cumbersome workarounds for flink-dist. However, this isn't a trivial
> > affair in some cases; IIRC Calcite could be difficult to handle.
> >
> > On another note, this would also simplify switching the main repo to
> > another build system, since you would no longer have to deal with
> > relocations, just packaging + merging NOTICE files.
> >
> > @BowenLi I disagree; flink-shaded does not include any tests, API
> > compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> > and flink-dist, where both relocate dependencies and one is bundled by
> > the other), and, most importantly, CI (and really, without CI being
> > covered in a PoC there's nothing to discuss).
> >
> > On 16/08/2019 15:13, Aljoscha Krettek wrote:
> > > Speaking of flink-shaded, do we have any idea what the impact of
> > > shading is on the build time? We could get rid of shading completely
> > > in the Flink main repository by moving everything that we shade to
> > > flink-shaded.
> > >
> > > On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
> > > > +1 to Till's points on #2 and #5, especially the potential
> > > > non-disruptive, gradual migration approach if we decide to go that
> > > > route.
> > > >
> > > > To add on, I want to point out that we can actually start with the
> > > > flink-shaded project [1], which is a perfect candidate for a PoC.
> > > > It's of much smaller size, totally isolated from and not interfering
> > > > with the flink project [2], and it actually covers most of our
> > > > practical feature requirements for a build tool - all making it an
> > > > ideal experimental field.
> > > >
> > > > [1] https://github.com/apache/flink-shaded
> > > > [2] https://github.com/apache/flink
> > > >
> > > > On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]> wrote:
> > > > > For the sake of keeping the discussion focused and not cluttering
> > > > > the discussion thread, I would suggest splitting the detailed
> > > > > reporting on reusing JVMs into a separate thread and cross-linking
> > > > > it from here.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]> wrote:
> > > > > > Update:
> > > > > >
> > > > > > TL;DR: table-planner is a good candidate for enabling fork reuse
> > > > > > right away, while flink-tests has the potential for huge savings,
> > > > > > but we have to figure out some issues first.
> > > > > >
> > > > > > Build link: https://travis-ci.org/zentol/flink/builds/572659220
> > > > > >
> > > > > > 4/8 profiles failed.
> > > > > >
> > > > > > No speedup in libraries, python, blink_planner; 7 minutes saved
> > > > > > in libraries (table-planner).
> > > > > >
> > > > > > The kafka and connectors profiles both fail in kafka tests due
> > > > > > to producer leaks, and no speedup could be confirmed so far:
> > > > > > java.lang.AssertionError: Detected producer leak. Thread name:
> > > > > > kafka-producer-network-thread | producer-239
> > > > > >     at org.junit.Assert.fail(Assert.java:88)
> > > > > >     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > > > > >     at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> > > > > >
> > > > > > The tests profile failed due to various errors in migration tests:
> > > > > >
> > > > > > junit.framework.AssertionFailedError: Did not see the expected
> > > > > > accumulator results within time limit.
> > > > > >     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> > > > > >
> > > > > > *However*, a normal tests run takes 40 minutes, while the one
> > > > > > above failed after 19 minutes and is only missing the migration
> > > > > > tests (which currently need 6-7 minutes). So we could save
> > > > > > somewhere between 15 and 20 minutes here.
> > > > > >
> > > > > > Finally, the misc profile fails in YARN:
> > > > > >
> > > > > > java.lang.AssertionError
> > > > > >     at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> > > > > >
> > > > > > No significant speedup could be observed in other modules; for
> > > > > > flink-yarn-tests we can maybe get a minute or 2 out of it.
> > > > > >
> > > > > > On 16/08/2019 10:43, Chesnay Schepler wrote:
> > > > > > > There appears to be general agreement that 1) should be looked
> > > > > > > into; I've set up a branch with fork reuse being enabled for
> > > > > > > all tests; will report back the results.
> > > > > > >
> > > > > > > On 15/08/2019 09:38, Chesnay Schepler wrote:
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > improving our build times is a hot topic at the moment, so
> > > > > > > > let's discuss the different ways they could be reduced.
> > > > > > > >
> > > > > > > > Current state:
> > > > > > > >
> > > > > > > > First up, let's look at some numbers:
> > > > > > > >
> > > > > > > > 1 full build currently consumes 5h of build time total
> > > > > > > > ("total time"), and in the ideal case takes about 1h20m
> > > > > > > > ("run time") to complete from start to finish. The run time
> > > > > > > > may fluctuate of course, depending on the current Travis
> > > > > > > > load. This applies to builds on both the Apache and
> > > > > > > > flink-ci Travis.
> > > > > > > >
> > > > > > > > At the time of writing, the current queue time for PR jobs
> > > > > > > > (reminder: running on flink-ci) is about 30 minutes (which
> > > > > > > > basically means that we are processing builds at the rate
> > > > > > > > that they come in); however, we are in an admittedly quiet
> > > > > > > > period right now. 2 weeks ago the queue times on flink-ci
> > > > > > > > peaked at around 5-6h as everyone was scrambling to get
> > > > > > > > their changes merged in time for the feature freeze.
> > > > > > > >
> > > > > > > > (Note: Recently, optimizations were added to ci-bot where
> > > > > > > > pending builds are canceled if a new commit was pushed to
> > > > > > > > the PR or the PR was closed, which should prove especially
> > > > > > > > useful during the rush hours we see before feature-freezes.)
> > > > > > > >
> > > > > > > > Past approaches
> > > > > > > >
> > > > > > > > Over the years we have done rather few things to improve
> > > > > > > > this situation (hence our current predicament).
> > > > > > > >
> > > > > > > > Beyond the sporadic speedup of some tests, the only notable
> > > > > > > > reduction in total build times was the introduction of cron
> > > > > > > > jobs, which consolidated the per-commit matrix from 4
> > > > > > > > configurations (different scala/hadoop versions) to 1.
> > > > > > > >
> > > > > > > > The separation into multiple build profiles was only a
> > > > > > > > work-around for the 50m limit on Travis. Running tests in
> > > > > > > > parallel has the obvious potential of reducing run time,
> > > > > > > > but we're currently hitting a hard limit since a few
> > > > > > > > modules (flink-tests, flink-runtime,
> > > > > > > > flink-table-planner-blink) are so loaded with tests that
> > > > > > > > they nearly consume an entire profile by themselves (and
> > > > > > > > thus no further splitting is possible).
> > > > > > > >
> > > > > > > > The rework that introduced stages did not, at the time of
> > > > > > > > introduction, provide a speedup either, although this
> > > > > > > > changed slightly once more profiles were added and some
> > > > > > > > optimizations to the caching had been made.
> > > > > > > >
> > > > > > > > Very recently we modified the surefire-plugin configuration
> > > > > > > > for flink-table-planner-blink to reuse JVM forks for IT
> > > > > > > > cases, providing a significant speedup (18 minutes!). So
> > > > > > > > far we have not seen any negative consequences.
> > > > > > > >
> > > > > > > > Suggestions
> > > > > > > >
> > > > > > > > This is a list of /all/ suggestions for reducing run/total
> > > > > > > > times that I have seen recently (in other words, they
> > > > > > > > aren't necessarily mine, nor may I agree with all of them).
> > > > > > > >
> > > > > > > > 1. Enable JVM reuse for IT cases in more modules.
> > > > > > > >    * We've seen significant speedups in the blink planner,
> > > > > > > >      and this should be applicable for all modules.
> > > > > > > >      However, I presume there's a reason why we disabled
> > > > > > > >      JVM reuse (information on this would be appreciated).
> > > > > > > > 2. Custom differential build scripts
> > > > > > > >    * Set up custom scripts for determining which modules
> > > > > > > >      might be affected by a change, and manipulate the
> > > > > > > >      splits accordingly. This approach is conceptually
> > > > > > > >      quite straight-forward, but has limits since it has
> > > > > > > >      to be pessimistic; i.e. a change in flink-core _must_
> > > > > > > >      result in testing all modules.
> > > > > > > > 3. Only run smoke tests when a PR is opened, run heavy
> > > > > > > >    tests on demand.
> > > > > > > >    * With the introduction of the ci-bot we now have
> > > > > > > >      significantly more options on how to handle PR builds.
> > > > > > > >      One option could be to only run basic tests when the
> > > > > > > >      PR is created (which may be only modified modules, or
> > > > > > > >      all unit tests, or another low-cost scheme), and then
> > > > > > > >      have a committer trigger other builds (full test run,
> > > > > > > >      e2e tests, etc...) on demand.
> > > > > > > > 4. Move more tests into cron builds
> > > > > > > >    * The budget version of 3); move certain tests that are
> > > > > > > >      either expensive (like some runtime tests that take
> > > > > > > >      minutes) or in rarely modified modules (like gelly)
> > > > > > > >      into cron jobs.
> > > > > > > > 5. Gradle
> > > > > > > >    * Gradle was brought up a few times for its built-in
> > > > > > > >      support for differential builds; basically providing
> > > > > > > >      2) without the overhead of maintaining additional
> > > > > > > >      scripts.
> > > > > > > >    * To date no PoC was provided that shows it working in
> > > > > > > >      our CI environment (i.e., handling splits & caching
> > > > > > > >      etc.).
> > > > > > > >    * This is the most disruptive change by a fair margin,
> > > > > > > >      as it would affect the entire project, developers and
> > > > > > > >      potentially users (if they build from source).
> > > > > > > > 6. CI service
> > > > > > > >    * Our current artifact caching setup on Travis is
> > > > > > > >      basically a hack; we're abusing the Travis cache,
> > > > > > > >      which is meant for long-term caching, to ship build
> > > > > > > >      artifacts across jobs. It's brittle at times due to
> > > > > > > >      timing/visibility issues, and on branches the cleanup
> > > > > > > >      processes can interfere with running builds. It is
> > > > > > > >      also not as effective as it could be.
> > > > > > > >    * There are CI services that provide build artifact
> > > > > > > >      caching out of the box, which could be useful for us.
> > > > > > > >    * To date, no PoC for using another CI service has been
> > > > > > > >      provided.
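
For reference, suggestion 1 above comes down to the surefire plugin's
reuseForks/forkCount parameters. A minimal sketch of trying fork reuse on
a single module from the command line (assuming the module's pom does not
pin these parameters, in which case the property overrides would be
ignored; the module path matches the blink planner's location at the time):

    # Run one module's tests with a single, reused JVM fork
    # (properties only take effect if the pom does not hard-code them).
    mvn verify -pl flink-table/flink-table-planner-blink \
        -DreuseForks=true \
        -DforkCount=1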
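
Suggestion 2 could start out as a small git-based check. A rough sketch,
with the caveat that it has to stay pessimistic and glosses over edge
cases such as changes to the root pom (the core-module list and branch
names are illustrative):

    #!/usr/bin/env bash
    # Sketch: pick the test scope based on which files a PR touches.
    CHANGED=$(git diff --name-only origin/master...HEAD)

    if echo "$CHANGED" | grep -qE '^(flink-core|flink-runtime)/'; then
        # A change to a core module must pessimistically test everything.
        mvn verify
    else
        # Otherwise test only the touched top-level modules plus their
        # dependents (-amd = also-make-dependents).
        MODULES=$(echo "$CHANGED" | cut -d/ -f1 | sort -u | paste -sd, -)
        mvn verify -pl "$MODULES" -amd
    fi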
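
As for suggestion 5, the appeal is that Gradle skips up-to-date work
without extra scripting. A sketch of the behaviour the Gradle PoC is meant
to verify (assuming a wrapper and working build files exist; the flags are
standard Gradle):

    # First run builds and tests everything, populating the build cache.
    ./gradlew build --parallel --build-cache

    # An unchanged second run marks tasks UP-TO-DATE and finishes quickly;
    # after editing one module, only that module and its dependents rebuild.
    ./gradlew build --parallel --build-cache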
Hi,
How is the CI migration going? I noticed that Azure Pipelines supports
adding self-hosted pipeline agents [1]. What do you think about OpenLab
donating some ARM resources to the Flink community as an agent pool? That
way, x86 and ARM can easily be handled by the same CI config, we don't
need to spend extra effort on adding separate OpenLab CI support, and the
Flink community has full access to the ARM resources as well.

Thanks.

[1]
https://docs.microsoft.com/zh-cn/azure/devops/pipelines/agents/v2-linux?view=azure-devops

On Thu, Sep 5, 2019 at 8:54 PM Robert Metzger <[hidden email]> wrote:
> [...]
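
For context, hooking such donated machines into Azure Pipelines amounts to
running the agent's configuration script on each box. Roughly the
following, where the organization URL, pool name, and personal access
token are placeholders (see the linked v2-linux docs for the actual agent
download):

    # Sketch: register a donated machine (x86 or ARM) in a self-hosted
    # agent pool.
    mkdir agent && cd agent
    tar zxvf vsts-agent-linux-*.tar.gz      # agent package from Microsoft
    ./config.sh --unattended \
        --url https://dev.azure.com/<organization> \
        --auth pat --token <personal-access-token> \
        --pool <arm-pool-name> \
        --agent $(hostname)
    ./run.sh                                # or ./svc.sh install for a service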
Hi,
we have not decided how to proceed on the issue yet. I'm currently in the
process of setting up some metrics for understanding the impact of Travis
vs. Azure Pipelines on the end-to-end build times.

In my current PoC [1], we are also using self-hosted pipeline agents. Feel
free to reach out to me via email, so that we can set up a trial for the
ARM integration.

[1] https://dev.azure.com/rmetzger/Flink/_build?definitionId=4&_a=summary

On Thu, Oct 24, 2019 at 4:07 AM Xiyuan Wang <[hidden email]> wrote:
> [...]
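
Such build-time metrics can presumably be scraped from the Azure DevOps
REST API; something along these lines (the jq post-processing is
illustrative):

    # Sketch: list recent build durations for comparison with Travis.
    curl -s "https://dev.azure.com/rmetzger/Flink/_apis/build/builds?api-version=5.1" \
      | jq -r '.value[] | [.id, .startTime, .finishTime, .result] | @tsv'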