[DISCUSS] Reducing build times


Chesnay Schepler
Hello everyone,

Improving our build times is a hot topic at the moment, so let's discuss
the different ways in which they could be reduced.


        Current state

First up, let's look at some numbers:

One full build currently consumes 5h of build time in total ("total
time"), and in the ideal case takes about 1h20m ("run time") to complete
from start to finish. The run time may of course fluctuate depending on
the current Travis load. This applies to builds on both the Apache and
flink-ci Travis accounts.

At the time of writing, the current queue time for PR jobs (reminder:
these run on flink-ci) is about 30 minutes, which basically means that
we are processing builds at the rate they come in; however, we are in
an admittedly quiet period right now.
Two weeks ago the queue times on flink-ci peaked at around 5-6h as
everyone was scrambling to get their changes merged in time for the
feature freeze.

(Note: optimizations were recently added to the ci-bot so that pending
builds are canceled if a new commit is pushed to the PR or the PR is
closed, which should prove especially useful during the rush hours we
see before feature freezes.)


        Past approaches

Over the years we have done rather few things to improve this situation
(hence our current predicament).

Beyond the sporadic speedup of some tests, the only notable reduction in
total build times was the introduction of cron jobs, which consolidated
the per-commit matrix from 4 configurations (different Scala/Hadoop
versions) to 1.

The separation into multiple build profiles was only a work-around for
the 50m limit on Travis. Running tests in parallel has the obvious
potential of reducing run time, but we're currently hitting a hard limit
since a few modules (flink-tests, flink-runtime,
flink-table-planner-blink) are so loaded with tests that they nearly
consume an entire profile by themselves (and thus no further splitting
is possible).

The rework that introduced stages did not, at the time of its
introduction, provide a speed-up either, although this changed slightly
once more profiles were added and some optimizations to the caching
were made.

Very recently we modified the surefire-plugin configuration for
flink-table-planner-blink to reuse JVM forks for IT cases, providing a
significant speedup (18 minutes!). So far we have not seen any negative
consequences.
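
For anyone who wants to try the same on another module, a rough local
experiment could look like the sketch below. This is only a sketch:
forkCount/reuseForks are standard Surefire user properties, but whether
overriding them on the command line actually takes effect depends on how
the module's pom wires the plugin configuration, and the module path is
just an example.

    # Time one module's tests with and without JVM fork reuse.
    # Assumes Flink has already been installed to the local repository
    # (mvn install -DskipTests), so the module can be built in isolation.
    MODULE=flink-table/flink-table-planner-blink   # example module

    # Baseline: fresh forks (roughly the previous behaviour).
    time mvn -pl "$MODULE" verify -DforkCount=1 -DreuseForks=false

    # Candidate: reuse the forked JVM across test classes.
    time mvn -pl "$MODULE" verify -DforkCount=1 -DreuseForks=true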


        Suggestions

This is a list of all suggestions for reducing run/total times that I
have seen recently (in other words, they are not necessarily mine, nor
do I necessarily agree with all of them).

 1. Enable JVM reuse for IT cases in more modules.
      * We've seen significant speedups in the blink planner, and this
        should be applicable for all modules. However, I presume there's
        a reason why we disabled JVM reuse (information on this would be
        appreciated)
 2. Custom differential build scripts
      * Set up custom scripts for determining which modules might be
        affected by a change, and manipulate the splits accordingly (a
        rough sketch follows after this list). This approach is
        conceptually quite straightforward, but has limits since it has
        to be pessimistic; i.e. a change in flink-core _must_ result in
        testing all modules.
 3. Only run smoke tests when PR is opened, run heavy tests on demand.
      * With the introduction of the ci-bot we now have significantly
        more options on how to handle PR builds. One option could be to
        only run basic tests when the PR is created (which may be only
        modified modules, or all unit tests, or another low-cost
        scheme), and then have a committer trigger other builds (full
        test run, e2e tests, etc...) on demand.
 4. Move more tests into cron builds
      * The budget version of 3); move certain tests that are either
        expensive (like some runtime tests that take minutes) or in
        rarely modified modules (like gelly) into cron jobs.
 5. Gradle
      * Gradle was brought up a few times for its built-in support for
        differential builds; basically providing 2) without the overhead
        of maintaining additional scripts.
      * To date no PoC was provided that shows it working in our CI
        environment (i.e., handling splits & caching etc).
      * This is the most disruptive change by a fair margin, as it would
        affect the entire project, developers and potentially users (if
        they build from source).
 6. CI service
      * Our current artifact caching setup on Travis is basically a
        hack; we're abusing the Travis cache, which is meant for
        long-term caching, to ship build artifacts across jobs. It's
        brittle at times due to timing/visibility issues and on branches
        the cleanup processes can interfere with running builds. It is
        also not as effective as it could be.
      * There are CI services that provide build artifact caching out of
        the box, which could be useful for us.
      * To date, no PoC for using another CI service has been provided.
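
As a rough illustration of what such a differential build script for 2)
could look like, see the sketch below. It is only a sketch: the
changed-file source, the module mapping and the list of "core" modules
are assumptions, and a real script would also have to filter non-module
paths (docs, tools, ...), handle nested modules, and feed the result
into the stage/split assignment.

    #!/usr/bin/env bash
    # Sketch: derive changed top-level modules from the PR diff and fall
    # back to a full build whenever a core module (or the root pom) is
    # touched.
    set -euo pipefail

    # Files changed relative to master (assumes a full clone with the
    # master branch available).
    CHANGED_FILES=$(git diff --name-only origin/master...HEAD)

    # Top-level directories that contain changes (simplified; assumes
    # every such directory is a Maven module).
    CHANGED_MODULES=$(echo "$CHANGED_FILES" | cut -d/ -f1 | sort -u)

    if echo "$CHANGED_MODULES" | grep -qE '^(flink-core|flink-runtime|pom\.xml)$'; then
        # Pessimistic case: a core change must trigger everything.
        mvn verify
    else
        # Test only the changed modules, their dependencies (-am) and
        # the modules that depend on them (-amd).
        mvn verify -pl "$(echo "$CHANGED_MODULES" | paste -s -d, -)" -am -amd
    fi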


Re: [DISCUSS] Reducing build times

Zhu Zhu
Thanks Chesnay for bringing up this discussion and sharing these
thoughts on speeding up the build process.

I'd +1 for options 2 and 3.

We can benefit a lot from option 2. Developing the table, connector,
library and docs modules would result in far fewer tests to run
(anywhere from a third of the tests down to a small fraction).
PRs for those modules make up more than half of all PRs, in my
observation.

Option 3 can be a supplement to option 2 for PRs that modify
fundamental modules like flink-core or flink-runtime.
It could even be a switch for the test scope (basic/full) of a PR, so
that committers do not need to trigger it multiple times.
With it we could postpone testing IT cases or connectors until the PR
reaches a stable state.

Thanks,
Zhu Zhu

Chesnay Schepler <[hidden email]> wrote on Thu, Aug 15, 2019 at 3:38 PM:


Re: [DISCUSS] Reducing build times

Till Rohrmann
Thanks for starting this discussion Chesnay. I think it has become
obvious to the Flink community that with the existing build setup we
cannot really deliver fast build times, which are essential for fast
iteration cycles and high developer productivity. The reasons for this
situation are manifold, but it is definitely affected by Flink's project
growth, tests that are not always optimal, and the inflexibility of
having to build everything. Hence, I consider the reduction of build
times crucial for the project's health and future growth.

Without necessarily voicing a strong preference for any of the presented
suggestions, I wanted to comment on each of them:

1. This sounds promising. Could the reason why we don't reuse JVMs date
back to the time when we still had a lot of static fields in Flink,
which made it hard to reuse JVMs because of potentially mutated global
state?

2. Building hand-crafted solutions around a build system to compensate
for limitations which other build systems support out of the box sounds
like the not-invented-here syndrome to me. Reinventing the wheel has
historically proven to rarely be the best solution, and it often comes
with a high maintenance price tag. Moreover, it would add just another
layer of complexity around our existing build system. I think the
current state, where we have the Maven setup in pom files and, for
Travis, multiple bash scripts that specialize the builds to make them
fit the time limit, is already not very transparent or easy to
understand.

3. I could see this working, but it also requires a very good
understanding of Flink from every committer, because the committer needs
to know which additional tests would be worth running.

4. I would be against this option solely to decrease our build time. My
observation is that the community does not monitor the health of the cron
jobs well enough. In the past the cron jobs have been unstable for as long
as a complete release cycle. Moreover, I've seen that PRs were merged which
passed Travis but broke the cron jobs. Consequently, I fear that this
option would deteriorate Flink's stability.

5. I would rephrase this point into changing the build system. Gradle could
be one candidate but there are also other build systems out there like
Bazel. Changing the build system would indeed be a major endeavour but I
could see the long term benefits of such a change (similar to having a
consistent and enforced code style) in particular if the build system
supports the functionality which we would otherwise build & maintain on our
own. I think there would be ways to make the transition not as disruptive
as described. For example, one could keep the Maven build and the new build
side by side until one is confident enough that the new build produces the
same output as the Maven build. Maybe it would also be possible to migrate
individual modules starting from the leaves. However, I admit that changing
the build system will affect every Flink developer because she needs to
learn & understand it.

6. I would like to learn about other people's experience with different
CI systems. Travis has worked OK-ish for Flink so far, but we sometimes
see problems with its caching mechanism, as Chesnay stated. I think this
topic is actually orthogonal to the other suggestions.

My gut feeling is that not a single suggestion will be our solution but a
combination of them.

Cheers,
Till

On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <[hidden email]> wrote:


Re: [DISCUSS] Reducing build times

Aleksey Pak
Hi all!

Thanks for starting this discussion.

I'd like to also add my 2 cents:

+1 for #2, differential build scripts.
I've worked on this approach, and with it I think it's possible to
reduce total build time with relatively low effort, without enforcing
any new build tool and with low maintenance cost.

You can check a proposed change (for the old CI setup, when Flink PRs
were running in the Apache common CI pool) here:
https://github.com/apache/flink/pull/9065
In the proposed change, the dependency check is not heavily hardcoded;
it just uses Maven's results for the dependency graph analysis.

> This approach is conceptually quite straightforward, but has limits
> since it has to be pessimistic; i.e. a change in flink-core _must_
> result in testing all modules.

Agreed, in Flink's case there are some core modules that would trigger a
full test run with such an approach. For developers who modify such
components, the build time would be the longest. But this approach
should really help developers who touch more-or-less independent
modules.

Even for core modules, it's possible to create "abstraction" barriers by
changing the dependency graph. For example, it could look like:
flink-core-api <-- flink-core, flink-core-api <-- flink-connectors.
In that case, only a change in flink-core-api would trigger a full test
run.
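
As a side note, plain Maven can already approximate this module-scoped
testing from the command line, which is roughly what such a script would
drive under the hood. A minimal sketch (the connector module below is
just an example, and this ignores the CI split/caching questions):

    # Test one module plus everything around it:
    #   -am  ("also make") builds the module's upstream dependencies,
    #   -amd ("also make dependents") additionally tests the downstream
    #        modules that depend on it.
    mvn verify -pl flink-connectors/flink-connector-kafka -am -amd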

+1 for #3, separating PR CI runs into different stages.
In my opinion it may require more changes to the current CI setup than
#2, and ideally it should not be simplistic: best if it integrates with
the Flink bot and triggers follow-up build steps only when some
prerequisites are done.

+1 for #4, to move some tests into cron runs.
But in my opinion this does not scale well; it applies only to a small
subset of tests.

+1 for #6, to use other CI service(s).
More specifically, GitHub provides build actions for free that can be
used to offload some build steps/PR checks. This can help move some PR
checks out of the main CI build (for example: documentation builds,
license checks, code formatting checks).

Regards,
Aleksey

On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <[hidden email]> wrote:


Re: [DISCUSS] Reducing build times

Jark Wu
Thanks Chesnay for starting this discussion.

+1 for #1; it might be the easiest way to get a significant speedup.
If the only reason for disabling JVM reuse is isolation, I think we can
fix the static fields or global state used in Flink where possible.

+1 for #2, and thanks Aleksey for the prototype. I think it's a good
approach which doesn't introduce too many things to maintain.

+1 for #3 (run cron or e2e tests on demand).
We have this need when reviewing some pull requests, because we are not
sure whether they will break some specific e2e test.
Currently, we have to run those tests locally by building the whole
project, or enable cron jobs for the pushed branch on the contributor's
own Travis.
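
(For reference, the local workaround today looks roughly like the sketch
below; the e2e helper script name and test script path are assumptions
from memory and may differ.)

    # Build the whole project without tests so the e2e scripts have a
    # distribution to run against...
    mvn clean install -DskipTests -Dfast
    # ...then run a single e2e test script (names are only examples).
    cd flink-end-to-end-tests
    ./run-single-test.sh test-scripts/test_batch_wordcount.sh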

Besides that, I think FLINK-11464[1] is also a good way to cache
distributions to save a lot of download time.

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-11464

On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <[hidden email]> wrote:


Re: [DISCUSS] Reducing build times

Arvid Heise
Thank you for starting the discussion as well!

+1 to 1. It seems to be quite a low-hanging fruit that we should try to
employ as much as possible.

-0 to 2. The build setup is already very complicated. Adding new
functionality that I would expect to come out of the box with a modern
build tool seems like too much effort to me. I'm proposing a 7th action
item below that I would like to try out first before making the setup
more complicated.

+0 to 3. What is the actual intent here? If it's about failing earlier,
then I'd rather propose to reorder the tests such that unit and smoke tests
of every module are run before IT tests. If it's about being able to
approve a PR quicker, are smoke tests really enough? However, if we have
layered tests, then it would be rather easy to omit IT tests altogether in
specific (local) builds.

-1 to 4. I really want to see when stuff breaks, and not only once per
day (or whatever the cron cycle is). I can easily see more broken code
being merged into master because of the disconnect.

+1 to 5. The Gradle build cache has worked well for me in the past. If
there is general interest, I can start a PoC (or improve upon older
PoCs). I currently expect shading to require the most effort.

+1 to 6. Travis had so many drawbacks in the past, and now that most of
the senior staff has been laid off, I don't expect any improvements at
all.
At my old company, I switched our open source projects to Azure
Pipelines with great success. Azure Pipelines offers 10 instances for
open source projects and its payment model is pay-as-you-go [1]. Since
artifact sharing seems to be an issue with Travis anyway, it looks
rather easy to do in Pipelines [2].
I'd also expect GitHub CI to be a good fit for our needs [3], but it's
rather young and I have no experience with it.

---

7. I'd like to first try the global build cache that Gradle Enterprise
provides for Maven [4]. It basically fingerprints a task (fingerprint of
upstream tasks, source files + black magic) and, whenever the
fingerprint matches, fetches the results from the build cache. In
theory, we would get the results of 2) implicitly without any effort. Of
course, Gradle Enterprise costs money (which I could inquire about if
there is general interest), but it would also allow us to downgrade the
Travis plan (and Travis is really expensive).


[1]
https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
[2]
https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml
[3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
[4] https://docs.gradle.com/enterprise/maven-extension/

On Fri, Aug 16, 2019 at 5:20 AM Jark Wu <[hidden email]> wrote:

> Thanks Chesnay for starting this discussion.
>
> +1 for #1, it might be the easiest way to get a significant speedup.
> If the only reason is for isolation. I think we can fix the static fields
> or global state used in Flink if possible.
>
> +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> approach which doesn't introduce too much things to maintain.
>
> +1 for #3(run CRON or e2e tests on demand).
> We have this requirement when reviewing some pull requests, because we
> don't sure whether it will broken some specific e2e test.
> Currently, we have to run it locally by building the whole project. Or
> enable CRON jobs for the pushed branch in contributor's own travis.
>
> Besides that, I think FLINK-11464[1] is also a good way to cache
> distributions to save a lot of download time.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-11464
>
> On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <[hidden email]> wrote:
>
> > Hi all!
> >
> > Thanks for starting this discussion.
> >
> > I'd like to also add my 2 cents:
> >
> > +1 for #2, differential build scripts.
> > I've worked on the approach. And with it, I think it's possible to reduce
> > total build time with relatively low effort, without enforcing any new
> > build tool and low maintenance cost.
> >
> > You can check a proposed change (for the old CI setup, when Flink PRs
> were
> > running in Apache common CI pool) here:
> > https://github.com/apache/flink/pull/9065
> > In the proposed change, the dependency check is not heavily hardcoded and
> > just uses maven's results for dependency graph analysis.
> >
> > > This approach is conceptually quite straight-forward, but has limits
> > since it has to be pessimistic; > i.e. a change in flink-core _must_
> result
> > in testing all modules.
> >
> > Agree, in Flink case, there are some core modules that would trigger
> whole
> > tests run with such approach. For developers who modify such components,
> > the build time would be the longest. But this approach should really help
> > for developers who touch more-or-less independent modules.
> >
> > Even for core modules, it's possible to create "abstraction" barriers by
> > changing dependency graph. For example, it can look like: flink-core-api
> > <-- flink-core, flink-core-api <-- flink-connectors.
> > In that case, only change in flink-core-api would trigger whole tests
> run.
> >
> > +1 for #3, separating PR CI runs to different stages.
> > Imo, it may require more change to current CI setup, compared to #2 and
> > better it should not be silly. Best, if it integrates with the Flink bot
> > and triggers some follow up build steps only when some prerequisites are
> > done.
> >
> > +1 for #4, to move some tests into cron runs.
> > But imo, this does not scale well, it applies only to a small subset of
> > tests.
> >
> > +1 for #6, to use other CI service(s).
> > More specifically, GitHub gives build actions for free that can be used
> to
> > offload some build steps/PR checks. It can help to move out some PR
> checks
> > from the main CI build (for example: documentation builds, license
> checks,
> > code formatting checks).
> >
> > Regards,
> > Aleksey
> >
> > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Thanks for starting this discussion Chesnay. I think it has become
> > obvious
> > > to the Flink community that with the existing build setup we cannot
> > really
> > > deliver fast build times which are essential for fast iteration cycles
> > and
> > > high developer productivity. The reasons for this situation are
> manifold
> > > but it is definitely affected by Flink's project growth, not always
> > optimal
> > > tests and the inflexibility that everything needs to be built. Hence, I
> > > consider the reduction of build times crucial for the project's health
> > and
> > > future growth.
> > >
> > > Without necessarily voicing a strong preference for any of the
> presented
> > > suggestions, I wanted to comment on each of them:
> > >
> > > 1. This sounds promising. Could the reason why we don't reuse JVMs date
> > > back to the time when we still had a lot of static fields in Flink
> which
> > > made it hard to reuse JVMs and the potentially mutated global state?
> > >
> > > 2. Building hand-crafted solutions around a build system in order to
> > > compensate for its limitations which other build systems support out of
> > the
> > > box sounds like the not invented here syndrome to me. Reinventing the
> > wheel
> > > has historically proven to be usually not the best solution and it
> often
> > > comes with a high maintenance price tag. Moreover, it would add just
> > > another layer of complexity around our existing build system. I think
> the
> > > current state where we have the maven setup in pom files and for Travis
> > > multiple bash scripts specializing the builds to make it fit the time
> > limit
> > > is already not very transparent/easy to understand.
> > >
> > > 3. I could see this work but it also requires a very good understanding
> > of
> > > Flink of every committer because the committer needs to know which
> tests
> > > would be good to run additionally.
> > >
> > > 4. I would be against this option solely to decrease our build time. My
> > > observation is that the community does not monitor the health of the
> cron
> > > jobs well enough. In the past the cron jobs have been unstable for as
> > long
> > > as a complete release cycle. Moreover, I've seen that PRs were merged
> > which
> > > passed Travis but broke the cron jobs. Consequently, I fear that this
> > > option would deteriorate Flink's stability.
> > >
> > > 5. I would rephrase this point into changing the build system. Gradle
> > could
> > > be one candidate but there are also other build systems out there like
> > > Bazel. Changing the build system would indeed be a major endeavour but
> I
> > > could see the long term benefits of such a change (similar to having a
> > > consistent and enforced code style) in particular if the build system
> > > supports the functionality which we would otherwise build & maintain on
> > our
> > > own. I think there would be ways to make the transition not as
> disruptive
> > > as described. For example, one could keep the Maven build and the new
> > build
> > > side by side until one is confident enough that the new build produces
> > the
> > > same output as the Maven build. Maybe it would also be possible to
> > migrate
> > > individual modules starting from the leaves. However, I admit that
> > changing
> > > the build system will affect every Flink developer because she needs to
> > > learn & understand it.
> > >
> > > 6. I would like to learn about other people's experience with different
> > CI
> > > systems. Travis worked okish for Flink so far but we see sometimes
> > problems
> > > with its caching mechanism as Chesnay stated. I think that this topic
> is
> > > actually orthogonal to the other suggestions.
> > >
> > > My gut feeling is that not a single suggestion will be our solution
> but a
> > > combination of them.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <[hidden email]> wrote:
> > >
> > > > Thanks Chesnay for bringing up this discussion and sharing those
> > thoughts
> > > > to speed up the building process.
> > > >
> > > > I'd +1 for option 2 and 3.
> > > >
> > > > We can benefit a lot from Option 2. Developing table, connectors,
> > > > libraries, docs modules would result in much fewer tests (1/3 to 1/tens)
> > > > to run.
> > > > PRs for those modules take up more than half of all the PRs in my
> > > > observation.
> > > >
> > > > Option 3 can be supplementary to option 2 when the PR is modifying
> > > > fundamental modules like flink-core or flink-runtime.
> > > > It can even be a switch of the test scope (basic/full) of a PR, so that
> > > > committers do not need to trigger it multiple times.
> > > > With it we can postpone the testing of IT cases or connectors until the
> > > > PR reaches a stable state.
> > > >
> > > > Thanks,
> > > > Zhu Zhu
> > > >


--

Arvid Heise | Senior Software Engineer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

Xiyuan Wang
6. CI service
    I'm not very familiar with Travis, but according to its official
docs [1][2], is it possible to run jobs in parallel? AFAIK, many CI systems
support this kind of feature.

[1]:
https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines
[2]: https://docs.travis-ci.com/user/build-matrix/
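
For illustration, Travis can run the jobs of one build in parallel via a job
matrix or build stages; a minimal .travis.yml sketch (the stage/job names and
the controller script below are made up, Flink's actual setup differs):

    jobs:
      include:
        - stage: test
          name: "core"
          script: ./tools/ci/run_profile.sh core
        - stage: test
          name: "tests"
          script: ./tools/ci/run_profile.sh tests
        - stage: test
          name: "connectors"
          script: ./tools/ci/run_profile.sh connectors

Jobs within the same stage run concurrently, subject to the concurrency limit
of the Travis plan.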

Arvid Heise <[hidden email]> 于2019年8月16日周五 下午4:14写道:

> Thank you for starting the discussion as well!
>
> +1 to 1. It seems to be quite a low-hanging fruit that we should try to
> employ as much as possible.
>
> -0 to 2. The build setup is already very complicated. Adding new
> functionality that I would expect to come out of the box of a modern build
> tool seems like too much effort for me. I'm proposing a 7th action item that
> I would like to try out first before making the setup more complicated.
>
> +0 to 3. What is the actual intent here? If it's about failing earlier,
> then I'd rather propose to reorder the tests such that unit and smoke tests
> of every module are run before IT tests. If it's about being able to
> approve a PR quicker, are smoke tests really enough? However, if we have
> layered tests, then it would be rather easy to omit IT tests altogether in
> specific (local) builds.
>
> -1 to 4. I really want to see when stuff breaks not only once per day (or
> whatever the CRON cycle is). I can really see more broken code being merged
> into master because of the disconnect.
>
> +1 to 5. Gradle build cache has worked well for me in the past. If there is
> a general interest, I can start a POC (or improve upon older POCs). I
> currently expect shading to be the most effort.
>
> +1 to 6. Travis had so many drawbacks in the past and now that most of the
> senior staff has been laid off, I don't expect any improvements at all.
> At my old company, I switched our open source projects to Azure pipelines
> with great success. Azure pipelines offers 10 instances for open source
> projects and its payment model is pay-as-you-go [1]. Since artifact
> sharing seems to be an issue with Travis anyway, it looks rather easy to
> use in pipelines [2].
> I'd also expect Github CI to be a good fit for our needs [3], but it's
> rather young and I have no experience.
>
> ---
>
> 7. As a new option, I'd like to first try the global build cache that's
> provided by Gradle Enterprise for Maven [4]. It basically fingerprints a task
> (fingerprint of upstream tasks, source files + black magic) and whenever
> the fingerprint matches it fetches the results from the build cache. In
> theory, we would get the results of 2. implicitly without any effort. Of
> course, Gradle enterprise costs money (which I could inquire if general
> interest exists) but it would also allow us to downgrade the Travis plan
> (and Travis is really expensive).
>
>
> [1]
>
> https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
> [2]
>
> https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml
> [3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
> [4] https://docs.gradle.com/enterprise/maven-extension/
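
For reference, the Gradle Enterprise build cache from [4] above hooks into
Maven as a core extension; a rough sketch of the registration (the version is
a placeholder, and the concrete setup would still need to be evaluated):

    <!-- .mvn/extensions.xml -->
    <extensions>
      <extension>
        <groupId>com.gradle</groupId>
        <artifactId>gradle-enterprise-maven-extension</artifactId>
        <version>1.x</version> <!-- placeholder -->
      </extension>
    </extensions>

The cache server itself would then be configured in .mvn/gradle-enterprise.xml.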
>
> On Fri, Aug 16, 2019 at 5:20 AM Jark Wu <[hidden email]> wrote:
>
> > Thanks Chesnay for starting this discussion.
> >
> > +1 for #1, it might be the easiest way to get a significant speedup.
> > If the only reason is isolation, I think we can fix the static fields
> > or global state used in Flink if possible.
> >
> > +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> > approach which doesn't introduce too many things to maintain.
> >
> > +1 for #3 (run CRON or e2e tests on demand).
> > We have this requirement when reviewing some pull requests, because we
> > are not sure whether they will break some specific e2e test.
> > Currently, we have to run them locally by building the whole project, or
> > enable CRON jobs for the pushed branch in the contributor's own Travis.
> >
> > Besides that, I think FLINK-11464[1] is also a good way to cache
> > distributions to save a lot of download time.
> >
> > Best,
> > Jark
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-11464
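
As a sketch of what such caching could look like on Travis (the second
directory name is made up for illustration), .travis.yml supports declaring
cached directories:

    cache:
      directories:
        - $HOME/.m2/repository
        - $HOME/flink-download-cache   # e.g. for downloaded Kafka/Hadoop distributions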
> >
> > On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <[hidden email]> wrote:
> >
> > > Hi all!
> > >
> > > Thanks for starting this discussion.
> > >
> > > I'd like to also add my 2 cents:
> > >
> > > +1 for #2, differential build scripts.
> > > I've worked on this approach, and with it I think it's possible to reduce
> > > the total build time with relatively low effort, without enforcing any new
> > > build tool, and with low maintenance cost.
> > >
> > > You can check a proposed change (for the old CI setup, when Flink PRs
> > were
> > > running in Apache common CI pool) here:
> > > https://github.com/apache/flink/pull/9065
> > > In the proposed change, the dependency check is not heavily hardcoded
> and
> > > just uses maven's results for dependency graph analysis.
> > >
> > > > This approach is conceptually quite straight-forward, but has limits
> > > since it has to be pessimistic; > i.e. a change in flink-core _must_
> > result
> > > in testing all modules.
> > >
> > > Agreed: in Flink's case, there are some core modules that would trigger a
> > > whole test run with such an approach. For developers who modify such
> > > components, the build time would be the longest. But this approach should
> > > really help developers who touch more-or-less independent modules.
> > >
> > > Even for core modules, it's possible to create "abstraction" barriers by
> > > changing the dependency graph. For example, it can look like:
> > > flink-core-api <-- flink-core, flink-core-api <-- flink-connectors.
> > > In that case, only a change in flink-core-api would trigger a whole test run.
> > >
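
As a rough illustration of the shape such a differential-build script could
take (this is only a sketch, not the logic from the PR linked above; module
detection in the real Flink tree would need to be smarter than a top-level
directory match):

    #!/usr/bin/env bash
    # Derive changed top-level modules from the diff against master and let Maven
    # build them plus everything that depends on them (-amd = "also make dependents").
    CHANGED_MODULES=$(git diff --name-only origin/master...HEAD \
        | cut -d/ -f1 | sort -u | grep '^flink-' | paste -sd, -)

    if [ -z "$CHANGED_MODULES" ] || echo "$CHANGED_MODULES" | grep -q 'flink-core'; then
        # Pessimistic fallback: core (or unclassified) changes test everything.
        mvn verify
    else
        mvn -pl "$CHANGED_MODULES" -amd verify
    fi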
> > > +1 for #3, separating PR CI runs into different stages.
> > > Imo, it may require more changes to the current CI setup compared to #2,
> > > and ideally it should not be silly. Best if it integrates with the Flink bot
> > > and triggers some follow-up build steps only when some prerequisites are
> > > done.
> > >
> > > +1 for #4, to move some tests into cron runs.
> > > But imo this does not scale well; it applies only to a small subset of
> > > tests.
> > >
> > > +1 for #6, to use other CI service(s).
> > > More specifically, GitHub gives build actions for free that can be used
> > to
> > > offload some build steps/PR checks. It can help to move out some PR
> > checks
> > > from the main CI build (for example: documentation builds, license
> > checks,
> > > code formatting checks).
> > >
> > > Regards,
> > > Aleksey
> > >
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

Chesnay Schepler-3
In reply to this post by Chesnay Schepler-3
There appears to be general agreement that 1) should be looked into;
I've set up a branch with fork reuse enabled for all tests and will
report back the results.
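
For reference, the knob in question is the fork-reuse setting of the Maven
surefire plugin; roughly the following, although the exact placement in the
Flink poms differs (shared plugin management vs. per-module overrides):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <forkCount>1</forkCount>
        <!-- true = run many test classes in one JVM instead of forking a fresh one per class -->
        <reuseForks>true</reuseForks>
      </configuration>
    </plugin>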


Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

Chesnay Schepler-3
Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right
away, while flink-tests has the potential for huge savings, but we have
to figure out some issues first.


Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in
libraries (table-planner).

The kafka and connectors profiles both fail in kafka tests due to
producer leaks, and no speed up could be confirmed so far:

java.lang.AssertionError: Detected producer leak. Thread name: kafka-producer-network-thread | producer-239
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
        at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator results within time limit.
        at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal run of the tests profile takes 40 minutes, while this one
above failed after 19 minutes and is only missing the migration tests (which
currently need 6-7 minutes). So we could save somewhere between 15 and 20
minutes here.


Finally, the misc profile fails in YARN:

java.lang.AssertionError
        at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.

On 16/08/2019 10:43, Chesnay Schepler wrote:

> There appears to be a general agreement that 1) should be looked into;
> I've setup a branch with fork reuse being enabled for all tests; will
> report back the results.

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

Till Rohrmann
For the sake of keeping the discussion focused and not cluttering this
thread, I would suggest splitting the detailed reporting on reusing JVMs
into a separate thread and cross-linking it from here.

Cheers,
Till

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

bowen.li
+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's much
smaller in size, totally isolated from and not interfering with the flink
project [2], and it actually covers most of our practical feature requirements
for a build tool - all of which makes it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink
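
To make the PoC idea concrete, a single flink-shaded-style module in Gradle
would essentially be a shadow-plugin build with a relocation rule; a minimal
sketch (plugin and dependency versions are placeholders, the relocation target
mirrors what flink-shaded does for guava):

    // build.gradle.kts
    plugins {
        `java-library`
        id("com.github.johnrengelman.shadow") version "5.1.0"   // placeholder version
    }

    repositories { mavenCentral() }

    dependencies {
        implementation("com.google.guava:guava:18.0")            // placeholder version
    }

    tasks.shadowJar {
        // Relocate so the bundled classes cannot clash with user-provided guava.
        relocate("com.google.common", "org.apache.flink.shaded.guava18.com.google.common")
    }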



Re: [DISCUSS] Reducing build times

Aljoscha Krettek-2
Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.

Aljoscha

> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
>
> +1 to Till's points on #2 and #5, especially the potential non-disruptive,
> gradual migration approach if we decide to go that route.
>
> To add on, I want to point it out that we can actually start with
> flink-shaded project [1] which is a perfect candidate for PoC. It's of much
> smaller size, totally isolated from and not interfered with flink project
> [2], and it actually covers most of our practical feature requirements for
> a build tool - all making it an ideal experimental field.
>
> [1] https://github.com/apache/flink-shaded
> [2] https://github.com/apache/flink
>
>
> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]> wrote:
>
>> For the sake of keeping the discussion focused and not cluttering the
>> discussion thread I would suggest to split the detailed reporting for
>> reusing JVMs to a separate thread and cross linking it from here.
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]>
>> wrote:
>>
>>> Update:
>>>
>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>> away, while flink-tests has the potential for huge savings, but we have
>>> to figure out some issues first.
>>>
>>>
>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>
>>> 4/8 profiles failed.
>>>
>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
>>> libraries (table-planner).
>>>
>>> The kafka and connectors profiles both fail in kafka tests due to
>>> producer leaks, and no speed up could be confirmed so far:
>>>
>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>> kafka-producer-network-thread | producer-239
>>>        at org.junit.Assert.fail(Assert.java:88)
>>>        at
>>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>        at
>>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>
>>>
>>> The tests profile failed due to various errors in migration tests:
>>>
>>> junit.framework.AssertionFailedError: Did not see the expected
>> accumulator
>>> results within time limit.
>>>        at
>>>
>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>
>>> *However*, a normal tests run takes 40 minutes, while this one above
>>> failed after 19 minutes and is only missing the migration tests (which
>>> currently need 6-7 minutes). So we could save somewhere between 15 to 20
>>> minutes here.
>>>
>>>
>>> Finally, the misc profiles fails in YARN:
>>>
>>> java.lang.AssertionError
>>>        at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>
>>> No significant speedup could be observed in other modules; for
>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>>
>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>> There appears to be a general agreement that 1) should be looked into;
>>>> I've setup a branch with fork reuse being enabled for all tests; will
>>>> report back the results.
>>>>
>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>> Hello everyone,
>>>>>
>>>>> improving our build times is a hot topic at the moment so let's
>>>>> discuss the different ways how they could be reduced.
>>>>>
>>>>>
>>>>>       Current state:
>>>>>
>>>>> First up, let's look at some numbers:
>>>>>
>>>>> 1 full build currently consumes 5h of build time total ("total
>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
>>>>> complete from start to finish. The run time may fluctuate of course
>>>>> depending on the current Travis load. This applies both to builds on
>>>>> the Apache and flink-ci Travis.
>>>>>
>>>>> At the time of writing, the current queue time for PR jobs (reminder:
>>>>> running on flink-ci) is about 30 minutes (which basically means that
>>>>> we are processing builds at the rate that they come in), however we
>>>>> are in an admittedly quiet period right now.
>>>>> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
>>>>> everyone was scrambling to get their changes merged in time for the
>>>>> feature freeze.
>>>>>
>>>>> (Note: Recently optimizations where added to ci-bot where pending
>>>>> builds are canceled if a new commit was pushed to the PR or the PR
>>>>> was closed, which should prove especially useful during the rush
>>>>> hours we see before feature-freezes.)
>>>>>
>>>>>
>>>>>       Past approaches
>>>>>
>>>>> Over the years we have done rather few things to improve this
>>>>> situation (hence our current predicament).
>>>>>
>>>>> Beyond the sporadic speedup of some tests, the only notable reduction
>>>>> in total build times was the introduction of cron jobs, which
>>>>> consolidated the per-commit matrix from 4 configurations (different
>>>>> scala/hadoop versions) to 1.
>>>>>
>>>>> The separation into multiple build profiles was only a work-around
>>>>> for the 50m limit on Travis. Running tests in parallel has the
>>>>> obvious potential of reducing run time, but we're currently hitting a
>>>>> hard limit since a few modules (flink-tests, flink-runtime,
>>>>> flink-table-planner-blink) are so loaded with tests that they nearly
>>>>> consume an entire profile by themselves (and thus no further
>>>>> splitting is possible).
>>>>>
>>>>> The rework that introduced stages, at the time of introduction, did
>>>>> also not provide a speed up, although this changed slightly once more
>>>>> profiles were added and some optimizations to the caching have been
>>>>> made.
>>>>>
>>>>> Very recently we modified the surefire-plugin configuration for
>>>>> flink-table-planner-blink to reuse JVM forks for IT cases, providing
>>>>> a significant speedup (18 minutes!). So far we have not seen any
>>>>> negative consequences.
>>>>>
>>>>>
>>>>>       Suggestions
>>>>>
>>>>> This is a list of /all /suggestions for reducing run/total times that
>>>>> I have seen recently (in other words, they aren't necessarily mine
>>>>> nor may I agree with all of them).
>>>>>
>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>     * We've seen significant speedups in the blink planner, and this
>>>>>       should be applicable for all modules. However, I presume
>> there's
>>>>>       a reason why we disabled JVM reuse (information on this would
>> be
>>>>>       appreciated)
>>>>> 2. Custom differential build scripts
>>>>>     * Setup custom scripts for determining which modules might be
>>>>>       affected by change, and manipulate the splits accordingly. This
>>>>>       approach is conceptually quite straight-forward, but has limits
>>>>>       since it has to be pessimistic; i.e. a change in flink-core
>>>>>       _must_ result in testing all modules.
>>>>> 3. Only run smoke tests when PR is opened, run heavy tests on demand.
>>>>>     * With the introduction of the ci-bot we now have significantly
>>>>>       more options on how to handle PR builds. One option could be to
>>>>>       only run basic tests when the PR is created (which may be only
>>>>>       modified modules, or all unit tests, or another low-cost
>>>>>       scheme), and then have a committer trigger other builds (full
>>>>>       test run, e2e tests, etc...) on demand.
>>>>> 4. Move more tests into cron builds
>>>>>     * The budget version of 3); move certain tests that are either
>>>>>       expensive (like some runtime tests that take minutes) or in
>>>>>       rarely modified modules (like gelly) into cron jobs.
>>>>> 5. Gradle
>>>>>     * Gradle was brought up a few times for it's built-in support for
>>>>>       differential builds; basically providing 2) without the
>> overhead
>>>>>       of maintaining additional scripts.
>>>>>     * To date no PoC was provided that shows it working in our CI
>>>>>       environment (i.e., handling splits & caching etc).
>>>>>     * This is the most disruptive change by a fair margin, as it
>> would
>>>>>       affect the entire project, developers and potentially users (f
>>>>>       they build from source).
>>>>> 6. CI service
>>>>>     * Our current artifact caching setup on Travis is basically a
>>>>>       hack; we're basically abusing the Travis cache, which is meant
>>>>>       for long-term caching, to ship build artifacts across jobs.
>> It's
>>>>>       brittle at times due to timing/visibility issues and on
>> branches
>>>>>       the cleanup processes can interfere with running builds. It is
>>>>>       also not as effective as it could be.
>>>>>     * There are CI services that provide build artifact caching out
>> of
>>>>>       the box, which could be useful for us.
>>>>>     * To date, no PoC for using another CI service has been provided.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>


Re: [DISCUSS] Reducing build times

Chesnay Schepler-3
@Aljoscha Shading takes a few minutes for a full build; you can see this
quite easily by looking at the compile step in the misc profile
<https://api.travis-ci.org/v3/job/572560060/log.txt>; any module that
takes longer than a fraction of a second usually does so because it
shades lots of classes. Note that I cannot tell you how much of this is
spent on relocations, and how much on writing the jar.

Personally, I'd very much like us to move all shading to flink-shaded;
this would finally allow us to use newer Maven versions without needing
cumbersome workarounds for flink-dist. However, this isn't a trivial
affair in some cases; IIRC calcite could be difficult to handle.

On another note, this would also simplify switching the main repo to
another build system, since you would no longer have to deal with
relocations, just packaging + merging NOTICE files.
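
(For readers less familiar with the shade plugin: a relocation is configured
roughly like the simplified sketch below; the package names are placeholders
for illustration, not our actual configuration.)

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- rewrite the bundled dependency's packages into a project-owned namespace -->
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.flink.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>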

@BowenLi I disagree: flink-shaded does not include any tests, API
compatibility checks, checkstyle, layered shading (e.g., flink-runtime
and flink-dist, where both relocate dependencies and one is bundled by
the other), and, most importantly, CI (and really, without CI being
covered in a PoC there's nothing to discuss).

On 16/08/2019 15:13, Aljoscha Krettek wrote:

> Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.
>
> Aljoscha
>
>> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
>>
>> +1 to Till's points on #2 and #5, especially the potential non-disruptive,
>> gradual migration approach if we decide to go that route.
>>
>> To add on, I want to point it out that we can actually start with
>> flink-shaded project [1] which is a perfect candidate for PoC. It's of much
>> smaller size, totally isolated from and not interfered with flink project
>> [2], and it actually covers most of our practical feature requirements for
>> a build tool - all making it an ideal experimental field.
>>
>> [1] https://github.com/apache/flink-shaded
>> [2] https://github.com/apache/flink
>>
>>
>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]> wrote:
>>
>>> For the sake of keeping the discussion focused and not cluttering the
>>> discussion thread I would suggest to split the detailed reporting for
>>> reusing JVMs to a separate thread and cross linking it from here.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]>
>>> wrote:
>>>
>>>> Update:
>>>>
>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>>> away, while flink-tests has the potential for huge savings, but we have
>>>> to figure out some issues first.
>>>>
>>>>
>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>
>>>> 4/8 profiles failed.
>>>>
>>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
>>>> libraries (table-planner).
>>>>
>>>> The kafka and connectors profiles both fail in kafka tests due to
>>>> producer leaks, and no speed up could be confirmed so far:
>>>>
>>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>>> kafka-producer-network-thread | producer-239
>>>>         at org.junit.Assert.fail(Assert.java:88)
>>>>         at
>>>>
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>>         at
>>>>
>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>>
>>>> The tests profile failed due to various errors in migration tests:
>>>>
>>>> junit.framework.AssertionFailedError: Did not see the expected
>>> accumulator
>>>> results within time limit.
>>>>         at
>>>>
>>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>> *However*, a normal tests run takes 40 minutes, while this one above
>>>> failed after 19 minutes and is only missing the migration tests (which
>>>> currently need 6-7 minutes). So we could save somewhere between 15 to 20
>>>> minutes here.
>>>>
>>>>
>>>> Finally, the misc profiles fails in YARN:
>>>>
>>>> java.lang.AssertionError
>>>>         at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>>
>>>> No significant speedup could be observed in other modules; for
>>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>>>
>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>>> There appears to be a general agreement that 1) should be looked into;
>>>>> I've setup a branch with fork reuse being enabled for all tests; will
>>>>> report back the results.
>>>>>
>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> improving our build times is a hot topic at the moment so let's
>>>>>> discuss the different ways how they could be reduced.
>>>>>>
>>>>>>
>>>>>>        Current state:
>>>>>>
>>>>>> First up, let's look at some numbers:
>>>>>>
>>>>>> 1 full build currently consumes 5h of build time total ("total
>>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
>>>>>> complete from start to finish. The run time may fluctuate of course
>>>>>> depending on the current Travis load. This applies both to builds on
>>>>>> the Apache and flink-ci Travis.
>>>>>>
>>>>>> At the time of writing, the current queue time for PR jobs (reminder:
>>>>>> running on flink-ci) is about 30 minutes (which basically means that
>>>>>> we are processing builds at the rate that they come in), however we
>>>>>> are in an admittedly quiet period right now.
>>>>>> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
>>>>>> everyone was scrambling to get their changes merged in time for the
>>>>>> feature freeze.
>>>>>>
>>>>>> (Note: Recently optimizations where added to ci-bot where pending
>>>>>> builds are canceled if a new commit was pushed to the PR or the PR
>>>>>> was closed, which should prove especially useful during the rush
>>>>>> hours we see before feature-freezes.)
>>>>>>
>>>>>>
>>>>>>        Past approaches
>>>>>>
>>>>>> Over the years we have done rather few things to improve this
>>>>>> situation (hence our current predicament).
>>>>>>
>>>>>> Beyond the sporadic speedup of some tests, the only notable reduction
>>>>>> in total build times was the introduction of cron jobs, which
>>>>>> consolidated the per-commit matrix from 4 configurations (different
>>>>>> scala/hadoop versions) to 1.
>>>>>>
>>>>>> The separation into multiple build profiles was only a work-around
>>>>>> for the 50m limit on Travis. Running tests in parallel has the
>>>>>> obvious potential of reducing run time, but we're currently hitting a
>>>>>> hard limit since a few modules (flink-tests, flink-runtime,
>>>>>> flink-table-planner-blink) are so loaded with tests that they nearly
>>>>>> consume an entire profile by themselves (and thus no further
>>>>>> splitting is possible).
>>>>>>
>>>>>> The rework that introduced stages, at the time of introduction, did
>>>>>> also not provide a speed up, although this changed slightly once more
>>>>>> profiles were added and some optimizations to the caching have been
>>>>>> made.
>>>>>>
>>>>>> Very recently we modified the surefire-plugin configuration for
>>>>>> flink-table-planner-blink to reuse JVM forks for IT cases, providing
>>>>>> a significant speedup (18 minutes!). So far we have not seen any
>>>>>> negative consequences.
>>>>>>
>>>>>>
>>>>>>        Suggestions
>>>>>>
>>>>>> This is a list of /all /suggestions for reducing run/total times that
>>>>>> I have seen recently (in other words, they aren't necessarily mine
>>>>>> nor may I agree with all of them).
>>>>>>
>>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>>      * We've seen significant speedups in the blink planner, and this
>>>>>>        should be applicable for all modules. However, I presume
>>> there's
>>>>>>        a reason why we disabled JVM reuse (information on this would
>>> be
>>>>>>        appreciated)
>>>>>> 2. Custom differential build scripts
>>>>>>      * Setup custom scripts for determining which modules might be
>>>>>>        affected by change, and manipulate the splits accordingly. This
>>>>>>        approach is conceptually quite straight-forward, but has limits
>>>>>>        since it has to be pessimistic; i.e. a change in flink-core
>>>>>>        _must_ result in testing all modules.
>>>>>> 3. Only run smoke tests when PR is opened, run heavy tests on demand.
>>>>>>      * With the introduction of the ci-bot we now have significantly
>>>>>>        more options on how to handle PR builds. One option could be to
>>>>>>        only run basic tests when the PR is created (which may be only
>>>>>>        modified modules, or all unit tests, or another low-cost
>>>>>>        scheme), and then have a committer trigger other builds (full
>>>>>>        test run, e2e tests, etc...) on demand.
>>>>>> 4. Move more tests into cron builds
>>>>>>      * The budget version of 3); move certain tests that are either
>>>>>>        expensive (like some runtime tests that take minutes) or in
>>>>>>        rarely modified modules (like gelly) into cron jobs.
>>>>>> 5. Gradle
>>>>>>      * Gradle was brought up a few times for it's built-in support for
>>>>>>        differential builds; basically providing 2) without the
>>> overhead
>>>>>>        of maintaining additional scripts.
>>>>>>      * To date no PoC was provided that shows it working in our CI
>>>>>>        environment (i.e., handling splits & caching etc).
>>>>>>      * This is the most disruptive change by a fair margin, as it
>>> would
>>>>>>        affect the entire project, developers and potentially users (f
>>>>>>        they build from source).
>>>>>> 6. CI service
>>>>>>      * Our current artifact caching setup on Travis is basically a
>>>>>>        hack; we're basically abusing the Travis cache, which is meant
>>>>>>        for long-term caching, to ship build artifacts across jobs.
>>> It's
>>>>>>        brittle at times due to timing/visibility issues and on
>>> branches
>>>>>>        the cleanup processes can interfere with running builds. It is
>>>>>>        also not as effective as it could be.
>>>>>>      * There are CI services that provide build artifact caching out
>>> of
>>>>>>        the box, which could be useful for us.
>>>>>>      * To date, no PoC for using another CI service has been provided.
>>>>>>
>>>>>>
>>>>>
>>>>
>


Re: [DISCUSS] Reducing build times

Robert Metzger
Hi all,

I wanted to understand the impact of the hardware we are using for running
our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1].
They are using Google Cloud Compute Engine *n1-standard-2* instances.
Running a full "mvn clean verify" takes *03:32 h* on such a machine type.

Running the same workload on a machine with 32 virtual cores and 64 GB of
memory takes *1:21 h*.

What is interesting are the per-module build time differences.
Modules which parallelize their tests well benefit greatly from the
additional cores (2 cores vs. 32 cores):
"flink-tests" 36:51 min vs 4:33 min
"flink-runtime" 23:41 min vs 3:47 min
"flink-table-planner" 15:54 min vs 3:13 min

On the other hand, we have modules which are not parallel at all:
"flink-connector-kafka": 16:32 min vs 15:19 min
"flink-connector-kafka-0.11": 9:52 min vs 7:46 min
Also, the checkstyle plugin is not scaling at all.

Chesnay reported some significant speedups by reusing forks.
I don't know how much effort it would be to make the Kafka tests
parallelizable. In total, they currently use 30 minutes on the big machine
(while 31 CPUs are idling :) )
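
(For reference, the relevant knobs are the surefire/failsafe fork settings;
below is a simplified sketch, not the actual Flink configuration, of reusing
forks and scaling the fork count with the available cores.)

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- spawn one fork per available CPU core ("1C") instead of a single fork -->
    <forkCount>1C</forkCount>
    <!-- keep forked JVMs alive across test classes instead of forking anew for each class -->
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>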

Let me know what you think about these results. If the community is
generally interested in investigating further in that direction, I could
look into software to orchestrate this, as well as sponsors for such an
infrastructure.

[1] https://docs.travis-ci.com/user/reference/overview/


On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]> wrote:

> @Aljoscha Shading takes a few minutes for a full build; you can see this
> quite easily by looking at the compile step in the misc profile
> <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that
> longer than a fraction of a section are usually caused by shading lots
> of classes. Note that I cannot tell you how much of this is spent on
> relocations, and how much on writing the jar.
>
> Personally, I'd very much like us to move all shading to flink-shaded;
> this would finally allows us to use newer maven versions without needing
> cumbersome workarounds for flink-dist. However, this isn't a trivial
> affair in some cases; IIRC calcite could be difficult to handle.
>
> On another note, this would also simplify switching the main repo to
> another build system, since you would no longer had to deal with
> relocations, just packaging + merging NOTICE files.
>
> @BowenLi I disagree, flink-shaded does not include any tests,  API
> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> and flink-dist, where both relocate dependencies and one is bundled by
> the other), and, most importantly, CI (and really, without CI being
> covered in a PoC there's nothing to discuss).
>
> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> > Speaking of flink-shaded, do we have any idea what the impact of shading
> is on the build time? We could get rid of shading completely in the Flink
> main repository by moving everything that we shade to flink-shaded.
> >
> > Aljoscha
> >
> >> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
> >>
> >> +1 to Till's points on #2 and #5, especially the potential
> non-disruptive,
> >> gradual migration approach if we decide to go that route.
> >>
> >> To add on, I want to point it out that we can actually start with
> >> flink-shaded project [1] which is a perfect candidate for PoC. It's of
> much
> >> smaller size, totally isolated from and not interfered with flink
> project
> >> [2], and it actually covers most of our practical feature requirements
> for
> >> a build tool - all making it an ideal experimental field.
> >>
> >> [1] https://github.com/apache/flink-shaded
> >> [2] https://github.com/apache/flink
> >>
> >>
> >> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]>
> wrote:
> >>
> >>> For the sake of keeping the discussion focused and not cluttering the
> >>> discussion thread I would suggest to split the detailed reporting for
> >>> reusing JVMs to a separate thread and cross linking it from here.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]>
> >>> wrote:
> >>>
> >>>> Update:
> >>>>
> >>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
> >>>> away, while flink-tests has the potential for huge savings, but we
> have
> >>>> to figure out some issues first.
> >>>>
> >>>>
> >>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>
> >>>> 4/8 profiles failed.
> >>>>
> >>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
> >>>> libraries (table-planner).
> >>>>
> >>>> The kafka and connectors profiles both fail in kafka tests due to
> >>>> producer leaks, and no speed up could be confirmed so far:
> >>>>
> >>>> java.lang.AssertionError: Detected producer leak. Thread name:
> >>>> kafka-producer-network-thread | producer-239
> >>>>         at org.junit.Assert.fail(Assert.java:88)
> >>>>         at
> >>>>
> >>>
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> >>>>         at
> >>>>
> >>>
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >>>>
> >>>> The tests profile failed due to various errors in migration tests:
> >>>>
> >>>> junit.framework.AssertionFailedError: Did not see the expected
> >>> accumulator
> >>>> results within time limit.
> >>>>         at
> >>>>
> >>>
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >>>> *However*, a normal tests run takes 40 minutes, while this one above
> >>>> failed after 19 minutes and is only missing the migration tests (which
> >>>> currently need 6-7 minutes). So we could save somewhere between 15 to
> 20
> >>>> minutes here.
> >>>>
> >>>>
> >>>> Finally, the misc profiles fails in YARN:
> >>>>
> >>>> java.lang.AssertionError
> >>>>         at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >>>>
> >>>> No significant speedup could be observed in other modules; for
> >>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
> >>>>
> >>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> >>>>> There appears to be a general agreement that 1) should be looked
> into;
> >>>>> I've setup a branch with fork reuse being enabled for all tests; will
> >>>>> report back the results.
> >>>>>
> >>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
> >>>>>> Hello everyone,
> >>>>>>
> >>>>>> improving our build times is a hot topic at the moment so let's
> >>>>>> discuss the different ways how they could be reduced.
> >>>>>>
> >>>>>>
> >>>>>>        Current state:
> >>>>>>
> >>>>>> First up, let's look at some numbers:
> >>>>>>
> >>>>>> 1 full build currently consumes 5h of build time total ("total
> >>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
> >>>>>> complete from start to finish. The run time may fluctuate of course
> >>>>>> depending on the current Travis load. This applies both to builds on
> >>>>>> the Apache and flink-ci Travis.
> >>>>>>
> >>>>>> At the time of writing, the current queue time for PR jobs
> (reminder:
> >>>>>> running on flink-ci) is about 30 minutes (which basically means that
> >>>>>> we are processing builds at the rate that they come in), however we
> >>>>>> are in an admittedly quiet period right now.
> >>>>>> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> >>>>>> everyone was scrambling to get their changes merged in time for the
> >>>>>> feature freeze.
> >>>>>>
> >>>>>> (Note: Recently optimizations where added to ci-bot where pending
> >>>>>> builds are canceled if a new commit was pushed to the PR or the PR
> >>>>>> was closed, which should prove especially useful during the rush
> >>>>>> hours we see before feature-freezes.)
> >>>>>>
> >>>>>>
> >>>>>>        Past approaches
> >>>>>>
> >>>>>> Over the years we have done rather few things to improve this
> >>>>>> situation (hence our current predicament).
> >>>>>>
> >>>>>> Beyond the sporadic speedup of some tests, the only notable
> reduction
> >>>>>> in total build times was the introduction of cron jobs, which
> >>>>>> consolidated the per-commit matrix from 4 configurations (different
> >>>>>> scala/hadoop versions) to 1.
> >>>>>>
> >>>>>> The separation into multiple build profiles was only a work-around
> >>>>>> for the 50m limit on Travis. Running tests in parallel has the
> >>>>>> obvious potential of reducing run time, but we're currently hitting
> a
> >>>>>> hard limit since a few modules (flink-tests, flink-runtime,
> >>>>>> flink-table-planner-blink) are so loaded with tests that they nearly
> >>>>>> consume an entire profile by themselves (and thus no further
> >>>>>> splitting is possible).
> >>>>>>
> >>>>>> The rework that introduced stages, at the time of introduction, did
> >>>>>> also not provide a speed up, although this changed slightly once
> more
> >>>>>> profiles were added and some optimizations to the caching have been
> >>>>>> made.
> >>>>>>
> >>>>>> Very recently we modified the surefire-plugin configuration for
> >>>>>> flink-table-planner-blink to reuse JVM forks for IT cases, providing
> >>>>>> a significant speedup (18 minutes!). So far we have not seen any
> >>>>>> negative consequences.
> >>>>>>
> >>>>>>
> >>>>>>        Suggestions
> >>>>>>
> >>>>>> This is a list of /all /suggestions for reducing run/total times
> that
> >>>>>> I have seen recently (in other words, they aren't necessarily mine
> >>>>>> nor may I agree with all of them).
> >>>>>>
> >>>>>> 1. Enable JVM reuse for IT cases in more modules.
> >>>>>>      * We've seen significant speedups in the blink planner, and
> this
> >>>>>>        should be applicable for all modules. However, I presume
> >>> there's
> >>>>>>        a reason why we disabled JVM reuse (information on this would
> >>> be
> >>>>>>        appreciated)
> >>>>>> 2. Custom differential build scripts
> >>>>>>      * Setup custom scripts for determining which modules might be
> >>>>>>        affected by change, and manipulate the splits accordingly.
> This
> >>>>>>        approach is conceptually quite straight-forward, but has
> limits
> >>>>>>        since it has to be pessimistic; i.e. a change in flink-core
> >>>>>>        _must_ result in testing all modules.
> >>>>>> 3. Only run smoke tests when PR is opened, run heavy tests on
> demand.
> >>>>>>      * With the introduction of the ci-bot we now have significantly
> >>>>>>        more options on how to handle PR builds. One option could be
> to
> >>>>>>        only run basic tests when the PR is created (which may be
> only
> >>>>>>        modified modules, or all unit tests, or another low-cost
> >>>>>>        scheme), and then have a committer trigger other builds (full
> >>>>>>        test run, e2e tests, etc...) on demand.
> >>>>>> 4. Move more tests into cron builds
> >>>>>>      * The budget version of 3); move certain tests that are either
> >>>>>>        expensive (like some runtime tests that take minutes) or in
> >>>>>>        rarely modified modules (like gelly) into cron jobs.
> >>>>>> 5. Gradle
> >>>>>>      * Gradle was brought up a few times for it's built-in support
> for
> >>>>>>        differential builds; basically providing 2) without the
> >>> overhead
> >>>>>>        of maintaining additional scripts.
> >>>>>>      * To date no PoC was provided that shows it working in our CI
> >>>>>>        environment (i.e., handling splits & caching etc).
> >>>>>>      * This is the most disruptive change by a fair margin, as it
> >>> would
> >>>>>>        affect the entire project, developers and potentially users
> (f
> >>>>>>        they build from source).
> >>>>>> 6. CI service
> >>>>>>      * Our current artifact caching setup on Travis is basically a
> >>>>>>        hack; we're basically abusing the Travis cache, which is
> meant
> >>>>>>        for long-term caching, to ship build artifacts across jobs.
> >>> It's
> >>>>>>        brittle at times due to timing/visibility issues and on
> >>> branches
> >>>>>>        the cleanup processes can interfere with running builds. It
> is
> >>>>>>        also not as effective as it could be.
> >>>>>>      * There are CI services that provide build artifact caching out
> >>> of
> >>>>>>        the box, which could be useful for us.
> >>>>>>      * To date, no PoC for using another CI service has been
> provided.
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >
>
>

Re: [DISCUSS] Reducing build times

Aljoscha Krettek-2
I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 minutes. After removing all mentions of maven-shade-plugin the build time goes down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, because some expected artifacts are not packaged, and most of the end-to-end tests use the shade plugin to package the jars for testing.)

Aljoscha

> On 18. Aug 2019, at 19:52, Robert Metzger <[hidden email]> wrote:
>
> Hi all,
>
> I wanted to understand the impact of the hardware we are using for running
> our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> They are using Google Cloud Compute Engine *n1-standard-2* instances.
> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
>
> Running the same workload on a 32 virtual cores, 64 gb machine, takes *1:21
> h*.
>
> What is interesting are the per-module build time differences.
> Modules which are parallelizing tests well greatly benefit from the
> additional cores:
> "flink-tests" 36:51 min vs 4:33 min
> "flink-runtime" 23:41 min vs 3:47 min
> "flink-table-planner" 15:54 min vs 3:13 min
>
> On the other hand, we have modules which are not parallel at all:
> "flink-connector-kafka": 16:32 min vs 15:19 min
> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> Also, the checkstyle plugin is not scaling at all.
>
> Chesnay reported some significant speedups by reusing forks.
> I don't know how much effort it would be to make the Kafka tests
> parallelizable. In total, they currently use 30 minutes on the big machine
> (while 31 CPUs are idling :) )
>
> Let me know what you think about these results. If the community is
> generally interested in further investigating into that direction, I could
> look into software to orchestrate this, as well as sponsors for such an
> infrastructure.
>
> [1] https://docs.travis-ci.com/user/reference/overview/
>
>
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]> wrote:
>
>> @Aljoscha Shading takes a few minutes for a full build; you can see this
>> quite easily by looking at the compile step in the misc profile
>> <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that
>> longer than a fraction of a section are usually caused by shading lots
>> of classes. Note that I cannot tell you how much of this is spent on
>> relocations, and how much on writing the jar.
>>
>> Personally, I'd very much like us to move all shading to flink-shaded;
>> this would finally allows us to use newer maven versions without needing
>> cumbersome workarounds for flink-dist. However, this isn't a trivial
>> affair in some cases; IIRC calcite could be difficult to handle.
>>
>> On another note, this would also simplify switching the main repo to
>> another build system, since you would no longer had to deal with
>> relocations, just packaging + merging NOTICE files.
>>
>> @BowenLi I disagree, flink-shaded does not include any tests,  API
>> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
>> and flink-dist, where both relocate dependencies and one is bundled by
>> the other), and, most importantly, CI (and really, without CI being
>> covered in a PoC there's nothing to discuss).
>>
>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>> Speaking of flink-shaded, do we have any idea what the impact of shading
>> is on the build time? We could get rid of shading completely in the Flink
>> main repository by moving everything that we shade to flink-shaded.
>>>
>>> Aljoscha
>>>
>>>> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
>>>>
>>>> +1 to Till's points on #2 and #5, especially the potential
>> non-disruptive,
>>>> gradual migration approach if we decide to go that route.
>>>>
>>>> To add on, I want to point it out that we can actually start with
>>>> flink-shaded project [1] which is a perfect candidate for PoC. It's of
>> much
>>>> smaller size, totally isolated from and not interfered with flink
>> project
>>>> [2], and it actually covers most of our practical feature requirements
>> for
>>>> a build tool - all making it an ideal experimental field.
>>>>
>>>> [1] https://github.com/apache/flink-shaded
>>>> [2] https://github.com/apache/flink
>>>>
>>>>
>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]>
>> wrote:
>>>>
>>>>> For the sake of keeping the discussion focused and not cluttering the
>>>>> discussion thread I would suggest to split the detailed reporting for
>>>>> reusing JVMs to a separate thread and cross linking it from here.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> Update:
>>>>>>
>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>>>>>> away, while flink-tests has the potential for huge savings, but we
>> have
>>>>>> to figure out some issues first.
>>>>>>
>>>>>>
>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>>>>>>
>>>>>> 4/8 profiles failed.
>>>>>>
>>>>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
>>>>>> libraries (table-planner).
>>>>>>
>>>>>> The kafka and connectors profiles both fail in kafka tests due to
>>>>>> producer leaks, and no speed up could be confirmed so far:
>>>>>>
>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
>>>>>> kafka-producer-network-thread | producer-239
>>>>>>        at org.junit.Assert.fail(Assert.java:88)
>>>>>>        at
>>>>>>
>>>>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
>>>>>>        at
>>>>>>
>>>>>
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
>>>>>>
>>>>>> The tests profile failed due to various errors in migration tests:
>>>>>>
>>>>>> junit.framework.AssertionFailedError: Did not see the expected
>>>>> accumulator
>>>>>> results within time limit.
>>>>>>        at
>>>>>>
>>>>>
>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
>>>>>> *However*, a normal tests run takes 40 minutes, while this one above
>>>>>> failed after 19 minutes and is only missing the migration tests (which
>>>>>> currently need 6-7 minutes). So we could save somewhere between 15 to
>> 20
>>>>>> minutes here.
>>>>>>
>>>>>>
>>>>>> Finally, the misc profiles fails in YARN:
>>>>>>
>>>>>> java.lang.AssertionError
>>>>>>        at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
>>>>>>
>>>>>> No significant speedup could be observed in other modules; for
>>>>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
>>>>>>
>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
>>>>>>> There appears to be a general agreement that 1) should be looked
>> into;
>>>>>>> I've setup a branch with fork reuse being enabled for all tests; will
>>>>>>> report back the results.
>>>>>>>
>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> improving our build times is a hot topic at the moment so let's
>>>>>>>> discuss the different ways how they could be reduced.
>>>>>>>>
>>>>>>>>
>>>>>>>>       Current state:
>>>>>>>>
>>>>>>>> First up, let's look at some numbers:
>>>>>>>>
>>>>>>>> 1 full build currently consumes 5h of build time total ("total
>>>>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
>>>>>>>> complete from start to finish. The run time may fluctuate of course
>>>>>>>> depending on the current Travis load. This applies both to builds on
>>>>>>>> the Apache and flink-ci Travis.
>>>>>>>>
>>>>>>>> At the time of writing, the current queue time for PR jobs
>> (reminder:
>>>>>>>> running on flink-ci) is about 30 minutes (which basically means that
>>>>>>>> we are processing builds at the rate that they come in), however we
>>>>>>>> are in an admittedly quiet period right now.
>>>>>>>> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
>>>>>>>> everyone was scrambling to get their changes merged in time for the
>>>>>>>> feature freeze.
>>>>>>>>
>>>>>>>> (Note: Recently optimizations where added to ci-bot where pending
>>>>>>>> builds are canceled if a new commit was pushed to the PR or the PR
>>>>>>>> was closed, which should prove especially useful during the rush
>>>>>>>> hours we see before feature-freezes.)
>>>>>>>>
>>>>>>>>
>>>>>>>>       Past approaches
>>>>>>>>
>>>>>>>> Over the years we have done rather few things to improve this
>>>>>>>> situation (hence our current predicament).
>>>>>>>>
>>>>>>>> Beyond the sporadic speedup of some tests, the only notable
>> reduction
>>>>>>>> in total build times was the introduction of cron jobs, which
>>>>>>>> consolidated the per-commit matrix from 4 configurations (different
>>>>>>>> scala/hadoop versions) to 1.
>>>>>>>>
>>>>>>>> The separation into multiple build profiles was only a work-around
>>>>>>>> for the 50m limit on Travis. Running tests in parallel has the
>>>>>>>> obvious potential of reducing run time, but we're currently hitting
>> a
>>>>>>>> hard limit since a few modules (flink-tests, flink-runtime,
>>>>>>>> flink-table-planner-blink) are so loaded with tests that they nearly
>>>>>>>> consume an entire profile by themselves (and thus no further
>>>>>>>> splitting is possible).
>>>>>>>>
>>>>>>>> The rework that introduced stages, at the time of introduction, did
>>>>>>>> also not provide a speed up, although this changed slightly once
>> more
>>>>>>>> profiles were added and some optimizations to the caching have been
>>>>>>>> made.
>>>>>>>>
>>>>>>>> Very recently we modified the surefire-plugin configuration for
>>>>>>>> flink-table-planner-blink to reuse JVM forks for IT cases, providing
>>>>>>>> a significant speedup (18 minutes!). So far we have not seen any
>>>>>>>> negative consequences.
>>>>>>>>
>>>>>>>>
>>>>>>>>       Suggestions
>>>>>>>>
>>>>>>>> This is a list of /all /suggestions for reducing run/total times
>> that
>>>>>>>> I have seen recently (in other words, they aren't necessarily mine
>>>>>>>> nor may I agree with all of them).
>>>>>>>>
>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
>>>>>>>>     * We've seen significant speedups in the blink planner, and
>> this
>>>>>>>>       should be applicable for all modules. However, I presume
>>>>> there's
>>>>>>>>       a reason why we disabled JVM reuse (information on this would
>>>>> be
>>>>>>>>       appreciated)
>>>>>>>> 2. Custom differential build scripts
>>>>>>>>     * Setup custom scripts for determining which modules might be
>>>>>>>>       affected by change, and manipulate the splits accordingly.
>> This
>>>>>>>>       approach is conceptually quite straight-forward, but has
>> limits
>>>>>>>>       since it has to be pessimistic; i.e. a change in flink-core
>>>>>>>>       _must_ result in testing all modules.
>>>>>>>> 3. Only run smoke tests when PR is opened, run heavy tests on
>> demand.
>>>>>>>>     * With the introduction of the ci-bot we now have significantly
>>>>>>>>       more options on how to handle PR builds. One option could be
>> to
>>>>>>>>       only run basic tests when the PR is created (which may be
>> only
>>>>>>>>       modified modules, or all unit tests, or another low-cost
>>>>>>>>       scheme), and then have a committer trigger other builds (full
>>>>>>>>       test run, e2e tests, etc...) on demand.
>>>>>>>> 4. Move more tests into cron builds
>>>>>>>>     * The budget version of 3); move certain tests that are either
>>>>>>>>       expensive (like some runtime tests that take minutes) or in
>>>>>>>>       rarely modified modules (like gelly) into cron jobs.
>>>>>>>> 5. Gradle
>>>>>>>>     * Gradle was brought up a few times for it's built-in support
>> for
>>>>>>>>       differential builds; basically providing 2) without the
>>>>> overhead
>>>>>>>>       of maintaining additional scripts.
>>>>>>>>     * To date no PoC was provided that shows it working in our CI
>>>>>>>>       environment (i.e., handling splits & caching etc).
>>>>>>>>     * This is the most disruptive change by a fair margin, as it
>>>>> would
>>>>>>>>       affect the entire project, developers and potentially users
>> (f
>>>>>>>>       they build from source).
>>>>>>>> 6. CI service
>>>>>>>>     * Our current artifact caching setup on Travis is basically a
>>>>>>>>       hack; we're basically abusing the Travis cache, which is
>> meant
>>>>>>>>       for long-term caching, to ship build artifacts across jobs.
>>>>> It's
>>>>>>>>       brittle at times due to timing/visibility issues and on
>>>>> branches
>>>>>>>>       the cleanup processes can interfere with running builds. It
>> is
>>>>>>>>       also not as effective as it could be.
>>>>>>>>     * There are CI services that provide build artifact caching out
>>>>> of
>>>>>>>>       the box, which could be useful for us.
>>>>>>>>     * To date, no PoC for using another CI service has been
>> provided.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>
>>


Re: [DISCUSS] Reducing build times

Robert Metzger
Hi all,

I have summarized all arguments mentioned so far + some additional research
into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pros and cons for the different options.

My opinion after looking at the options:

   - Flink relies on an outdated build tool (Maven), while a good
   alternative (Gradle) is well-established and will likely provide a much
   better CI and local build experience through incremental builds and cached
   intermediates.
   Scripting around Maven, or splitting modules / test execution /
   repositories, won't solve this problem. We should rather spend the effort
   on migrating to a modern build tool which will provide us benefits in the
   long run.
   - Flink relies on a fairly slow build service (Travis CI), while simply
   spending more money on the problem could cut the build time at least in
   half.
   We should consider using a build service that provides bigger machines
   to solve our build time problem.

My opinion is based on many assumptions (Gradle is actually as fast as
promised, though I haven't used it before; we can build Flink with Gradle;
we find sponsors for bigger build machines) that we need to test first
through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <[hidden email]>
wrote:

> I did a quick test: a normal "mvn clean install -DskipTests
> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo” on my machine
> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> Flink won’t work, because some expected stuff is not packaged and most of
> the end-to-end tests use the shade plugin to package the jars for testing.
>
> Aljoscha
>
> > On 18. Aug 2019, at 19:52, Robert Metzger <[hidden email]> wrote:
> >
> > Hi all,
> >
> > I wanted to understand the impact of the hardware we are using for
> running
> > our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> > They are using Google Cloud Compute Engine *n1-standard-2* instances.
> > Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> >
> > Running the same workload on a 32 virtual cores, 64 gb machine, takes
> *1:21
> > h*.
> >
> > What is interesting are the per-module build time differences.
> > Modules which are parallelizing tests well greatly benefit from the
> > additional cores:
> > "flink-tests" 36:51 min vs 4:33 min
> > "flink-runtime" 23:41 min vs 3:47 min
> > "flink-table-planner" 15:54 min vs 3:13 min
> >
> > On the other hand, we have modules which are not parallel at all:
> > "flink-connector-kafka": 16:32 min vs 15:19 min
> > "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> > Also, the checkstyle plugin is not scaling at all.
> >
> > Chesnay reported some significant speedups by reusing forks.
> > I don't know how much effort it would be to make the Kafka tests
> > parallelizable. In total, they currently use 30 minutes on the big
> machine
> > (while 31 CPUs are idling :) )
> >
> > Let me know what you think about these results. If the community is
> > generally interested in further investigating into that direction, I
> could
> > look into software to orchestrate this, as well as sponsors for such an
> > infrastructure.
> >
> > [1] https://docs.travis-ci.com/user/reference/overview/
> >
> >
> > On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]>
> wrote:
> >
> >> @Aljoscha Shading takes a few minutes for a full build; you can see this
> >> quite easily by looking at the compile step in the misc profile
> >> <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that
> >> longer than a fraction of a section are usually caused by shading lots
> >> of classes. Note that I cannot tell you how much of this is spent on
> >> relocations, and how much on writing the jar.
> >>
> >> Personally, I'd very much like us to move all shading to flink-shaded;
> >> this would finally allows us to use newer maven versions without needing
> >> cumbersome workarounds for flink-dist. However, this isn't a trivial
> >> affair in some cases; IIRC calcite could be difficult to handle.
> >>
> >> On another note, this would also simplify switching the main repo to
> >> another build system, since you would no longer had to deal with
> >> relocations, just packaging + merging NOTICE files.
> >>
> >> @BowenLi I disagree, flink-shaded does not include any tests,  API
> >> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> >> and flink-dist, where both relocate dependencies and one is bundled by
> >> the other), and, most importantly, CI (and really, without CI being
> >> covered in a PoC there's nothing to discuss).
> >>
> >> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>> Speaking of flink-shaded, do we have any idea what the impact of
> shading
> >> is on the build time? We could get rid of shading completely in the
> Flink
> >> main repository by moving everything that we shade to flink-shaded.
> >>>
> >>> Aljoscha
> >>>
> >>>> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
> >>>>
> >>>> +1 to Till's points on #2 and #5, especially the potential
> >> non-disruptive,
> >>>> gradual migration approach if we decide to go that route.
> >>>>
> >>>> To add on, I want to point it out that we can actually start with
> >>>> flink-shaded project [1] which is a perfect candidate for PoC. It's of
> >> much
> >>>> smaller size, totally isolated from and not interfered with flink
> >> project
> >>>> [2], and it actually covers most of our practical feature requirements
> >> for
> >>>> a build tool - all making it an ideal experimental field.
> >>>>
> >>>> [1] https://github.com/apache/flink-shaded
> >>>> [2] https://github.com/apache/flink
> >>>>
> >>>>
> >>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <[hidden email]>
> >> wrote:
> >>>>
> >>>>> For the sake of keeping the discussion focused and not cluttering the
> >>>>> discussion thread I would suggest to split the detailed reporting for
> >>>>> reusing JVMs to a separate thread and cross linking it from here.
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <[hidden email]
> >
> >>>>> wrote:
> >>>>>
> >>>>>> Update:
> >>>>>>
> >>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse
> right
> >>>>>> away, while flink-tests has the potential for huge savings, but we
> >> have
> >>>>>> to figure out some issues first.
> >>>>>>
> >>>>>>
> >>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>>>
> >>>>>> 4/8 profiles failed.
> >>>>>>
> >>>>>> No speedup in libraries, python, blink_planner, 7 minutes saved in
> >>>>>> libraries (table-planner).
> >>>>>>
> >>>>>> The kafka and connectors profiles both fail in kafka tests due to
> >>>>>> producer leaks, and no speed up could be confirmed so far:
> >>>>>>
> >>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
> >>>>>> kafka-producer-network-thread | producer-239
> >>>>>>        at org.junit.Assert.fail(Assert.java:88)
> >>>>>>        at
> >>>>>>
> >>>>>
> >>
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> >>>>>>        at
> >>>>>>
> >>>>>
> >>
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >>>>>>
> >>>>>> The tests profile failed due to various errors in migration tests:
> >>>>>>
> >>>>>> junit.framework.AssertionFailedError: Did not see the expected
> >>>>> accumulator
> >>>>>> results within time limit.
> >>>>>>        at
> >>>>>>
> >>>>>
> >>
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >>>>>> *However*, a normal tests run takes 40 minutes, while this one above
> >>>>>> failed after 19 minutes and is only missing the migration tests
> (which
> >>>>>> currently need 6-7 minutes). So we could save somewhere between 15
> to
> >> 20
> >>>>>> minutes here.
> >>>>>>
> >>>>>>
> >>>>>> Finally, the misc profiles fails in YARN:
> >>>>>>
> >>>>>> java.lang.AssertionError
> >>>>>>        at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >>>>>>
> >>>>>> No significant speedup could be observed in other modules; for
> >>>>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
> >>>>>>

Re: [DISCUSS] Reducing build times

Robert Metzger
Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started a Gradle PoC and are currently making
all modules compile and test with Gradle. We've also identified some
problematic areas (shading being the most obvious one) which we will
analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build and to
avoid duplicate work (incremental builds); see the sketch after these two
points for the kind of configuration involved.

- I am working on setting up a Flink testing infrastructure based on Azure
Pipelines, using more powerful hardware. Alibaba kindly provided me with
two 32 core machines (temporarily), and another company reached out to me
privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure Pipelines
for our apache/flink GitHub repository as a build infrastructure that
exists next to Flinkbot and flink-ci. I would like to make sure that Azure
Pipelines is at least as reliable as Travis, and I want to see what the
required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot of
nice options for us to improve the build experience (statistics about
tests such as flaky-test detection, nice Docker support, plenty of free
build resources for open source projects, ...)
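
For illustration, here is a minimal sketch in Gradle's Kotlin DSL of the
kind of settings such a PoC would exercise. The flags and the Shadow plugin
are standard Gradle features, but the concrete values, the version and the
relocation example are assumptions made for the sake of the sketch, not
taken from the actual PoC:

    // gradle.properties -- opt-in flags for parallel module builds and the
    // build cache (avoiding duplicate work across builds)
    // org.gradle.parallel=true
    // org.gradle.caching=true

    // build.gradle.kts (root project)
    plugins {
        // Shadow plugin as a stand-in for the shading/relocation work that
        // Maven does today; the version is picked arbitrarily for the sketch.
        id("com.github.johnrengelman.shadow") version "5.1.0" apply false
    }

    subprojects {
        tasks.withType<Test>().configureEach {
            // Reuse one test JVM per module instead of forking per class
            // (the same idea as the surefire fork-reuse change discussed
            // earlier in this thread).
            forkEvery = 0L
            // Run test classes within a module in parallel.
            maxParallelForks =
                (Runtime.getRuntime().availableProcessors() / 2).coerceAtLeast(1)
        }
    }

    // Per shaded module, a relocation would look roughly like this
    // (package names are made up for illustration):
    // tasks.named<com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar>("shadowJar") {
    //     relocate("com.google.guava", "org.apache.flink.shaded.guava")
    // }

Whether the build cache and incremental compilation actually pay off for
Flink's module graph is exactly what the PoC has to show.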

Best,
Robert






Re: [DISCUSS] Reducing build times

Arvid Heise
+1 for Azure Pipelines; I have had very good experiences with it in the
past, and the open source and payment models are much better.

The upcoming GitHub CI/CD also seems like a promising alternative, but at
first glance it looks like the little brother of Azure Pipelines. So any
effort going into Azure Pipelines will probably carry over in that
direction as well.

Best,

Arvid



--

Arvid Heise | Senior Software Engineer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: [DISCUSS] Reducing build times

Chesnay Schepler-3
In reply to this post by Robert Metzger
Will using more powerful machines for the project make it more difficult
to ensure that contributor builds still run in a reasonable time?

As an example of this happening on Travis, contributors currently cannot
run all e2e tests since they time out, but on apache we have a larger
timeout.


Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Reducing build times

Robert Metzger
Yes, we can ensure the same (or better) experience for contributors.

On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).

Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours for a
build for open source projects. Flink needs 3.5 hours on that infra (not
parallelized at all, no caching). These free machines are very similar to
those of Travis, so I expect no build time regressions if we set it up
similarly.


On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <[hidden email]> wrote:

> Will using more powerful machines for the project make it more difficult to
> ensure that contributor builds are still running in a reasonable time?
>
> As an example of this happening on Travis, contributors currently cannot
> run all e2e tests since they time out, but on apache we have a larger
> timeout.
>
> On 03/09/2019 18:57, Robert Metzger wrote:
> > Hi all,
> >
> > I wanted to give a short update on this:
> > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> > working on making all modules compile and test with Gradle. We've also
> > identified some problematic areas (shading being the most obvious one)
> > which we will analyse as part of the PoC.
> > The goal is to see how much Gradle helps to parallelise our build, and to
> > avoid duplicate work (incremental builds).
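
(As a rough illustration of the Gradle features the PoC is after -- the
invocation is hypothetical, since Flink has no Gradle build yet, but
--parallel and --build-cache are standard Gradle flags:)

    # first run: all modules are compiled and tested, results go into the build cache
    ./gradlew test --parallel --build-cache
    # after touching a single module, up-to-date checks and the cache mean only
    # that module and its dependents are compiled and tested again
    ./gradlew test --parallel --build-cache
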
> >
> > - I am working on setting up a Flink testing infrastructure based on
> Azure
> > Pipelines, using more powerful hardware. Alibaba kindly provided me with
> > two 32 core machines (temporarily), and another company reached out to me
> > privately, looking into options for cheap, fast machines :)
> > If nobody in the community disagrees, I am going to set up Azure
> Pipelines
> > with our apache/flink GitHub as a build infrastructure that exists next
> to
> > Flinkbot and flink-ci. I would like to make sure that Azure Pipelines is
> > equally or even more reliable than Travis, and I want to see what the
> > required maintenance work is.
> > On top of that, Azure Pipelines is a very feature-rich tool with a lot of
> > nice options for us to improve the build experience (statistics about
> tests
> > (flaky tests etc.), nice docker support, plenty of free build resources
> for
> > open source projects, ...)
> >
> > Best,
> > Robert
> >
> >
> >
> >
> >
> > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <[hidden email]>
> wrote:
> >
> >> Hi all,
> >>
> >> I have summarized all arguments mentioned so far + some additional
> >> research into a Wiki page here:
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >>
> >> I'm happy to hear further comments on my summary! I'm pretty sure we can
> >> find more pro's and con's for the different options.
> >>
> >> My opinion after looking at the options:
> >>
> >>     - Flink relies on an outdated build tool (Maven), while a good
> >>     alternative is well-established (Gradle), and will likely provide a much
> >>     better CI and local build experience through incremental builds and
> >>     cached intermediates.
> >>     Scripting around Maven, or splitting modules / test execution /
> >>     repositories won't solve this problem. We should rather spend the
> effort in
> >>     migrating to a modern build tool which will provide us benefits in
> the long
> >>     run.
> >>     - Flink relies on a fairly slow build service (Travis CI), while
> >>     simply putting more money onto the problem could cut the build time
> at
> >>     least in half.
> >>     We should consider using a build service that provides bigger
> machines
> >>     to solve our build time problem.
> >>
> >> My opinion is based on many assumptions (gradle is actually as fast as
> >> promised (haven't used it before), we can build Flink with gradle, we
> find
> >> sponsors for bigger build machines) that we need to test first through
> PoCs.
> >>
> >> Best,
> >> Robert
> >>
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <[hidden email]>
> >> wrote:
> >>
> >>> I did a quick test: a normal "mvn clean install -DskipTests
> >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
> >>> machine
> >>> takes about 14 minutes. After removing all mentions of
> maven-shade-plugin
> >>> the build time goes down to roughly 11.5 minutes. (Obviously the
> resulting
> >>> Flink won’t work, because some expected stuff is not packaged and most
> of
> >>> the end-to-end tests use the shade plugin to package the jars for
> >>> testing.)
> >>>
> >>> Aljoscha
> >>>
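
(A rough way to reproduce such a measurement, assuming a plain grep is
enough to locate the shade-plugin usages that would be commented out for
the comparison run:)

    # time a full build without tests, as in the mail above
    time mvn clean install -DskipTests -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo
    # list the poms that configure the shade plugin and would need disabling
    grep -rl --include=pom.xml maven-shade-plugin .
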
> >>>> On 18. Aug 2019, at 19:52, Robert Metzger <[hidden email]>
> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I wanted to understand the impact of the hardware we are using for
> >>> running
> >>>> our tests. Each Travis worker has 2 virtual cores and 7.5 GB of memory [1].
> >>>> They are using Google Cloud Compute Engine *n1-standard-2* instances.
> >>>> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> >>>> Running the same workload on a machine with 32 virtual cores and 64 GB of
> >>>> memory takes *1:21 h*.
> >>>>
> >>>> What is interesting are the per-module build time differences.
> >>>> Modules which are parallelizing tests well greatly benefit from the
> >>>> additional cores:
> >>>> "flink-tests" 36:51 min vs 4:33 min
> >>>> "flink-runtime" 23:41 min vs 3:47 min
> >>>> "flink-table-planner" 15:54 min vs 3:13 min
> >>>>
> >>>> On the other hand, we have modules which are not parallel at all:
> >>>> "flink-connector-kafka": 16:32 min vs 15:19 min
> >>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> >>>> Also, the checkstyle plugin is not scaling at all.
> >>>>
> >>>> Chesnay reported some significant speedups by reusing forks.
> >>>> I don't know how much effort it would be to make the Kafka tests
> >>>> parallelizable. In total, they currently use 30 minutes on the big
> >>> machine
> >>>> (while 31 CPUs are idling :) )
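
(For reference, surefire already exposes standard knobs for spreading test
classes over several forked JVMs; a hypothetical experiment -- the module
path is assumed, and whether the embedded Kafka brokers tolerate parallel
forks is exactly the open question -- could look like:)

    # build everything once so the connector's snapshot dependencies exist locally
    mvn clean install -DskipTests
    # then run only the Kafka connector tests with one forked JVM per CPU core
    # (forkCount/reuseForks are standard surefire properties)
    mvn verify -pl flink-connectors/flink-connector-kafka -DforkCount=1C -DreuseForks=false
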
> >>>>
> >>>> Let me know what you think about these results. If the community is
> >>>> generally interested in further investigating that direction, I could
> >>>> look into software to orchestrate this, as well as sponsors for such
> an
> >>>> infrastructure.
> >>>>
> >>>> [1] https://docs.travis-ci.com/user/reference/overview/
> >>>>
> >>>>
> >>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <[hidden email]>
> >>> wrote:
> >>>>> @Aljoscha Shading takes a few minutes for a full build; you can see this
> >>>>> quite easily by looking at the compile step in the misc profile
> >>>>> <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that
> >>>>> take longer than a fraction of a second are usually caused by shading
> >>>>> lots of classes. Note that I cannot tell you how much of this is spent on
> >>>>> relocations, and how much on writing the jar.
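
(One quick way to eyeball the per-module times from the linked log,
assuming the compile step ends with Maven's usual reactor summary:)

    # per-module wall clock of the compile step, straight from the reactor summary
    curl -s https://api.travis-ci.org/v3/job/572560060/log.txt | grep "SUCCESS \["
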
> >>>>>
> >>>>> Personally, I'd very much like us to move all shading to flink-shaded;
> >>>>> this would finally allow us to use newer maven versions without needing
> >>>>> cumbersome workarounds for flink-dist. However, this isn't a trivial
> >>>>> affair in some cases; IIRC calcite could be difficult to handle.
> >>>>>
> >>>>> On another note, this would also simplify switching the main repo to
> >>>>> another build system, since you would no longer have to deal with
> >>>>> relocations, just packaging + merging NOTICE files.
> >>>>>
> >>>>> @BowenLi I disagree, flink-shaded does not include any tests,  API
> >>>>> compatibility checks, checkstyle, layered shading (e.g.,
> flink-runtime
> >>>>> and flink-dist, where both relocate dependencies and one is bundled
> by
> >>>>> the other), and, most importantly, CI (and really, without CI being
> >>>>> covered in a PoC there's nothing to discuss).
> >>>>>
> >>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>>>>> Speaking of flink-shaded, do we have any idea what the impact of
> >>> shading
> >>>>> is on the build time? We could get rid of shading completely in the
> >>> Flink
> >>>>> main repository by moving everything that we shade to flink-shaded.
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <[hidden email]> wrote:
> >>>>>>>
> >>>>>>> +1 to Till's points on #2 and #5, especially the potential
> >>>>> non-disruptive,
> >>>>>>> gradual migration approach if we decide to go that route.
> >>>>>>>
> >>>>>>> To add on, I want to point out that we can actually start with the
> >>>>>>> flink-shaded project [1], which is a perfect candidate for a PoC. It's of
> >>>>>>> much smaller size, totally isolated from and not interfering with the
> >>>>>>> flink project [2], and it actually covers most of our practical feature
> >>>>>>> requirements for a build tool - all making it an ideal experimental field.
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/flink-shaded
> >>>>>>> [2] https://github.com/apache/flink
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <
> [hidden email]>
> >>>>> wrote:
> >>>>>>>> For the sake of keeping the discussion focused and not cluttering the
> >>>>>>>> discussion thread, I would suggest splitting the detailed reporting for
> >>>>>>>> reusing JVMs into a separate thread and cross-linking it from here.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <
> >>> [hidden email]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Update:
> >>>>>>>>>
> >>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse
> >>>>>>>>> right away, while flink-tests has the potential for huge savings, but
> >>>>>>>>> we have to figure out some issues first.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>>>>>>
> >>>>>>>>> 4/8 profiles failed.
> >>>>>>>>>
> >>>>>>>>> No speedup in libraries, python, blink_planner, 7 minutes saved
> in
> >>>>>>>>> libraries (table-planner).
> >>>>>>>>>
> >>>>>>>>> The kafka and connectors profiles both fail in kafka tests due to
> >>>>>>>>> producer leaks, and no speed up could be confirmed so far:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
> >>>>>>>>> kafka-producer-network-thread | producer-239
> >>>>>>>>>         at org.junit.Assert.fail(Assert.java:88)
> >>>>>>>>>         at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> >>>>>>>>>         at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >>>>>>>>> The tests profile failed due to various errors in migration
> tests:
> >>>>>>>>>
> >>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected
> >>>>>>>>> accumulator results within time limit.
> >>>>>>>>>         at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >>>>>>>>> *However*, a normal tests run takes 40 minutes, while this one above
> >>>>>>>>> failed after 19 minutes and is only missing the migration tests (which
> >>>>>>>>> currently need 6-7 minutes). So we could save somewhere between 15 and
> >>>>>>>>> 20 minutes here.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Finally, the misc profiles fails in YARN:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError
> >>>>>>>>>         at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >>>>>>>>> No significant speedup could be observed in other modules; for
> >>>>>>>>> flink-yarn-tests we can maybe get a minute or 2 out of it.
> >>>>>>>>>
> >>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> >>>>>>>>>> There appears to be a general agreement that 1) should be looked into;
> >>>>>>>>>> I've set up a branch with fork reuse enabled for all tests and will
> >>>>>>>>>> report back the results.
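
(For context, the knob in question is the surefire/failsafe fork
configuration; a command-line approximation of such a pom change -- the
module path is assumed, and the exact pom setup may differ -- is:)

    # run a module's tests with a single, reused forked JVM instead of a fresh
    # fork per test class (reuseForks/forkCount are standard surefire properties)
    mvn verify -pl flink-table/flink-table-planner-blink -DreuseForks=true -DforkCount=1
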
> >>>>>>>>>>
12